[
https://issues.apache.org/jira/browse/HADOOP-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532795
]
Christian Kunz commented on HADOOP-1998:
----------------------------------------
About 5% of the clients got stuck.
> No recovery when trying to replicate on marginal datanode
> ---------------------------------------------------------
>
> Key: HADOOP-1998
> URL: https://issues.apache.org/jira/browse/HADOOP-1998
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.15.0
> Environment: Sep 14 nightly build with a couple of mapred-related
> patches
> Reporter: Christian Kunz
>
> We have been uploading a lot of data to HDFS, running about 400 scripts in
> parallel that call Hadoop's command-line utility in a distributed fashion. Many
> of them started to hang when copying large files (>120GB), repeating the
> following messages without end:
> 07/10/05 15:44:25 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:28 INFO fs.DFSClient: Could not complete file, retrying...
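> From the outside this looks like the file-completion call being retried forever.
> A minimal, self-contained sketch of such a loop (Java; the class, interface, and
> retry interval are illustrative assumptions, not the actual DFSClient code):
> import java.io.IOException;
> public class CompleteFileRetrySketch {
>     // Hypothetical stand-in for the namenode RPC used to close a file.
>     interface Namenode {
>         boolean complete(String src, String clientName) throws IOException;
>     }
>     static void completeFile(Namenode namenode, String src, String clientName)
>             throws IOException, InterruptedException {
>         boolean fileComplete = false;
>         while (!fileComplete) {                  // no retry limit, no fallback
>             fileComplete = namenode.complete(src, clientName);
>             if (!fileComplete) {
>                 System.out.println("Could not complete file, retrying...");
>                 Thread.sleep(400);               // retry interval is assumed
>             }
>         }
>     }
> }
> As I understand it, the namenode reports the file complete only once the minimum
> number of replicas for its blocks has been reported, so if a scheduled
> replication never succeeds, a loop like the one above never exits.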
> In the namenode log I eventually found repeated messages like:
> 2007-10-05 14:40:08,063 WARN org.apache.hadoop.fs.FSNamesystem:
> PendingReplicationMonitor timed out block blk_3124504920241431462
> 2007-10-05 14:40:11,876 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.pendingTransfer: ask <IP4>:50010 to replicate
> blk_3124504920241431462 to datanode(s) <IP4_1>:50010
> 2007-10-05 14:45:08,069 WARN org.apache.hadoop.fs.FSNamesystem:
> PendingReplicationMonitor timed out block blk_8533614499490422104
> 2007-10-05 14:45:08,070 WARN org.apache.hadoop.fs.FSNamesystem:
> PendingReplicationMonitor timed out block blk_7741954594593177224
> 2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.pendingTransfer: ask <IP4>:50010 to replicate
> blk_7741954594593177224 to datanode(s) <IP4_2>:50010
> 2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.pendingTransfer: ask <IP4>:50010 to replicate
> blk_8533614499490422104 to datanode(s) <IP4_3>:50010
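> The log suggests the following cycle (a simplified sketch, not the actual
> FSNamesystem code; the roughly 5-minute timeout is inferred from the timestamps
> above):
> import java.util.ArrayDeque;
> import java.util.Deque;
> import java.util.HashMap;
> import java.util.Iterator;
> import java.util.Map;
> public class PendingReplicationSketch {
>     static final long TIMEOUT_MS = 5 * 60 * 1000;       // assumed from the log
>     final Map<String, Long> pending = new HashMap<>();   // blockId -> time scheduled
>     final Deque<String> neededReplications = new ArrayDeque<>();
>     // Periodic monitor: a transfer that was never confirmed times out and the
>     // block goes straight back on the needed-replications queue.
>     void pendingReplicationCheck(long now) {
>         for (Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator(); it.hasNext(); ) {
>             Map.Entry<String, Long> e = it.next();
>             if (now - e.getValue() > TIMEOUT_MS) {
>                 System.out.println("PendingReplicationMonitor timed out block " + e.getKey());
>                 neededReplications.add(e.getKey());
>                 it.remove();
>             }
>         }
>     }
>     // Scheduling: the source is picked from the block's current replica holders;
>     // nothing penalizes a node whose earlier transfers already timed out, so the
>     // same unresponsive datanode is asked again and again.
>     void scheduleReplication(String blockId, String source, String target, long now) {
>         System.out.println("ask " + source + " to replicate " + blockId
>                 + " to datanode(s) " + target);
>         pending.put(blockId, now);
>     }
> }
> If the marginal node holds the only replica, the block just cycles between the
> pending and needed states and never makes progress.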
> I could not ssh to the node with IP address <IP4>, but the datanode server on
> it seemingly still sent heartbeats. After rebooting the node it was okay again,
> and a few files and clients recovered, but not all.
> I restarted these clients and they completed this time (before noticing the
> marginal node, we had restarted the clients twice without success).
> I would conclude that the presence of the marginal node must have caused the
> loss of blocks, at least in the tracking mechanism, in addition to the endless
> retries.
> In summary, DFS should be able to handle datanodes that send heartbeats but
> otherwise fail to do their job, including datanodes with a high rate of socket
> connection timeouts.
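> One possible shape for this (a sketch of a hypothetical mechanism, not an
> existing Hadoop API): track timed-out transfers and connection failures per
> datanode, and stop using a node as a replication source once it crosses a
> threshold, even while its heartbeats keep arriving.
> import java.util.HashMap;
> import java.util.Map;
> public class MarginalDatanodeTracker {
>     private static final int MAX_FAILURES = 5;           // threshold is an assumption
>     private final Map<String, Integer> failures = new HashMap<>();
>     // Called when a transfer scheduled from this datanode times out or its
>     // socket connections keep failing.
>     public void recordFailure(String datanode) {
>         failures.merge(datanode, 1, Integer::sum);
>     }
>     // A completed transfer clears the node again.
>     public void recordSuccess(String datanode) {
>         failures.remove(datanode);
>     }
>     // Replication scheduling would skip marginal nodes (or pick another replica).
>     public boolean isMarginal(String datanode) {
>         return failures.getOrDefault(datanode, 0) >= MAX_FAILURES;
>     }
> }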
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.