No recovery when trying to replicate on marginal datanode
---------------------------------------------------------

                 Key: HADOOP-1998
                 URL: https://issues.apache.org/jira/browse/HADOOP-1998
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.15.0
         Environment: Sep 14 nightly build with a couple of mapred-related 
patches
            Reporter: Christian Kunz


We have been uploading a lot of data to hdfs, running about 400 scripts in 
parallel that call hadoop's command-line utility in a distributed fashion. Many 
of them started to hang when copying large files (>120GB), repeating the 
following message without end:

07/10/05 15:44:25 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:28 INFO fs.DFSClient: Could not complete file, retrying...
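
The loop above never gives up. Purely as an illustration (this is not the 
actual DFSClient code, and the class, method and constant names below are made 
up), a bounded retry with backoff would at least let the client fail instead 
of hanging forever against a marginal datanode:

import java.io.IOException;

// Illustrative sketch only: a capped "complete file" retry loop with backoff.
// tryCompleteFile() stands in for the namenode call and is hypothetical.
public class BoundedCompleteLoop {
    private static final int MAX_COMPLETE_RETRIES = 20;

    // Returns true once the namenode reports the file complete (stubbed here).
    private boolean tryCompleteFile(String src) throws IOException {
        return false;  // placeholder for the real namenode call
    }

    public void completeWithRetries(String src) throws IOException {
        long backoffMs = 400L;
        for (int attempt = 1; attempt <= MAX_COMPLETE_RETRIES; attempt++) {
            if (tryCompleteFile(src)) {
                return;  // all blocks accounted for, file is complete
            }
            System.out.println("Could not complete file " + src
                + ", retrying... (attempt " + attempt + ")");
            try {
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 30000L);  // capped exponential backoff
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IOException("Interrupted while completing " + src);
            }
        }
        // Give up instead of retrying forever against a marginal datanode.
        throw new IOException("Could not complete " + src + " after "
            + MAX_COMPLETE_RETRIES + " retries");
    }
}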

In the namenode log I eventually found repeated messages like:

2007-10-05 14:40:08,063 WARN org.apache.hadoop.fs.FSNamesystem: 
PendingReplicationMonitor timed out block blk_3124504920241431462
2007-10-05 14:40:11,876 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
NameSystem.pendingTransfer: ask <IP4>:50010 to replicate blk_3124504920241431462 
to datanode(s) <IP4_1>:50010
2007-10-05 14:45:08,069 WARN org.apache.hadoop.fs.FSNamesystem: 
PendingReplicationMonitor timed out block blk_8533614499490422104
2007-10-05 14:45:08,070 WARN org.apache.hadoop.fs.FSNamesystem: 
PendingReplicationMonitor timed out block blk_7741954594593177224
2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
NameSystem.pendingTransfer: ask <IP4>:50010 to replicate 
blk_7741954594593177224 to datanode(s) <IP4_2>:50010
2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
NameSystem.pendingTransfer: ask <IP4>:50010 to replicate 
blk_8533614499490422104 to datanode(s) <IP4_3>:50010
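
For context, a rough, hypothetical sketch of the cycle these messages suggest 
(this is not the actual FSNamesystem code; block and datanode identifiers are 
simplified to Strings): each replication request is recorded with a timestamp, 
a monitor times out stale entries and re-queues the block, and the next 
scheduling pass may pick the same unresponsive source node again:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the pending-replication timeout cycle seen in the log.
public class PendingReplicationSketch {
    private static final long TIMEOUT_MS = 5 * 60 * 1000L;  // ~5 min, matches the log spacing

    // blockId -> time the replication request was handed to a datanode
    private final Map<String, Long> pending = new HashMap<String, Long>();

    // Called when the namenode asks a datanode to replicate a block.
    public synchronized void markPending(String blockId) {
        pending.put(blockId, System.currentTimeMillis());
    }

    // Called by a monitor thread: any request not confirmed within the timeout
    // is dropped from 'pending' and re-queued, after which the same marginal
    // source datanode can be chosen again -- the loop visible in the report.
    public synchronized void checkTimeouts(ReplicationQueue neededReplications) {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : new HashMap<String, Long>(pending).entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                System.out.println("PendingReplicationMonitor timed out block " + e.getKey());
                pending.remove(e.getKey());
                neededReplications.add(e.getKey());  // re-queued for another attempt
            }
        }
    }

    // Minimal stand-in for the queue of blocks that still need replication.
    public interface ReplicationQueue {
        void add(String blockId);
    }
}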

I could not ssh to the node with IP address <IP4>, but the datanode server on 
it seemingly still sent heartbeats. After rebooting the node it was okay again, 
and a few files and a few clients recovered, but not all.
I restarted those clients and this time they completed (before noticing the 
marginal node we had restarted the clients twice without success).

I would conclude that the marginal node must have caused a loss of blocks, at 
least in the tracking mechanism, in addition to the endless retries.

In summary, dfs should be able to handle datanodes that send healthy heartbeats 
but otherwise fail to do their job. This should include datanodes that have a 
high rate of socket connection timeouts.
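
One possible direction, sketched below under assumed names (none of this is 
existing dfs code, and the thresholds are arbitrary): have the namenode count 
replication timeouts per source datanode and temporarily skip nodes whose count 
crosses a threshold when choosing replication sources or targets, even while 
their heartbeats still look healthy:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: track per-datanode replication timeouts and treat a
// node as "marginal" once it exceeds a threshold.
public class MarginalNodeTracker {
    private static final int MAX_TIMEOUTS = 3;
    private static final long PENALTY_MS = 30 * 60 * 1000L;  // skip the node for 30 min

    private final Map<String, Integer> timeoutCounts = new HashMap<String, Integer>();
    private final Map<String, Long> penalizedUntil = new HashMap<String, Long>();

    // Called when a replication request to this datanode times out.
    public synchronized void recordTimeout(String datanode) {
        int count = timeoutCounts.containsKey(datanode) ? timeoutCounts.get(datanode) : 0;
        count++;
        timeoutCounts.put(datanode, count);
        if (count >= MAX_TIMEOUTS) {
            penalizedUntil.put(datanode, System.currentTimeMillis() + PENALTY_MS);
            timeoutCounts.put(datanode, 0);  // reset so the node can recover later
        }
    }

    // Replication source/target selection would consult this before picking a node.
    public synchronized boolean isMarginal(String datanode) {
        Long until = penalizedUntil.get(datanode);
        return until != null && System.currentTimeMillis() < until;
    }
}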




