No recovery when trying to replicate on marginal datanode
---------------------------------------------------------
                 Key: HADOOP-1998
                 URL: https://issues.apache.org/jira/browse/HADOOP-1998
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.15.0
         Environment: Sep 14 nightly build with a couple of mapred-related patches
            Reporter: Christian Kunz

We have been uploading a lot of data to HDFS, running about 400 scripts in parallel that call Hadoop's command-line utility in a distributed fashion. Many of them started to hang when copying large files (>120 GB), repeating the following messages without end:

07/10/05 15:44:25 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:28 INFO fs.DFSClient: Could not complete file, retrying...

In the namenode log I eventually found repeated messages like:

2007-10-05 14:40:08,063 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_3124504920241431462
2007-10-05 14:40:11,876 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask <IP4>:50010 to replicate blk_3124504920241431462 to datanode(s) <IP4_1>:50010
2007-10-05 14:45:08,069 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_8533614499490422104
2007-10-05 14:45:08,070 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_7741954594593177224
2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask <IP4>:50010 to replicate blk_7741954594593177224 to datanode(s) <IP4_2>:50010
2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask <IP4>:50010 to replicate blk_8533614499490422104 to datanode(s) <IP4_3>:50010

I could not ssh to the node with IP address <IP4>, but the datanode process on it apparently kept sending heartbeats. After rebooting the node it was okay again; a few files and a few clients recovered, but not all. I restarted these clients and they completed this time (before noticing the marginal node we had restarted the clients twice without success).

I would conclude that the existence of the marginal node must have caused a loss of blocks, at least in the tracking mechanism, in addition to the endless retries.

In summary, DFS should be able to handle datanodes that send good heartbeats but otherwise fail to do their job. This should include datanodes that have a high rate of socket connection timeouts.
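For illustration only, here is a minimal sketch of the kind of bookkeeping this suggestion implies: counting replication-request timeouts per datanode on the namenode side and skipping nodes that exceed a threshold when scheduling further replication work. The class and method names (MarginalDatanodeTracker, recordReplicationTimeout, isMarginal) and the thresholds are hypothetical, not actual Hadoop APIs or values from this issue.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper: tracks datanodes that keep heartbeating but whose
 * pending-replication requests repeatedly time out. A node exceeding
 * MAX_TIMEOUTS within WINDOW_MS is treated as "marginal" and skipped when
 * choosing replication sources or targets until the window expires.
 */
public class MarginalDatanodeTracker {
  private static final int MAX_TIMEOUTS = 3;              // assumed threshold
  private static final long WINDOW_MS = 30L * 60L * 1000L; // assumed 30-minute window

  private static class Record {
    final AtomicInteger timeouts = new AtomicInteger();
    volatile long windowStart = System.currentTimeMillis();
  }

  private final Map<String, Record> records = new ConcurrentHashMap<String, Record>();

  /** Called when the pending-replication monitor times out a block assigned to this datanode. */
  public void recordReplicationTimeout(String datanodeId) {
    Record r = records.get(datanodeId);
    if (r == null) {
      Record fresh = new Record();
      Record prev = records.putIfAbsent(datanodeId, fresh);
      r = (prev != null) ? prev : fresh;
    }
    long now = System.currentTimeMillis();
    if (now - r.windowStart > WINDOW_MS) {
      // Observation window expired: give the node a clean slate.
      r.windowStart = now;
      r.timeouts.set(0);
    }
    r.timeouts.incrementAndGet();
  }

  /** True if the node should be skipped when scheduling replication work. */
  public boolean isMarginal(String datanodeId) {
    Record r = records.get(datanodeId);
    if (r == null) {
      return false;
    }
    if (System.currentTimeMillis() - r.windowStart > WINDOW_MS) {
      return false; // stale record; the node gets another chance
    }
    return r.timeouts.get() >= MAX_TIMEOUTS;
  }
}

A real fix would also need to feed socket connection timeouts seen by clients and other datanodes into the same accounting, since in this incident the node looked healthy to the heartbeat mechanism while being unreachable over ssh and unable to serve replication requests.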