[ https://issues.apache.org/jira/browse/HDFS-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Harsh J resolved HDFS-59.
-------------------------
    Resolution: Not A Problem

This has gone stale. We haven't seen this lately. Let's file a new one if we see this again (these days it errors out with 'Could only replicate to X nodes' kind of errors). Also, it could have been your dfs.replication.min > 1.

> No recovery when trying to replicate on marginal datanode
> ---------------------------------------------------------
>
>                 Key: HDFS-59
>                 URL: https://issues.apache.org/jira/browse/HDFS-59
>             Project: Hadoop HDFS
>          Issue Type: Bug
>         Environment: Sep 14 nightly build with a couple of mapred-related patches
>            Reporter: Christian Kunz
>
> We have been uploading a lot of data to HDFS, running about 400 scripts in parallel calling Hadoop's command-line utility in a distributed fashion. Many of them started to hang when copying large files (>120GB), repeating the following messages without end:
>
> 07/10/05 15:44:25 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:28 INFO fs.DFSClient: Could not complete file, retrying...
>
> In the namenode log I eventually found repeated messages like:
>
> 2007-10-05 14:40:08,063 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_3124504920241431462
> 2007-10-05 14:40:11,876 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask <IP4>:50010 to replicate blk_3124504920241431462 to datanode(s) <IP4_1>:50010
> 2007-10-05 14:45:08,069 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_8533614499490422104
> 2007-10-05 14:45:08,070 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_7741954594593177224
> 2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask <IP4>:50010 to replicate blk_7741954594593177224 to datanode(s) <IP4_2>:50010
> 2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask <IP4>:50010 to replicate blk_8533614499490422104 to datanode(s) <IP4_3>:50010
>
> I could not ssh to the node with IP address <IP4>, but the datanode server seemingly still sent heartbeats. After rebooting the node it was okay again, and a few files and a few clients recovered, but not all.
> I restarted these clients and they completed this time (before noticing the marginal node we had restarted the clients twice without success).
> I would conclude that the existence of the marginal node must have caused loss of blocks, at least in the tracking mechanism, in addition to eternal retries.
> In summary, HDFS should be able to handle datanodes that have a good heartbeat but otherwise fail to do their job. This should include datanodes that have a high rate of socket connection timeouts.
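
For reference on the dfs.replication.min point in the resolution above: whether a client's close() can ever succeed depends on the minimum replication being satisfiable. Below is a minimal sketch of reading the relevant properties through Hadoop's Configuration API; the property names are the ones used on the 0.20/1.x line (newer releases renamed the second one to dfs.namenode.replication.min), and the class name here is just an example.

import org.apache.hadoop.conf.Configuration;

public class CheckMinReplication {
    public static void main(String[] args) {
        // Picks up core-site.xml from the classpath; hdfs-site.xml is added
        // explicitly in case it is not a default resource on this client.
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");

        // Default replication factor for new files (dfs.replication, default 3).
        int replication = conf.getInt("dfs.replication", 3);

        // Minimum number of replicas a block needs before the namenode lets
        // complete()/close() succeed (dfs.replication.min, default 1). If this
        // is > 1 and the cluster cannot place that many replicas, the client
        // keeps printing "Could not complete file, retrying...".
        int minReplication = conf.getInt("dfs.replication.min", 1);

        System.out.println("dfs.replication     = " + replication);
        System.out.println("dfs.replication.min = " + minReplication);
    }
}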
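
On the "PendingReplicationMonitor timed out block ..." warnings in the namenode log: the namenode tracks the replication work it has handed out and re-queues anything not confirmed within a timeout, which is why the same blocks keep being re-asked, potentially of the same unreachable node. The snippet below only illustrates that bookkeeping under simplified, assumed names; it is not the actual FSNamesystem/PendingReplicationBlocks code.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Illustrative model of "pending replication" bookkeeping, not HDFS code.
class PendingReplicationSketch {
    private final Map<Long, Long> requestTimeByBlockId = new HashMap<>();
    private final long timeoutMillis;

    PendingReplicationSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called when the namenode asks a datanode to copy a block somewhere.
    synchronized void markPending(long blockId) {
        requestTimeByBlockId.put(blockId, System.currentTimeMillis());
    }

    // Called when a datanode reports the new replica.
    synchronized void markCompleted(long blockId) {
        requestTimeByBlockId.remove(blockId);
    }

    // Periodic sweep: anything older than the timeout is logged and handed
    // back to the replication scheduler, which may well pick the same
    // marginal node again -- the loop described in the report above.
    synchronized void sweep() {
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<Long, Long>> it = requestTimeByBlockId.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Long> entry = it.next();
            if (now - entry.getValue() > timeoutMillis) {
                System.out.println("timed out block blk_" + entry.getKey());
                it.remove();
            }
        }
    }
}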
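
The closing request, handling datanodes whose heartbeats look healthy but whose transfers keep timing out, amounts to a failure detector keyed on observed transfer behaviour rather than liveness. A hypothetical sketch of that idea follows; none of these class or method names exist in HDFS.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: treat a datanode as "marginal" once it accumulates
// too many socket timeouts, even though its heartbeats still arrive.
class MarginalNodeTracker {
    private final Map<String, AtomicInteger> timeoutsByNode = new ConcurrentHashMap<>();
    private final int maxTimeouts;

    MarginalNodeTracker(int maxTimeouts) {
        this.maxTimeouts = maxTimeouts;
    }

    // Record a failed transfer (e.g. a socket connection timeout) against a node.
    void recordTimeout(String nodeAddress) {
        timeoutsByNode.computeIfAbsent(nodeAddress, k -> new AtomicInteger()).incrementAndGet();
    }

    // Clear the counter after a successful transfer.
    void recordSuccess(String nodeAddress) {
        timeoutsByNode.remove(nodeAddress);
    }

    // A block placement / replication scheduler would skip marginal nodes
    // instead of re-asking them forever.
    boolean isMarginal(String nodeAddress) {
        AtomicInteger count = timeoutsByNode.get(nodeAddress);
        return count != null && count.get() >= maxTimeouts;
    }
}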