No recovery when trying to replicate on marginal datanode
---------------------------------------------------------
Key: HADOOP-1998
URL: https://issues.apache.org/jira/browse/HADOOP-1998
Project: Hadoop
Issue Type: Bug
Components: dfs
Affects Versions: 0.15.0
Environment: Sep 14 nightly build with a couple of mapred-related
patches
Reporter: Christian Kunz
We have been uploading a large amount of data to HDFS, running about 400 scripts in
parallel, each calling Hadoop's command-line utility in a distributed fashion. Many of
them started to hang while copying large files (>120GB), repeating the following
messages without end:
07/10/05 15:44:25 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:28 INFO fs.DFSClient: Could not complete file, retrying...
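For context, this behavior is consistent with the client polling the namenode in an
unbounded loop until the file's blocks are minimally replicated. A rough sketch of that
retry pattern (hypothetical names and a stubbed namenode interface, not the actual
DFSClient code):

import java.io.IOException;

// Hedged sketch, not the real DFSClient: the shape of an unbounded
// "complete file" retry loop that matches the client log output above.
class CompleteFileRetry {
    interface NameNodeStub {                 // hypothetical stand-in for the namenode RPC
        boolean complete(String src) throws IOException;
    }

    static void waitForComplete(NameNodeStub namenode, String src)
            throws IOException, InterruptedException {
        while (true) {                       // no retry cap, no overall timeout
            if (namenode.complete(src)) {
                return;                      // all blocks minimally replicated
            }
            System.out.println("Could not complete file, retrying...");
            Thread.sleep(400);               // fixed sleep between attempts
        }
    }
}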
In the namenode log I eventually found repeated messages like:
2007-10-05 14:40:08,063 WARN org.apache.hadoop.fs.FSNamesystem:
PendingReplicationMonitor timed out block blk_3124504920241431462
2007-10-05 14:40:11,876 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.pendingTransfer: ask <IP4>:50010 to replicate blk_3124504920241431462
to datanode(s) <IP4_1>:50010
2007-10-05 14:45:08,069 WARN org.apache.hadoop.fs.FSNamesystem:
PendingReplicationMonitor timed out block blk_8533614499490422104
2007-10-05 14:45:08,070 WARN org.apache.hadoop.fs.FSNamesystem:
PendingReplicationMonitor timed out block blk_7741954594593177224
2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.pendingTransfer: ask <IP4>:50010 to replicate
blk_7741954594593177224 to datanode(s) <IP4_2>:50010
2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.pendingTransfer: ask <IP4>:50010 to replicate
blk_8533614499490422104 to datanode(s) <IP4_3>:50010
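The namenode messages suggest the block sits in the pending-replication list, times out
after about five minutes, and is then simply re-queued and asked again via the same
unresponsive source datanode, so the cycle never converges. A rough sketch of that
timeout/re-ask cycle (hypothetical names and data structures, not the actual
FSNamesystem/PendingReplicationMonitor code):

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hedged sketch, not the real PendingReplicationMonitor: a pending transfer
// that is not confirmed within the timeout is simply re-queued, so a source
// datanode that heartbeats but never transfers keeps the block cycling.
class PendingReplicationSketch {
    static final long TIMEOUT_MS = 5 * 60 * 1000L;       // ~5 minutes, matching the log spacing

    final Map<String, Long> pending = new HashMap<>();    // block id -> time the transfer was requested
    final Queue<String> neededReplications = new ArrayDeque<>();

    void askToReplicate(String blockId, String sourceNode, String targetNode) {
        System.out.println("ask " + sourceNode + " to replicate " + blockId
                + " to datanode(s) " + targetNode);
        pending.put(blockId, System.currentTimeMillis());
    }

    // Periodic monitor pass: timed-out requests go straight back on the
    // needed-replications queue; nothing marks the unresponsive source as suspect.
    void checkTimedOutRequests() {
        long now = System.currentTimeMillis();
        pending.entrySet().removeIf(entry -> {
            if (now - entry.getValue() > TIMEOUT_MS) {
                System.out.println("PendingReplicationMonitor timed out block " + entry.getKey());
                neededReplications.add(entry.getKey());   // re-queued; the cycle repeats
                return true;
            }
            return false;
        });
    }
}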
I could not ssh to the node with IP address <IP4>, but the datanode server apparently
kept sending heartbeats. After rebooting the node it was okay again; a few files and a
few clients recovered, but not all of them.
I restarted the remaining clients and they completed this time (before noticing the
marginal node we had restarted the clients twice without success).
I would conclude that the presence of the marginal node must have caused the loss of
blocks, at least in the tracking mechanism, in addition to the endless retries.
In summary, DFS should be able to handle datanodes that report healthy heartbeats but
otherwise fail to do their job. This should include datanodes with a high rate of
socket connection timeouts.
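One possible direction, purely as illustration and not a proposed patch: keep
per-datanode counts of recent transfer/connection failures on the namenode and stop
selecting nodes whose failure rate crosses a threshold, even while their heartbeats
look healthy. A minimal sketch, with all names and thresholds being assumptions:

import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the idea only: count recent socket/transfer failures per
// datanode and skip nodes above a failure-rate threshold when choosing
// replication sources or targets, even if their heartbeats still arrive.
class MarginalNodeTracker {
    static final double MAX_FAILURE_RATE = 0.5;    // illustrative threshold
    static final int MIN_SAMPLES = 10;             // avoid judging a node on too few attempts

    private final Map<String, int[]> stats = new HashMap<>();  // node -> {failures, attempts}

    synchronized void recordAttempt(String node, boolean failed) {
        int[] s = stats.computeIfAbsent(node, k -> new int[2]);
        if (failed) s[0]++;
        s[1]++;
    }

    // True if the node should be passed over for new replication work.
    synchronized boolean isMarginal(String node) {
        int[] s = stats.get(node);
        if (s == null || s[1] < MIN_SAMPLES) return false;
        return (double) s[0] / s[1] > MAX_FAILURE_RATE;
    }
}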