Replication Monitor timing out repeatedly
-----------------------------------------
Key: HADOOP-2763
URL: https://issues.apache.org/jira/browse/HADOOP-2763
Project: Hadoop Core
Issue Type: Bug
Components: dfs
Affects Versions: 0.16.0
Environment: Jan 28 nightly build
With patches 2095, 2119, and 2723
Reporter: Christian Kunz
I upgraded a Hadoop installation to the Jan 28 nightly build.
The DFS contains more than 5 million files.
One hour after the namenode left safemode, fsck reported 5274 under-replicated
blocks, 25 of them with only a single replica. Three hours later it still
reported 433 under-replicated blocks, 20 of them with only a single replica.
The namenode log shows repeated timeouts of the replication monitor for the
same blocks:
2008-02-01 03:41:24,184 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.pendingTransfer: ask datanode to replicate blk_2984271423661664080
to datanode(s) datanode1 datanode2
2008-02-01 03:51:14,104 WARN org.apache.hadoop.fs.FSNamesystem:
PendingReplicationMonitor timed out block blk_2984271423661664080
2008-02-01 03:51:22,303 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.pendingTransfer: ask datanode to replicate blk_2984271423661664080
to datanode(s) datanode3 datanode4
2008-02-01 04:01:14,150 WARN org.apache.hadoop.fs.FSNamesystem:
PendingReplicationMonitor timed out block blk_2984271423661664080
2008-02-01 04:01:19,344 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.pendingTransfer: ask datanode to replicate blk_2984271423661664080
to datanode(s) datanode5 datanode6
...
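To check how many blocks are affected and how often each one times out, the
WARN lines above can be tallied per block. This is a minimal sketch, not part
of Hadoop: the function name count_timeouts is hypothetical, the pattern is
taken only from the lines quoted above, and it assumes each entry sits on a
single line in the actual namenode log (the entries are wrapped in this email).

```python
import re
from collections import Counter

# Pattern derived from the WARN lines quoted in this report; other Hadoop
# versions may word the message differently (assumption).
TIMEOUT_RE = re.compile(r"PendingReplicationMonitor timed out block (blk_-?\d+)")


def count_timeouts(lines):
    """Return a Counter mapping block ID to its number of timeout warnings."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to count_timeouts and printing the result of
most_common(20) would list the blocks that time out most often.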
The source datanode appears to transmit the blocks successfully:
2008-02-01 03:42:06,284 INFO org.apache.hadoop.dfs.DataNode: datanode Starting
thread to transfer block blk_2984271423661664080 to datanode1, datanode2
2008-02-01 03:42:09,535 INFO org.apache.hadoop.dfs.DataNode:
datanode:Transmitted block blk_2984271423661664080 to /datanode1
2008-02-01 03:52:06,238 INFO org.apache.hadoop.dfs.DataNode: datanode Starting
thread to transfer block blk_2984271423661664080 to datanode3,datanode4
2008-02-01 03:52:09,470 INFO org.apache.hadoop.dfs.DataNode:
datanode:Transmitted block blk_2984271423661664080 to /datanode3
The destination datanodes seem to have problems receiving these blocks (the
excerpt below is from a later, different attempt):
2008-02-01 06:43:06,541 INFO org.apache.hadoop.dfs.DataNode: Receiving block
blk_2984271423661664080 from /datanode
2008-02-01 06:43:09,647 INFO org.apache.hadoop.dfs.DataNode: Exception in
receiveBlock for block blk_2984271423661664080 java.net.SocketException:
Connection reset
2008-02-01 06:43:09,647 INFO org.apache.hadoop.dfs.DataNode: writeBlock
blk_2984271423661664080 received exception java.net.SocketException: Connection
reset
However, I was able to transfer the block between the two datanodes
successfully using scp.
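To see how far into a transfer the connection reset occurs, the timestamps of
the "Receiving block" and "Exception in receiveBlock" entries can be compared.
The sketch below is illustrative only: the helper names parse_ts and
seconds_between are hypothetical, and the timestamp layout is assumed from the
log lines quoted above.

```python
from datetime import datetime

# Timestamp layout as it appears in the quoted log lines,
# e.g. "2008-02-01 06:43:06,541" (assumed to start every entry).
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = len("2008-02-01 06:43:06,541")


def parse_ts(line):
    """Parse the leading timestamp of a log line into a datetime."""
    return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)


def seconds_between(first_line, second_line):
    """Return the elapsed seconds from first_line to second_line."""
    return (parse_ts(second_line) - parse_ts(first_line)).total_seconds()
```

Applied to the receiving and exception entries above, this gives roughly
three seconds between the start of the receive and the reset.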