[jira] Created: (HADOOP-2763) Replication Monitor timing out repeatedly

Christian Kunz (JIRA) Thu, 31 Jan 2008 23:07:31 -0800

Replication Monitor timing out repeatedly
-----------------------------------------


                 Key: HADOOP-2763
                 URL: https://issues.apache.org/jira/browse/HADOOP-2763
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.16.0
         Environment: Jan 28 nightly build, with patches 2095, 2119, and 2723
            Reporter: Christian Kunz


I upgraded a Hadoop installation to the Jan 28 nightly build.
The DFS contains more than 5 million files.

One hour after leaving safemode, fsck reported 5274 under-replicated blocks, 
25 of them with only a single replica. Three hours later it reported 433 
under-replicated blocks, 20 of them still with only a single replica.

The namenode log shows repeated timeouts of the replication monitor for the 
same blocks:

2008-02-01 03:41:24,184 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
NameSystem.pendingTransfer: ask datanode to replicate blk_2984271423661664080 
to datanode(s) datanode1 datanode2
2008-02-01 03:51:14,104 WARN org.apache.hadoop.fs.FSNamesystem: 
PendingReplicationMonitor timed out block blk_2984271423661664080
2008-02-01 03:51:22,303 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
NameSystem.pendingTransfer: ask datanode to replicate blk_2984271423661664080 
to datanode(s) datanode3 datanode4
2008-02-01 04:01:14,150 WARN org.apache.hadoop.fs.FSNamesystem: 
PendingReplicationMonitor timed out block blk_2984271423661664080
2008-02-01 04:01:19,344 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
NameSystem.pendingTransfer: ask datanode to replicate blk_2984271423661664080 
to datanode(s) datanode5 datanode6
...
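
To see how many blocks are stuck in this retry loop, the timeout warnings can 
be tallied per block. The following is only a minimal sketch: it assumes each 
log entry sits on a single line in the actual namenode log file (the excerpt 
above is wrapped by the mailer) and that the WARN message has exactly the 
wording quoted above. The function names and the `threshold` parameter are 
hypothetical and not part of Hadoop.

```python
import re
from collections import Counter

# Pattern assumed from the WARN lines quoted in this report; other builds may
# word the message differently.
TIMEOUT_RE = re.compile(r"PendingReplicationMonitor timed out block (blk_-?\d+)")


def count_replication_timeouts(lines):
    """Return a Counter mapping block ID to its number of timeout warnings."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def repeated_timeouts(lines, threshold=2):
    """Return (block ID, count) pairs for blocks that timed out at least
    `threshold` times, most frequent first."""
    counts = count_replication_timeouts(lines)
    return [(blk, n) for blk, n in counts.most_common() if n >= threshold]
```

Passing an open log file to `repeated_timeouts` (a file object iterates line 
by line) would list the blocks that timed out more than once, which are the 
candidates to look up in the datanode logs.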

The source datanode appears to transmit the blocks successfully:

2008-02-01 03:42:06,284 INFO org.apache.hadoop.dfs.DataNode: datanode Starting 
thread to transfer block blk_2984271423661664080 to datanode1,datanode2
2008-02-01 03:42:09,535 INFO org.apache.hadoop.dfs.DataNode: 
datanode:Transmitted block blk_2984271423661664080 to /datanode1

2008-02-01 03:52:06,238 INFO org.apache.hadoop.dfs.DataNode: datanode Starting 
thread to transfer block blk_2984271423661664080 to datanode3,datanode4
2008-02-01 03:52:09,470 INFO org.apache.hadoop.dfs.DataNode: 
datanode:Transmitted block blk_2984271423661664080 to /datanode3


The destination datanodes seem to have problems receiving these blocks (the 
excerpt below is from a later, separate attempt):

2008-02-01 06:43:06,541 INFO org.apache.hadoop.dfs.DataNode: Receiving block 
blk_2984271423661664080 from /datanode
2008-02-01 06:43:09,647 INFO org.apache.hadoop.dfs.DataNode: Exception in 
receiveBlock for block blk_2984271423661664080 java.net.SocketException: 
Connection reset
2008-02-01 06:43:09,647 INFO org.apache.hadoop.dfs.DataNode: writeBlock 
blk_2984271423661664080 received exception java.net.SocketException: Connection 
reset

However, I was able to transfer the block between the two datanodes 
successfully using scp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
