[ https://issues.apache.org/jira/browse/HADOOP-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12564701#action_12564701 ]

Christian Kunz commented on HADOOP-2763:
----------------------------------------

7 hours after leaving safemode there are 179 under-replicated blocks, 14 of 
which have only a single replica. I was able to successfully copy 4 critical 
files containing single-replica blocks to the local filesystem.

The question is: why does replication of a block from a datanode fail 
repeatedly when the same block can be copied to the local filesystem?
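
For reference, a minimal sketch of that verification step: reading a file 
that contains a single-replica block out of DFS to the local filesystem 
through the FileSystem API. The paths and class name here are hypothetical, 
not the commands actually used:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem dfs = FileSystem.get(conf);
        // If this succeeds, the datanode holding the single replica can
        // still serve the block to a client, even though inter-datanode
        // replication of the same block keeps failing.
        dfs.copyToLocalFile(new Path("/critical/file"),          // hypothetical DFS path
                            new Path("/tmp/critical-file-copy")); // hypothetical local path
      }
    }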

> Replication Monitor timing out repeatedly
> -----------------------------------------
>
>                 Key: HADOOP-2763
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2763
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.0
>         Environment: Jan 28 nightly build
> With patches 2095, 2119, and 2723
>            Reporter: Christian Kunz
>
> I upgraded a Hadoop installation to the Jan 28 nightly build.
> DFS contains 5+ M files.
> One hour after leaving safemode, fsck reported 5274 under-replicated blocks, 
> 25 of them with only a single replica; 3 hours later it reported 433 
> under-replicated blocks, still with 20 single replicas.
> The namenode log shows the replication monitor timing out repeatedly on the 
> same blocks (see the sketch after this description):
> 2008-02-01 03:41:24,184 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
> NameSystem.pendingTransfer: ask datanode to replicate blk_2984271423661664080 
> to datanode(s) datanode1 datanode2
> 2008-02-01 03:51:14,104 WARN org.apache.hadoop.fs.FSNamesystem: 
> PendingReplicationMonitor timed out block blk_2984271423661664080
> 2008-02-01 03:51:22,303 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
> NameSystem.pendingTransfer: ask datanode to replicate blk_2984271423661664080 
> to datanode(s) datanode3 datanode4
> 2008-02-01 04:01:14,150 WARN org.apache.hadoop.fs.FSNamesystem: 
> PendingReplicationMonitor timed out block blk_2984271423661664080
> 2008-02-01 04:01:19,344 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
> NameSystem.pendingTransfer: ask datanode to replicate blk_2984271423661664080 
> to datanode(s) datanode5 datanode6
> ...
> The datanode seems to be successfully transmitting the blocks:
> 2008-02-01 03:42:06,284 INFO org.apache.hadoop.dfs.DataNode: datanode 
> Starting thread to transfer block blk_2984271423661664080 to datanode1, 
> datanode2
> 2008-02-01 03:42:09,535 INFO org.apache.hadoop.dfs.DataNode: 
> datanode:Transmitted block blk_2984271423661664080 to /datanode1
> 2008-02-01 03:52:06,238 INFO org.apache.hadoop.dfs.DataNode: datanode 
> Starting thread to transfer block blk_2984271423661664080 to 
> datanode3,datanode4
> 2008-02-01 03:52:09,470 INFO org.apache.hadoop.dfs.DataNode: 
> datanode:Transmitted block blk_2984271423661664080 to /datanode3
> The destination datanodes seem to have trouble receiving these blocks (the 
> excerpt below is from a later, separate attempt):
> 2008-02-01 06:43:06,541 INFO org.apache.hadoop.dfs.DataNode: Receiving block 
> blk_2984271423661664080 from /datanode
> 2008-02-01 06:43:09,647 INFO org.apache.hadoop.dfs.DataNode: Exception in 
> receiveBlock for block blk_2984271423661664080 java.net.SocketException: 
> Connection reset
> 2008-02-01 06:43:09,647 INFO org.apache.hadoop.dfs.DataNode: writeBlock 
> blk_2984271423661664080 received exception java.net.SocketException: 
> Connection reset
> But I was able to transfer the block between the same two datanodes 
> successfully using scp.
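
A sketch of the pending-replication bookkeeping implied by the namenode log 
above, under the assumption that the monitor re-queues any block whose new 
replica is not confirmed (e.g. via a blockReceived report) within a fixed 
timeout. The names and the timeout value are assumptions for illustration, 
not the actual FSNamesystem code. The point it makes explicit: a successful 
"Transmitted" on the sender is not enough; the timeout fires because the 
receiver fails with "Connection reset" and never confirms the new replica:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    class PendingReplicationSketch {
      // blockId -> time the replication request was sent to a datanode
      private final Map<Long, Long> pending = new HashMap<Long, Long>();
      private final long timeoutMillis = 10L * 60 * 1000; // assumed 10-minute timeout

      // Called when the namenode asks a datanode to replicate a block.
      synchronized void requested(long blockId) {
        pending.put(blockId, System.currentTimeMillis());
      }

      // Called when a target datanode confirms it received the new replica.
      synchronized void confirmed(long blockId) {
        pending.remove(blockId);
      }

      // Called periodically by the monitor thread. Blocks whose transfer
      // was never confirmed are removed and returned so the namenode can
      // ask a fresh set of targets, producing the repeating pattern of
      // pendingTransfer / "timed out block" messages in the log.
      synchronized List<Long> timedOut() {
        List<Long> expired = new ArrayList<Long>();
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<Long, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
          Map.Entry<Long, Long> e = it.next();
          if (now - e.getValue() > timeoutMillis) {
            expired.add(e.getKey());
            it.remove();
          }
        }
        return expired;
      }
    }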

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
