[ 
https://issues.apache.org/jira/browse/HADOOP-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raghu Angadi updated HADOOP-1955:
---------------------------------

    Attachment: HADOOP-1955.patch


The latest patch includes a unit test for proper replication with corrupted 
blocks. 

The actual fix for the jira is confined to FSNamesystem.java. The rest of the 
changes are for the unit test.

I reluctantly added two config variables (not public in hadoop-default.xml) to 
make this test stable. One is a timeout for pendingReplication and the other is 
the timeout used by the datanode to delete failed blocks from the tmp directory. 
In the case of the Datanode, an alternative is to delete the tmp files 
immediately when a write fails. There is no reason to keep those failed blocks 
around.
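
A rough sketch of that alternative, just to illustrate the idea. This is not 
the DataNode code and not part of the patch: TmpBlockWriter, BlockSource and 
receiveBlock are hypothetical names, and the directory layout is assumed. The 
point is only that the tmp file is removed in the same code path where the 
write fails, so no timeout-based cleanup is needed.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class TmpBlockWriter {

    /** Hypothetical stand-in for whatever streams (and checksums) the block data. */
    interface BlockSource {
        void writeTo(OutputStream out) throws IOException;
    }

    /**
     * Writes a block into tmpDir and moves it to finalDir on success.
     * If anything fails along the way, the partial tmp file is deleted
     * right away instead of waiting for a later cleanup pass.
     */
    static void receiveBlock(Path tmpDir, Path finalDir, String blockName,
                             BlockSource source) throws IOException {
        Path tmpFile = tmpDir.resolve(blockName);
        boolean succeeded = false;
        try {
            try (OutputStream out = Files.newOutputStream(tmpFile)) {
                source.writeTo(out);   // may throw, e.g. on a checksum mismatch
            }
            Files.move(tmpFile, finalDir.resolve(blockName));
            succeeded = true;
        } finally {
            if (!succeeded) {
                // Immediate cleanup of the failed block.
                Files.deleteIfExists(tmpFile);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmpDir = Files.createTempDirectory("tmp");
        Path finalDir = Files.createTempDirectory("current");
        try {
            // Simulate a receiver that rejects the data after a partial write.
            receiveBlock(tmpDir, finalDir, "blk_1", out -> {
                out.write(1);
                throw new IOException("simulated checksum mismatch");
            });
        } catch (IOException e) {
            System.out.println("write failed: " + e.getMessage());
        }
        System.out.println("tmp file left behind: "
                + Files.exists(tmpDir.resolve("blk_1")));
    }
}
```

With something along these lines the unit test would not need the second 
config variable, since there is no delayed deletion to wait for.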

Dhruba, could you scan through the test and other changes? Thanks.


> Corrupted block replication retries for ever
> --------------------------------------------
>
>                 Key: HADOOP-1955
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1955
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.14.1
>            Reporter: Koji Noguchi
>            Assignee: Raghu Angadi
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1955-branch14.patch, HADOOP-1955.patch, 
> HADOOP-1955.patch
>
>
> When replicating a corrupted block, the receiving side rejects the block due 
> to a checksum error. The Namenode keeps on retrying (with the same source 
> datanode).
> Fsck shows those blocks as under-replicated.
> [Namenode log]
> {noformat} 
> 2007-09-27 02:00:05,273 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
> NameSystem.heartbeatCheck: lost heartbeat from 99.2.99.111
> ...
> 2007-09-27 02:01:02,618 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
> NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate 
> blk_-5925066143536023890 to datanode(s) 99.9.99.37:9999
> 2007-09-27 02:10:03,843 WARN org.apache.hadoop.fs.FSNamesystem: 
> PendingReplicationMonitor timed out block blk_-5925066143536023890
> 2007-09-27 02:10:08,248 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
> NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate 
> blk_-5925066143536023890 to datanode(s) 99.9.99.35:9999
> 2007-09-27 02:20:03,848 WARN org.apache.hadoop.fs.FSNamesystem: 
> PendingReplicationMonitor timed out block blk_-5925066143536023890
> 2007-09-27 02:20:08,646 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
> NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate 
> blk_-5925066143536023890 to datanode(s) 99.9.99.19:9999
> (repeats)
> {noformat} 
> [Datanode(sender) 99.9.99.11 log]
> {noformat} 
> 2007-09-27 02:01:04,493 INFO org.apache.hadoop.dfs.DataNode: Starting thread 
> to transfer block blk_-5925066143536023890 to 
> [Lorg.apache.hadoop.dfs.DatanodeInfo;@e58187
> 2007-09-27 02:01:05,153 WARN org.apache.hadoop.dfs.DataNode: Failed to 
> transfer blk_-5925066143536023890 to 74.6.128.37:50010 got 
> java.net.SocketException: Connection reset
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>   at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>   at java.io.DataOutputStream.write(DataOutputStream.java:90)
>   at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
>   at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
>   at java.lang.Thread.run(Thread.java:619)
> (repeats)
> {noformat} 
> [Datanode(one of the receiver) 99.9.99.37 log]
> {noformat} 
> 2007-09-27 02:01:05,150 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: 
> java.io.IOException: Unexpected checksum mismatch while writing 
> blk_-5925066143536023890 from /74.6.128.33:57605
>   at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:902)
>   at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:727)
>   at java.lang.Thread.run(Thread.java:619)
> {noformat} 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
