[ https://issues.apache.org/jira/browse/HADOOP-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782231#action_12782231 ]

Todd Lipcon commented on HADOOP-5796:
-------------------------------------

Not certain if what I"m seeing is the exact same cause, but I have another 
reproducible case in which the write pipeline recovery decides the first node 
is dead every time, when in actuality it's the last node that's dead. In my 
case, I've set up a 3-node HDFS cluster with replication 3, and each DN having 
one 100G volume and one 2G volume. The 2Gs fill up, throw 
DiskOutOfSpaceExceptions, and the write pipeline recovers incorrectly when the 
node that runs out of space is the last. It first ejects pipeline[0], fails 
again when trying to continue the write on the dead node, ejects the second, 
then tries again writing only to the failed node. Of course that fails too, and 
the whole write is aborted.
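
To make that recovery sequence concrete, here is a minimal, self-contained 
sketch (not the real DFSClient/DataStreamer code; the class and node names are 
hypothetical) of the behavior described above, assuming the client simply 
blames pipeline[0] whenever it cannot attribute a downstream failure:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class PipelineRecoverySketch {

  // Hypothetical stand-in for a datanode in the write pipeline.
  static class FakeDataNode {
    final String name;
    final boolean outOfSpace;   // simulates DiskOutOfSpaceException
    FakeDataNode(String name, boolean outOfSpace) {
      this.name = name;
      this.outOfSpace = outOfSpace;
    }
  }

  // Simulates streaming a packet through the pipeline; returns the index
  // of the node that failed, or -1 on success.
  static int writeThroughPipeline(List<FakeDataNode> pipeline) {
    for (int i = 0; i < pipeline.size(); i++) {
      if (pipeline.get(i).outOfSpace) {
        return i;   // the downstream write fails here
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    List<FakeDataNode> pipeline = new ArrayList<FakeDataNode>();
    pipeline.add(new FakeDataNode("DN1 (100G free)", false));
    pipeline.add(new FakeDataNode("DN2 (100G free)", false));
    pipeline.add(new FakeDataNode("DN3 (2G volume full)", true));

    while (!pipeline.isEmpty()) {
      int failedIndex = writeThroughPipeline(pipeline);
      if (failedIndex < 0) {
        System.out.println("write succeeded");
        return;
      }
      // The bug being described: instead of ejecting pipeline[failedIndex],
      // the client cannot attribute the downstream failure and ejects
      // pipeline[0].
      int blamed = 0;
      System.out.println("failure at " + pipeline.get(failedIndex).name
          + " but ejecting " + pipeline.get(blamed).name);
      pipeline.remove(blamed);
    }
    // Eventually only the genuinely bad node remains, it fails and is
    // ejected too, and the whole write is aborted with a hard error.
    System.out.println("all nodes ejected; write aborted");
  }
}
{code}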

I'll try applying this patch (and thinking it through a bit further) to see if 
it resolves the issue.

> DFS Write pipeline does not detect defective datanode correctly in some cases 
> (HADOOP-3339)
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5796
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5796
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Raghu Angadi
>             Fix For: 0.20.2
>
>         Attachments: toreproduce-5796.patch
>
>
> HDFS write pipeline does not select the correct datanode in some error cases. 
> One example: say DN2 is the second datanode and a write to it times out because 
> it is in a bad state; the pipeline actually removes the first datanode. If such 
> a datanode happens to be the last one in the pipeline, the write is aborted 
> completely with a hard error.
> Essentially, the error occurs when writing to a downstream datanode fails, 
> rather than when reading from it. This bug was actually fixed in 0.18 
> (HADOOP-3339), but HADOOP-1700 essentially reverted it. I am not sure why.
> It is absolutely essential for HDFS to handle failures on a subset of the 
> datanodes in a pipeline. At the very least, we should not have known bugs that 
> lead to hard failures.
> I will attach a patch for a hack that illustrates this problem. I am still 
> thinking about what an automated test for this would look like.
> My preferred target for this fix is 0.20.1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.