[ https://issues.apache.org/jira/browse/HADOOP-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587830#action_12587830 ]

rangadi edited comment on HADOOP-3234 at 4/10/08 7:30 PM:
---------------------------------------------------------------

I did not get stack traces, but I think this is what happens on the second 
datanode during the second attempt:

- It receives the request for the block with 'isRecovery' set to true. 
- Inside {{FSDataset.writeToBlock()}} it interrupts the main receive thread 
and waits for that thread to exit.
- The main receive thread from the first attempt waits for the {{responder}} 
thread to exit, but does not interrupt it, so both threads block indefinitely.

One fix could be to interrupt {{responder}} inside the main receiver thread 
before waiting for it to exit.
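The proposed fix can be sketched like this (a minimal illustration with hypothetical class and method names, not the actual BlockReceiver/FSDataset code): the shutting-down receiver interrupts the responder before joining it, so a responder blocked indefinitely cannot hang the receiver's exit.

```java
public class ReceiverShutdownSketch {

    // Returns true if the responder thread exited promptly after shutdown.
    static boolean shutdownReceiver() throws InterruptedException {
        // Stand-in for the responder thread, which in the real code can
        // block indefinitely (e.g. waiting for downstream acks).
        Thread responder = new Thread(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE); // simulates a blocking wait
            } catch (InterruptedException e) {
                // Treat the interrupt as the shutdown signal and exit.
            }
        }, "responder");
        responder.start();

        // The main receive thread is shutting down (e.g. after being
        // interrupted by FSDataset.writeToBlock() during block recovery).
        responder.interrupt();  // the proposed fix: interrupt before joining
        responder.join(5000);   // now returns promptly instead of hanging
        return !responder.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(shutdownReceiver() ? "responder exited" : "hung");
    }
}
```

Without the `interrupt()` call, the `join()` here would block until its timeout (and in the real code, indefinitely), which matches the stuck second datanode described above.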

> Write pipeline does not recover from first node failure.
> --------------------------------------------------------
>
>                 Key: HADOOP-3234
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3234
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.16.0
>            Reporter: Raghu Angadi
>            Priority: Blocker
>
> While investigating HADOOP-3132, we had a misconfiguration that resulted in 
> the client writing to the first datanode in the pipeline with a 15 second 
> write timeout. As a result, the client breaks the pipeline, marking the first 
> datanode (DN1) as the bad node. It then restarts the pipeline with the rest 
> of the datanodes. But the next (second) datanode was stuck waiting for the 
> earlier block-write to complete. So the client repeats this procedure until 
> it runs out of datanodes and the write fails.
> I think this should be a blocker either for 0.16 or 0.17.
