[
https://issues.apache.org/jira/browse/HADOOP-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587830#action_12587830
]
rangadi edited comment on HADOOP-3234 at 4/10/08 7:30 PM:
---------------------------------------------------------------
I did not get stack traces, but I think this is what happens on the second
datanode during the second attempt:
- It receives the request for the block with 'isRecovery' set to true.
- Inside {{FSDataset.writeToBlock()}} it interrupts the main receive thread
and waits for that thread to exit.
- The main receive thread from the first attempt waits for the {{responder}} thread
to exit, but it does not interrupt it.
One fix could be to interrupt {{responder}} inside the main receiver thread.
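Something like the sketch below is what I have in mind (illustrative class and method names only, not the actual datanode code): the main receive thread interrupts {{responder}} before joining it, so the join cannot block indefinitely.
{code:java}
// Sketch only: on shutdown (e.g. after being interrupted by
// FSDataset.writeToBlock() during recovery), the main receive thread
// interrupts the responder thread before waiting for it to exit.
class ReceiverSketch {
  private volatile Thread responder;

  void startResponder(Runnable ackSender) {
    responder = new Thread(ackSender, "PacketResponder");
    responder.start();
  }

  // Called from the main receive thread's cleanup path.
  void stopResponder() throws InterruptedException {
    Thread r = responder;
    if (r != null) {
      r.interrupt();  // proposed fix: interrupt instead of only waiting
      r.join();       // wait for the responder thread to exit
    }
  }
}
{code}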
> Write pipeline does not recover from first node failure.
> --------------------------------------------------------
>
> Key: HADOOP-3234
> URL: https://issues.apache.org/jira/browse/HADOOP-3234
> Project: Hadoop Core
> Issue Type: Bug
> Affects Versions: 0.16.0
> Reporter: Raghu Angadi
> Priority: Blocker
>
> While investigating HADOOP-3132, we had a misconfiguration that resulted in
> the client writing to the first datanode in the pipeline with a 15 second write
> timeout. As a result, the client breaks the pipeline, marking the first datanode
> (DN1) as the bad node. It then restarts the pipeline with the rest of
> the datanodes. But the next (second) datanode was stuck waiting for
> the earlier block-write to complete. So the client repeats this procedure
> until it runs out of datanodes and the write fails.
> I think this should be a blocker either for 0.16 or 0.17.