[
https://issues.apache.org/jira/browse/HDFS-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon resolved HDFS-1236.
-------------------------------
Resolution: Invalid
I don't consider the retry useless - there may be transient errors preventing
recovery (e.g. network errors). The 6-second sleep is addressed by HDFS-1054.
> Client uselessly retries recoverBlock 5 times
> ---------------------------------------------
>
> Key: HDFS-1236
> URL: https://issues.apache.org/jira/browse/HDFS-1236
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs client
> Affects Versions: 0.20.1
> Reporter: Thanh Do
>
> Summary:
> Client uselessly retries recoverBlock 5 times
> The same behavior is also seen in the append protocol (HDFS-1229).
> The setup:
> + # available datanodes = 4
> + Replication factor = 2 (hence there are 2 datanodes in the pipeline)
> + Failure type = Bad disk at datanode (not crashes)
> + # failures = 2
> + # disks / datanode = 1
> + Where/when the failures happen: This is a scenario where the disk of each of
> the two datanodes in the pipeline goes bad at the same time during the 2nd
> phase of the pipeline (the data transfer phase).
>
> Details:
>
> In this case, the client calls processDatanodeError,
> which calls datanode.recoverBlock() on those two datanodes.
> But since both datanodes have bad disks (although they are still alive),
> recoverBlock() fails.
> The client's retry logic ends when the streamer is closed (close ==
> true).
> Before that happens, the client retries 5 times
> (maxRecoveryErrorCount) and fails every time, until
> it finishes. What is interesting is that
> during each retry there is a 1-second wait in
> DataStreamer.run (i.e. dataQueue.wait(1000)),
> so there is a 5-second total wait before the client declares failure.
>
> This is a different bug from HDFS-1235, where the client retries
> 3 times for 6 seconds (resulting in a 25-second wait time).
> In this experiment, the total wait time we observe is only
> 12 seconds (we are not sure why it is 12). So the DFSClient quits without
> contacting the namenode again (say, to ask for a new set of
> two datanodes).
> So, interestingly, we find another
> bug showing that the client retry logic is complex and nondeterministic,
> depending on where and when failures happen.
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do ([email protected]) and
> Haryadi Gunawi ([email protected])
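The retry behavior described in the report can be sketched as follows. This is a minimal simulation, not the actual HDFS source: the constant names mirror the fields mentioned above (maxRecoveryErrorCount, the dataQueue.wait(1000) call in DataStreamer.run), and recoverBlock() here always fails, standing in for the two pipeline datanodes with bad disks.

```java
// Hypothetical sketch of the client retry loop described in the report.
// Not the real DFSClient code: names and constants mirror the report's
// description only.
public class RecoverBlockRetrySketch {

    static final int MAX_RECOVERY_ERROR_COUNT = 5; // maxRecoveryErrorCount
    static final long WAIT_MILLIS = 1000;          // dataQueue.wait(1000)

    // Stand-in for datanode.recoverBlock(): always fails, as when every
    // datanode in the pipeline has a bad disk.
    static boolean recoverBlock() {
        return false;
    }

    // Returns the total simulated wait (in ms) before the client gives up.
    static long runRecoveryLoop() {
        long waitedMillis = 0;
        int recoveryErrorCount = 0;
        while (recoveryErrorCount < MAX_RECOVERY_ERROR_COUNT) {
            if (recoverBlock()) {
                return waitedMillis; // recovery succeeded, pipeline resumes
            }
            recoveryErrorCount++;
            waitedMillis += WAIT_MILLIS; // 1-second wait per retry
        }
        return waitedMillis; // 5 failed attempts -> 5 seconds of waiting
    }

    public static void main(String[] args) {
        System.out.println("total wait ms: " + runRecoveryLoop());
        // prints: total wait ms: 5000
    }
}
```

With permanent failures on both datanodes, the loop burns through all 5 retries and about 5 seconds of waiting, then gives up without asking the namenode for a fresh pipeline - which is the behavior the report flags.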
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.