[ 
https://issues.apache.org/jira/browse/HDFS-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1236.
-------------------------------

    Resolution: Invalid

I don't consider the retry useless: there may be transient errors preventing 
recovery (e.g. network errors). The 6-second sleep is addressed by HDFS-1054.

> Client uselessly retries recoverBlock 5 times
> ---------------------------------------------
>
>                 Key: HDFS-1236
>                 URL: https://issues.apache.org/jira/browse/HDFS-1236
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>    Affects Versions: 0.20.1
>            Reporter: Thanh Do
>
> Summary:
> Client uselessly retries recoverBlock 5 times
> The same behavior is also seen in the append protocol (HDFS-1229).
> The setup:
> + # available datanodes = 4
> + Replication factor = 2 (hence there are 2 datanodes in the pipeline)
> + Failure type = Bad disk at datanode (not crashes)
> + # failures = 2
> + # disks / datanode = 1
> + Where/when the failures happen: The disks of both datanodes in the 
> pipeline go bad at the same time during the 2nd phase of the pipeline 
> (the data transfer phase).
>  
> Details:
>  
> In this case, the client calls processDatanodeError,
> which calls datanode.recoverBlock on those two datanodes.
> But since both datanodes have bad disks (although they are still alive),
> recoverBlock() fails.
> In this case, the client's retry logic ends only when the streamer is 
> closed (close == true).
> Before that happens, the client retries 5 times
> (maxRecoveryErrorCount), failing every time, until
> it gives up.  What is interesting is that
> during each retry there is a 1-second wait in
> DataStreamer.run (i.e. dataQueue.wait(1000)).
> So the total wait should be 5 seconds before the client declares failure.
>  
> This is a different bug from HDFS-1235, where the client retries
> 3 times for 6 seconds (resulting in 25 seconds wait time).
> In this experiment, the total wait time we observe is only
> 12 seconds (we are not sure why it is 12). The DFSClient then quits without
> contacting the namenode again (say, to ask for a new set of
> two datanodes).
> So, interestingly, we have found another
> bug showing that the client retry logic is complex and non-deterministic,
> depending on where and when failures happen.
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
> Haryadi Gunawi (hary...@eecs.berkeley.edu)
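
For readers following the thread, the bounded retry described in the report can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual DFSClient code: the class RecoveryRetrySketch, the method recover, and the Result type are hypothetical names, and the 1-second pause is accounted for rather than actually slept. Only the limit of 5 (maxRecoveryErrorCount) and the 1000 ms wait come from the report. The sketch contrasts the two cases discussed here: a permanent failure (bad disks) that uses up every attempt, and a transient failure that clears before the limit is reached.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch of a bounded recovery retry; NOT the real DFSClient logic.
final class RecoveryRetrySketch {
    // Retry limit taken from the report (maxRecoveryErrorCount).
    static final int MAX_RECOVERY_ERROR_COUNT = 5;
    // Per-retry pause taken from the report (dataQueue.wait(1000)).
    static final long WAIT_PER_RETRY_MS = 1000;

    /** Outcome of one simulated recovery: success flag, attempts made, time waited. */
    static final class Result {
        final boolean recovered;
        final int attempts;
        final long waitedMs;

        Result(boolean recovered, int attempts, long waitedMs) {
            this.recovered = recovered;
            this.attempts = attempts;
            this.waitedMs = waitedMs;
        }
    }

    /**
     * Tries recovery up to MAX_RECOVERY_ERROR_COUNT times.
     * attemptSucceeds receives the 1-based attempt number and stands in for
     * a recoverBlock() call (an assumption made for illustration).
     * The wait is only added up, not slept, so the sketch runs instantly.
     */
    static Result recover(IntPredicate attemptSucceeds) {
        long waitedMs = 0;
        for (int attempt = 1; attempt <= MAX_RECOVERY_ERROR_COUNT; attempt++) {
            if (attemptSucceeds.test(attempt)) {
                return new Result(true, attempt, waitedMs);
            }
            waitedMs += WAIT_PER_RETRY_MS; // simulated pause after a failed attempt
        }
        return new Result(false, MAX_RECOVERY_ERROR_COUNT, waitedMs);
    }

    public static void main(String[] args) {
        Result badDisks = recover(attempt -> false);            // never recovers
        Result transientError = recover(attempt -> attempt >= 3); // clears on 3rd try
        System.out.println("bad disks: recovered=" + badDisks.recovered
                + " attempts=" + badDisks.attempts + " waitedMs=" + badDisks.waitedMs);
        System.out.println("transient: recovered=" + transientError.recovered
                + " attempts=" + transientError.attempts + " waitedMs=" + transientError.waitedMs);
    }
}
```

In this simplified model the bad-disk case always spends the full retry budget, whereas a transient error can recover well before the limit. The sketch does not attempt to explain the 12 seconds observed in the experiment.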

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
