[ https://issues.apache.org/jira/browse/HDFS-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon resolved HDFS-1236.
-------------------------------
    Resolution: Invalid

I don't consider the retry useless - there may be transient errors preventing recovery (e.g. network errors). The 6-second sleep is addressed by HDFS-1054.

> Client uselessly retries recoverBlock 5 times
> ---------------------------------------------
>
>                 Key: HDFS-1236
>                 URL: https://issues.apache.org/jira/browse/HDFS-1236
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>    Affects Versions: 0.20.1
>            Reporter: Thanh Do
>
> Summary:
> The client uselessly retries recoverBlock 5 times.
> The same behavior is also seen in the append protocol (HDFS-1229).
>
> The setup:
> + # available datanodes = 4
> + Replication factor = 2 (hence there are 2 datanodes in the pipeline)
> + Failure type = bad disk at a datanode (not a crash)
> + # failures = 2
> + # disks / datanode = 1
> + Where/when the failures happen: this is a scenario where the disks of
>   both datanodes in the pipeline go bad at the same time during the 2nd
>   phase of the pipeline (the data transfer phase).
>
> Details:
> In this case, the client calls processDatanodeError, which calls
> datanode.recoverBlock() on those two datanodes. Since both datanodes
> have bad disks (although they are still alive), recoverBlock() fails.
> The client's retry logic ends when the streamer is closed (close ==
> true). But before that happens, the client retries 5 times
> (maxRecoveryErrorCount) and fails every time, until it finishes. What
> is interesting is that during each retry there is a one-second wait in
> DataStreamer.run (i.e. dataQueue.wait(1000)), so the client waits a
> total of 5 seconds before declaring failure.
>
> This is a different bug from HDFS-1235, where the client retries
> 3 times with 6-second sleeps (resulting in a 25-second wait time).
> In this experiment, the total wait time we observed was only
> 12 seconds (we are not sure why it is 12).
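The retry loop described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual DFSClient code: the class name, the `runRecovery` helper, and the always-failing `recoverBlock()` stub are hypothetical, modeling the scenario where both pipeline datanodes have bad disks. Only the constants (maxRecoveryErrorCount = 5, a 1000 ms wait per attempt) are taken from the report.

```java
// Sketch of the retry pattern described in the report (not real DFSClient
// code): retry block recovery up to maxRecoveryErrorCount times, waiting
// 1 second between attempts, mirroring dataQueue.wait(1000) in
// DataStreamer.run.
public class RecoverBlockRetrySketch {
    static final int MAX_RECOVERY_ERROR_COUNT = 5; // maxRecoveryErrorCount

    /** Hypothetical stand-in for datanode.recoverBlock(); always fails
     *  here, modeling two pipeline datanodes with bad disks. */
    static boolean recoverBlock() {
        return false;
    }

    /** Runs the retry loop; returns the number of attempts made.
     *  waitMs models the 1000 ms dataQueue.wait between retries. */
    static int runRecovery(long waitMs) throws InterruptedException {
        int errorCount = 0;
        while (errorCount < MAX_RECOVERY_ERROR_COUNT) {
            if (recoverBlock()) {
                return errorCount + 1; // recovery succeeded
            }
            errorCount++;
            Thread.sleep(waitMs); // stands in for dataQueue.wait(1000)
        }
        // After 5 failed attempts (~5 s of waiting with waitMs = 1000),
        // the client gives up without asking the namenode for a fresh
        // set of datanodes.
        return errorCount;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("attempts=" + runRecovery(1000));
    }
}
```

With every attempt failing, the loop makes exactly 5 attempts and then gives up, which matches the ~5-second total wait the report describes.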
> So the DFSClient quits without contacting the namenode again (say, to
> ask for a new set of two datanodes). Interestingly, this reveals
> another bug: the client retry logic is complex and non-deterministic,
> depending on where and when failures happen.
>
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
> Haryadi Gunawi (hary...@eecs.berkeley.edu)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.