Client uselessly retries recoverBlock 5 times
---------------------------------------------
                 Key: HDFS-1236
                 URL: https://issues.apache.org/jira/browse/HDFS-1236
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 0.20.1
            Reporter: Thanh Do
             Summary: Client uselessly retries recoverBlock 5 times

The same behavior is also seen in the append protocol (HDFS-1229).

The setup:
# available datanodes = 4
Replication factor = 2 (hence there are 2 datanodes in the pipeline)
Failure type = bad disk at a datanode (not a crash)
# failures = 2
# disks / datanode = 1

Where/when the failures happen:
The disks of the two datanodes in the pipeline go bad at the same time, during the second phase of the pipeline (the data transfer phase).

Details:
In this case the client calls processDatanodeError, which calls datanode.recoverBlock() on those two datanodes. Since both datanodes have bad disks (although they are still alive), recoverBlock() fails. The client's retry logic ends when the streamer is closed (close == true), but before that happens the client retries 5 times (maxRecoveryErrorCount) and fails every time. What is interesting is that during each retry there is a one-second wait in DataStreamer.run (i.e. dataQueue.wait(1000)), so there is a 5-second total wait before the failure is declared. This is a different bug than HDFS-1235, where the client retries 3 times with a 6-second wait (resulting in 25 seconds of wait time). In this experiment the total wait time we observe is only 12 seconds (we are not sure why it is 12). The DFSClient then quits without contacting the namenode again (say, to ask for a new set of two datanodes). So, interestingly, we find another bug showing that the client retry logic is complex and non-deterministic, depending on where and when failures happen.
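To make the retry path concrete, here is a minimal, self-contained Java sketch of the behavior described above. It is not the actual 0.20.1 DFSClient/DataStreamer code: the class name RecoverBlockRetrySketch and the Datanode stub are hypothetical, and only recoverBlock(), processDatanodeError(), maxRecoveryErrorCount, the close flag, and the one-second dataQueue.wait(1000) are taken from this report.

{code:java}
// Hypothetical sketch of the retry behavior reported above; not the real
// DFSClient/DataStreamer code from 0.20.1.
public class RecoverBlockRetrySketch {

    // Stands in for a datanode whose disk has gone bad but whose process is still alive.
    interface Datanode {
        boolean recoverBlock() throws Exception;
    }

    private final Object dataQueue = new Object();  // stands in for DataStreamer's dataQueue
    private final int maxRecoveryErrorCount = 5;    // retry limit cited in the report
    private int recoveryErrorCount = 0;
    private boolean close = false;                  // streamer shutdown flag

    // Rough shape of processDatanodeError(): try recoverBlock() on the pipeline
    // datanodes; if every attempt fails, bump the error count and report failure.
    boolean processDatanodeError(Datanode[] pipeline) {
        for (Datanode dn : pipeline) {
            try {
                if (dn.recoverBlock()) {
                    return true;                    // recovery succeeded on some datanode
                }
            } catch (Exception e) {
                // bad disk: the datanode is alive but recoverBlock() keeps failing
            }
        }
        recoveryErrorCount++;
        return false;
    }

    // Rough shape of the DataStreamer.run() retry loop: wait one second on
    // dataQueue, retry, and give up after maxRecoveryErrorCount failures,
    // without ever asking the namenode for a fresh set of datanodes.
    void run(Datanode[] pipeline) throws InterruptedException {
        while (!close) {
            synchronized (dataQueue) {
                dataQueue.wait(1000);               // the 1-second wait per retry
            }
            if (processDatanodeError(pipeline)) {
                close = true;                       // pipeline recovered
            } else if (recoveryErrorCount >= maxRecoveryErrorCount) {
                close = true;                       // 5 failures: give up for good
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Two datanodes, both with bad disks, matching the experiment setup above.
        Datanode badDisk = () -> { throw new Exception("disk error"); };
        new RecoverBlockRetrySketch().run(new Datanode[] { badDisk, badDisk });
        System.out.println("Gave up after 5 failed recoverBlock rounds (~5 seconds of waiting).");
    }
}
{code}

The point of the sketch is that the loop only counts failures and sleeps between attempts; nothing in it ever goes back to the namenode for a replacement pipeline, which is why the 5 retries against the same two bad-disk datanodes are useless.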