Client uselessly retries recoverBlock 5 times
---------------------------------------------
                 Key: HDFS-1236
                 URL: https://issues.apache.org/jira/browse/HDFS-1236
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 0.20.1
            Reporter: Thanh Do
             Summary: Client uselessly retries recoverBlock 5 times

The same behavior is also seen in the append protocol (HDFS-1229).

The setup:
# available datanodes = 4
Replication factor = 2 (hence there are 2 datanodes in the pipeline)
Failure type = bad disk at a datanode (not a crash)
# failures = 2
# disks / datanode = 1

Where/when the failures happen:
The disks of the two datanodes in the pipeline go bad at the same time, during the second phase of the pipeline (the data transfer phase).

Details:
In this case the client calls processDatanodeError, which calls datanode.recoverBlock() on those two datanodes. Since both datanodes have bad disks (although they are still alive), recoverBlock() fails. The client's retry logic ends when the streamer is closed (close == true), but before that happens the client retries 5 times (maxRecoveryErrorCount) and fails every time. What is interesting is that during each retry there is a one-second wait in DataStreamer.run (i.e. dataQueue.wait(1000)), so there is a 5-second total wait before the failure is declared. This is a different bug than HDFS-1235, where the client retries 3 times with a 6-second wait (resulting in 25 seconds of wait time). In this experiment the total wait time we observe is only 12 seconds (we are not sure why it is 12). The DFSClient then quits without contacting the namenode again (say, to ask for a new set of two datanodes). So, interestingly, we find another bug showing that the client retry logic is complex and non-deterministic, depending on where and when failures happen.
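To make the retry path concrete, here is a minimal, self-contained Java sketch of the behavior described above. It is not the actual 0.20.1 DFSClient/DataStreamer code: the class name RecoverBlockRetrySketch and the Datanode stub are hypothetical, and only recoverBlock(), processDatanodeError(), maxRecoveryErrorCount, the close flag, and the one-second dataQueue.wait(1000) are taken from this report.

{code:java}
// Hypothetical sketch of the retry behavior reported above; not the real
// DFSClient/DataStreamer code from 0.20.1.
public class RecoverBlockRetrySketch {

    // Stands in for a datanode whose disk has gone bad but whose process is still alive.
    interface Datanode {
        boolean recoverBlock() throws Exception;
    }

    private final Object dataQueue = new Object();  // stands in for DataStreamer's dataQueue
    private final int maxRecoveryErrorCount = 5;    // retry limit cited in the report
    private int recoveryErrorCount = 0;
    private boolean close = false;                  // streamer shutdown flag

    // Rough shape of processDatanodeError(): try recoverBlock() on the pipeline
    // datanodes; if every attempt fails, bump the error count and report failure.
    boolean processDatanodeError(Datanode[] pipeline) {
        for (Datanode dn : pipeline) {
            try {
                if (dn.recoverBlock()) {
                    return true;                    // recovery succeeded on some datanode
                }
            } catch (Exception e) {
                // bad disk: the datanode is alive but recoverBlock() keeps failing
            }
        }
        recoveryErrorCount++;
        return false;
    }

    // Rough shape of the DataStreamer.run() retry loop: wait one second on
    // dataQueue, retry, and give up after maxRecoveryErrorCount failures,
    // without ever asking the namenode for a fresh set of datanodes.
    void run(Datanode[] pipeline) throws InterruptedException {
        while (!close) {
            synchronized (dataQueue) {
                dataQueue.wait(1000);               // the 1-second wait per retry
            }
            if (processDatanodeError(pipeline)) {
                close = true;                       // pipeline recovered
            } else if (recoveryErrorCount >= maxRecoveryErrorCount) {
                close = true;                       // 5 failures: give up for good
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Two datanodes, both with bad disks, matching the experiment setup above.
        Datanode badDisk = () -> { throw new Exception("disk error"); };
        new RecoverBlockRetrySketch().run(new Datanode[] { badDisk, badDisk });
        System.out.println("Gave up after 5 failed recoverBlock rounds (~5 seconds of waiting).");
    }
}
{code}

The point of the sketch is that the loop only counts failures and sleeps between attempts; nothing in it ever goes back to the namenode for a replacement pipeline, which is why the 5 retries against the same two bad-disk datanodes are useless.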