A block is stuck in ongoingRecovery due to exception not propagated
--------------------------------------------------------------------
Key: HDFS-1238
URL: https://issues.apache.org/jira/browse/HDFS-1238
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do

- Setup:
+ # datanodes = 2
+ replication factor = 2
+ failure type = transient (i.e. a Java I/O call that throws an IOException or returns false)
+ # failures = 2
+ when/where failures happen: (this is a subtle bug) the first failure is a transient failure at a datanode during the second phase. Because of this first failure, the DFSClient calls recoverBlock. The second failure is injected during this recoverBlock process (i.e. another failure during the recovery process).

- Details:

The expectation here is that, since the DFSClient performs lots of retries, two transient failures should be masked properly by those retries. We found one case where the failures are not transparent to the user. Here are the stack traces of when/where the two failures happen (please ignore the line numbers).

1. The first failure: an exception is thrown at call(void java.io.DataOutputStream.flush())
SourceLoc: org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java(252)
Stack Trace:
[0] datanode.BlockReceiver (flush:252)
[1] datanode.BlockReceiver (receivePacket:660)
[2] datanode.BlockReceiver (receiveBlock:743)
[3] datanode.DataXceiver (writeBlock:468)
[4] datanode.DataXceiver (run:119)

2. The second failure: false is returned at call(boolean java.io.File.renameTo(File))
SourceLoc: org/apache/hadoop/hdfs/server/datanode/FSDataset.java(105)
Stack Trace:
[0] datanode.FSDataset (tryUpdateBlock:1008)
[1] datanode.FSDataset (updateBlock:859)
[2] datanode.DataNode (updateBlock:1780)
[3] datanode.DataNode (syncBlock:2032)
[4] datanode.DataNode (recoverBlock:1962)
[5] datanode.DataNode (recoverBlock:2101)

This is what we found out: the first failure causes the DFSClient to call recoverBlock, which forces us to see the second failure. The second failure makes renameTo return false, which then causes an IOException to be thrown from the function that calls renameTo. But this IOException is not propagated properly: it is dropped inside DN.syncBlock(). Specifically, DN.syncBlock() calls DN.updateBlock(), which gets the exception, but syncBlock only catches it and prints a warning without re-throwing it. Thus syncBlock returns without any exception, and recoverBlock returns without executing the finally{} block (see below).

Now the client retries recoverBlock 3-5 more times, but these retries always see exceptions. The reason is that the first time we call recoverBlock(blk), this blk is put into an ongoingRecovery list inside DN.recoverBlock(). Normally, blk is removed (ongoingRecovery.remove(blk)) only inside the finally{} block. But since the exception is not propagated properly, that finally{} block is never executed, so blk is stuck forever in the ongoingRecovery list. Hence the next time the client performs the retry, it gets the error message "Block ... is already being recovered" and recoverBlock() throws an IOException. As a result, the client, which runs this whole process in the context of processDatanodeError, returns from that function with closed = true; it therefore never retries the whole thing again from the beginning and instead just returns an error.
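To make the reported pattern concrete, below is a minimal, self-contained Java sketch of the failure mode described above. It is not the actual DataNode code: the class name (SwallowedRecoveryFailure) and the stand-ins for syncBlock, updateBlock, recoverBlock, and ongoingRecovery are hypothetical and only mirror the names used in this report. In particular, the sketch omits the real finally{} cleanup entirely (rather than modeling why it is skipped) so that the resulting stuck entry and the "already being recovered" retry error are easy to see.

import java.io.IOException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the failure pattern described in this report.
// It is NOT the real DataNode code; names only mirror those used above.
public class SwallowedRecoveryFailure {

    // Stand-in for DataNode.ongoingRecovery: blocks currently being recovered.
    private final Set<String> ongoingRecovery =
            Collections.synchronizedSet(new HashSet<>());

    // Stand-in for FSDataset.updateBlock(): the underlying renameTo() "fails"
    // (returns false), which is surfaced to the caller as an IOException.
    private void updateBlock(String block) throws IOException {
        boolean renamed = false; // simulate java.io.File.renameTo(...) returning false
        if (!renamed) {
            throw new IOException("Cannot rename meta file for " + block);
        }
    }

    // Stand-in for DN.syncBlock(): the IOException from updateBlock() is caught
    // and only logged, so the caller sees a normal return even though the
    // recovery did not actually happen.
    private void syncBlock(String block) {
        try {
            updateBlock(block);
        } catch (IOException e) {
            // Anti-pattern: warn and swallow instead of re-throwing.
            System.err.println("WARN: failed to update " + block + ": " + e);
        }
    }

    // Stand-in for DN.recoverBlock(). In the report's analysis the entry is
    // only removed in a finally{} block that is never executed once the
    // exception is swallowed; this sketch simply omits the removal to show
    // the resulting stuck state.
    public void recoverBlock(String block) throws IOException {
        if (!ongoingRecovery.add(block)) {
            // This is the error every later client retry hits.
            throw new IOException("Block " + block + " is already being recovered");
        }
        syncBlock(block); // returns "successfully" even though recovery failed
        // No cleanup runs here, so the block stays in ongoingRecovery forever.
    }

    public static void main(String[] args) {
        SwallowedRecoveryFailure dn = new SwallowedRecoveryFailure();
        try {
            dn.recoverBlock("blk_1"); // first attempt: failure is swallowed
            dn.recoverBlock("blk_1"); // retry: now fails permanently
        } catch (IOException e) {
            System.err.println("Retry failed: " + e.getMessage());
        }
    }
}

Running main() prints the swallowed-failure warning on the first attempt and then "Block blk_1 is already being recovered" on the retry, which matches the retry behavior described in the report.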
This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)