[ https://issues.apache.org/jira/browse/HDFS-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suresh Srinivas resolved HDFS-4272.
-----------------------------------
    Resolution: Duplicate

Seems like a duplicate of HDFS-4271.

> Problem in DFSInputStream read retry logic may cause early failure
> ------------------------------------------------------------------
>
>                 Key: HDFS-4272
>                 URL: https://issues.apache.org/jira/browse/HDFS-4272
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Binglin Chang
>            Assignee: Binglin Chang
>            Priority: Minor
>
> Assume the following call logic:
> {noformat}
> readWithStrategy()
>   -> blockSeekTo()
>   -> readBuffer()
>   -> reader.doRead()
>   -> seekToNewSource() adds currentNode to deadNodes, wishing to get a different datanode
>   -> blockSeekTo()
>   -> chooseDataNode()
>   -> block missing, clear deadNodes and pick the currentNode again
> seekToNewSource() returns false
> readBuffer() re-throws the exception and quits the loop
> readWithStrategy() gets the exception, and may fail the read call before trying MaxBlockAcquireFailures times.
> {noformat}
> Some issues with this logic:
> 1. The seekToNewSource() logic is broken because it may clear deadNodes in the middle of a retry.
> 2. The variable "int retries=2" in readWithStrategy() seems to conflict with MaxBlockAcquireFailures; should it be removed?
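The self-defeating cycle above can be sketched as a toy simulation. This is a hypothetical, heavily simplified model: the names loosely mirror DFSInputStream's methods, but none of this is the actual Hadoop code, and the single-replica setup is an assumption chosen to trigger the worst case.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the retry flow described in the issue, NOT real HDFS code.
public class RetrySketch {
    static Set<String> deadNodes = new HashSet<>();
    static List<String> replicas = List.of("dn1"); // single replica: worst case

    // chooseDataNode: if every replica is dead, clear deadNodes and pick one anyway
    static String chooseDataNode() {
        for (String dn : replicas) {
            if (!deadNodes.contains(dn)) return dn;
        }
        deadNodes.clear();          // <-- the problematic clearing mid-retry
        return replicas.get(0);     // may return the node we just marked dead
    }

    // seekToNewSource: mark the current node dead, hoping to get a different one
    static boolean seekToNewSource(String currentNode) {
        deadNodes.add(currentNode);
        String newNode = chooseDataNode(); // blockSeekTo() -> chooseDataNode()
        return !newNode.equals(currentNode);
    }

    public static void main(String[] args) {
        // seekToNewSource cannot switch nodes, yet the deadNodes bookkeeping
        // has been silently wiped, so the caller gives up early.
        System.out.println(seekToNewSource("dn1")); // prints false
        System.out.println(deadNodes.isEmpty());    // prints true
    }
}
```

Under this toy model, seekToNewSource() returns false while deadNodes ends up empty, matching the trace where readWithStrategy() bails out before MaxBlockAcquireFailures is exhausted.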
> I wrote a test to reproduce the scenario; here is part of the log:
> {noformat}
> 2012-12-05 22:55:15,135 WARN hdfs.DFSClient (DFSInputStream.java:readBuffer(596)) - Found Checksum error for BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from 127.0.0.1:50099 at 0
> 2012-12-05 22:55:15,136 INFO DataNode.clienttrace (BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: /127.0.0.1:50105, bytes: 4128, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: DS-91625336-192.168.0.101-50099-1354719314603, blockid: BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, duration: 2925000
> 2012-12-05 22:55:15,136 INFO hdfs.DFSClient (DFSInputStream.java:chooseDataNode(741)) - Could not obtain BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
> 2012-12-05 22:55:15,136 WARN hdfs.DFSClient (DFSInputStream.java:chooseDataNode(756)) - DFS chooseDataNode: got # 1 IOException, will wait for 274.34891931868265 msec.
> 2012-12-05 22:55:15,413 INFO DataNode.clienttrace (BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: /127.0.0.1:50106, bytes: 4128, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: DS-91625336-192.168.0.101-50099-1354719314603, blockid: BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, duration: 283000
> 2012-12-05 22:55:15,414 INFO hdfs.StateChange (FSNamesystem.java:reportBadBlocks(4761)) - *DIR* reportBadBlocks
> 2012-12-05 22:55:15,415 INFO BlockStateChange (CorruptReplicasMap.java:addToCorruptReplicasMap(66)) - BLOCK NameSystem.addToCorruptReplicasMap: blk_-705068286766485620 added as corrupt on 127.0.0.1:50099 by null because client machine reported it
> 2012-12-05 22:55:15,416 INFO hdfs.TestClientReportBadBlock (TestDFSInputStream.java:testDFSInputStreamReadRetryTime(94)) - catch IOExceptionorg.apache.hadoop.fs.ChecksumException: Checksum error: /testFile at 0 exp: 809972010 got: -1374622118
> 2012-12-05 22:55:15,431 INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1411)) - Shutting down the Mini HDFS Cluster
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira