[ https://issues.apache.org/jira/browse/HADOOP-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634852#action_12634852 ]
ZhuGuanyin commented on HADOOP-4291:
------------------------------------

It seems that after trying all datanodes, the client clears the dead-node list and retries, entering an infinite loop. We added some debug code as follows.

In DFSInputStream.blockSeekTo():

    private synchronized DatanodeInfo blockSeekTo(long target) throws IOException {
      // (setup elided: s, targetBlock, and chosenNode come from the
      // original method body)
      while (s == null) {
        LOG.info("blockSeekTo step 1");
        DNAddrPair retval = chooseDataNode(targetBlock);
        LOG.info("blockSeekTo step 2");
        try {
          blockReader = BlockReader.newBlockReader(/* original arguments elided */);
          return chosenNode;
        } catch (IOException ex) {
          // Connecting to this datanode failed: blacklist it and retry.
          LOG.info("blockSeekTo step 3");
          addToDeadNodes(chosenNode);
          if (s != null) {
            try {
              s.close();
            } catch (IOException iex) {
              LOG.info("blockSeekTo step 4");
            }
          }
          s = null;
          LOG.info("blockSeekTo step 5");
        }
        LOG.info("blockSeekTo step 6");
      }
      return chosenNode;
    }

In DFSInputStream.chooseDataNode():

    private DNAddrPair chooseDataNode(LocatedBlock block) throws IOException {
      LOG.info("chooseDataNode() step 1");
      while (true) {
        LOG.info("chooseDataNode() step 2");
        DatanodeInfo[] nodes = block.getLocations();
        try {
          LOG.info("chooseDataNode() step 3, failures = " + failures);
          DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
          LOG.info("chooseDataNode() step 4");
          InetSocketAddress targetAddr =
              DataNode.createSocketAddr(chosenNode.getName());
          LOG.info("chooseDataNode() step 5");
          return new DNAddrPair(chosenNode, targetAddr);
        } catch (IOException ie) {
          // bestNode() found no live replica outside the dead list.
          String blockInfo = block.getBlock() + " file=" + src;
          LOG.info("chooseDataNode() step 6, failures = " + failures);
          if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
            throw new IOException("Could not obtain block: " + blockInfo);
          }
          if (nodes == null || nodes.length == 0) {
            LOG.info("No node available for block: " + blockInfo);
          }
          LOG.info("Could not obtain block " + block.getBlock()
              + " from any node: " + ie);
          try {
            Thread.sleep(3000);
          } catch (InterruptedException iex) {
          }
          LOG.info("chooseDataNode() step 7, failures = " + failures);
          deadNodes.clear(); // 2nd option is to remove only nodes[blockId]
          openInfo();
          failures++;
          LOG.info("chooseDataNode() step 8, failures = " + failures);
          continue;
        }
      }
    }
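To make the interaction between these two loops concrete, here is a minimal, self-contained model. Everything in it is a sketch for illustration: the class DeadLoopModel, the node names dn1..dn3, and the placement of the failures = 0 reset are invented, and the reset itself is inferred from the log below (where failures returns to 0 between chooseDataNode() calls), not taken from a quoted line of DFSClient.

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Toy model of blockSeekTo()/chooseDataNode(): all three replicas are
    // "truncated", so every connect attempt fails and lands in the dead list.
    public class DeadLoopModel {
      static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;

      static final List<String> nodes = Arrays.asList("dn1", "dn2", "dn3");
      static final Set<String> deadNodes = new HashSet<String>();
      static int failures = 0;

      // Mirrors chooseDataNode(): bestNode() picks the first live replica;
      // once every replica is dead, the catch path clears the list and retries.
      static String chooseDataNode() throws IOException {
        while (true) {
          for (String n : nodes) {
            if (!deadNodes.contains(n)) {
              return n; // steps 4/5: a node was chosen
            }
          }
          // steps 6-8: every replica is in the dead list
          if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
            throw new IOException("Could not obtain block"); // never reached
          }
          deadNodes.clear(); // forget all the datanodes that just failed
          failures++;
        }
      }

      public static void main(String[] args) throws IOException {
        // Simulate a dozen connect attempts driven by blockSeekTo().
        for (int attempt = 0; attempt < 12; attempt++) {
          String chosen = chooseDataNode();
          System.out.println("picked " + chosen + ", failures = " + failures);
          failures = 0;          // ASSUMPTION: the log shows failures back at 0
                                 // by the next chooseDataNode() call
          deadNodes.add(chosen); // connect fails -> addToDeadNodes(chosenNode)
        }
        System.out.println("... and so on forever; the IOException never fires");
      }
    }

Running this prints failures = 1 on every pick that follows a dead-list wipe and 0 everywhere else, so the failures >= MAX_BLOCK_ACQUIRE_FAILURES guard can never fire. That is exactly the pattern in the log below.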
After we run ./hadoop dfs -cat /1.txt, we get the following stdout:

    [EMAIL PROTECTED] baidu.com ~]$ ./hadoop fs -cat /1.txt
    08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
    08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
    08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
    08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
    08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 4
    08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 5
    08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 2
    08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 3
    08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 5
    08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 6
    [the identical blockSeekTo/chooseDataNode sequence repeats twice more,
     all at 21:00:44, for the other two replicas]
    08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
    08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
    08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
    08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
    08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 6, failures = 0
    08/09/26 21:00:44 INFO fs.DFSClient: Could not obtain block blk_1225 from any node:  java.io.IOException: No live nodes contain current block
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 7, failures = 0
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 8, failures = 1
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 1
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 4
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 5
    08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 2
    08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 3
    08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 5
    08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 6
    08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 1
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 1
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 4
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 5
    08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 2
    08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 3
    08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 5
    08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 6
    [one more identical attempt block at 21:00:47]
    08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 1
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 1
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
    08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 6, failures = 0
    08/09/26 21:00:47 INFO fs.DFSClient: Could not obtain block blk_1225 from any node: java.io.IOException: No live nodes contain current block
    .........................................................................................................

Two things stand out in this log. First, after all three replicas fail, steps 7 and 8 show deadNodes being cleared and failures going from 0 to 1, and the next bestNode() call succeeds again (step 3, failures = 1, then steps 4 and 5), so the client simply reconnects to the same truncated replicas. Second, by the following chooseDataNode() invocation the counter is back to 0 (step 3, failures = 0), so failures never accumulates toward MAX_BLOCK_ACQUIRE_FAILURES and the IOException that would end the loop is never thrown. That is the infinite loop.
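The "2nd option" comment in the debug code above points at one way out (evict only the replicas of the affected block). Another is simply to make the recovery counter monotonic. The variant of the toy model below is hypothetical, not a tested patch against DFSClient (refetchRounds is an invented name): if nothing ever resets the counter that guards deadNodes.clear(), the loop becomes bounded and the cat fails fast with "Could not obtain block" instead of spinning.

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Same toy model, but the recovery counter survives "successful" picks:
    // nothing resets it, so the guard fires after a bounded number of rounds.
    public class BoundedRetryModel {
      static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;

      static final List<String> nodes = Arrays.asList("dn1", "dn2", "dn3");
      static final Set<String> deadNodes = new HashSet<String>();
      static int refetchRounds = 0; // hypothetical: no success path resets it

      static String chooseDataNode() throws IOException {
        while (true) {
          for (String n : nodes) {
            if (!deadNodes.contains(n)) {
              return n;
            }
          }
          if (++refetchRounds > MAX_BLOCK_ACQUIRE_FAILURES) {
            throw new IOException("Could not obtain block"); // now reachable
          }
          deadNodes.clear();
        }
      }

      public static void main(String[] args) {
        try {
          while (true) { // blockSeekTo(): every connect attempt fails
            deadNodes.add(chooseDataNode());
          }
        } catch (IOException e) {
          System.out.println("terminated after " + refetchRounds
              + " recovery rounds: " + e.getMessage());
        }
      }
    }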
> MapReduce Streaming job hangs when all replicas of the input file are corrupted
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-4291
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4291
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.1
>            Reporter: ZhuGuanyin
>            Priority: Critical
>
> In some special cases all replicas of a given file have been truncated to zero
> length, but the namenode still holds the original size (we don't know why). A
> MapReduce streaming job will hang on the corrupted input file unless
> mapred.task.timeout is specified; even the dfs shell "cat" hangs when fetching
> data from it.
> We found that the job hangs at DFSInputStream.blockSeekTo() when choosing a
> datanode. The following test shows it:
> 1) Copy a small file to HDFS.
> 2) Find the file's blocks, log in to those datanodes, and truncate the blocks
> to zero length.
> 3) Cat the file through the dfs shell.
> 4) The cat command enters a dead loop.
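For anyone who wants to drive the four repro steps programmatically rather than through the shell, the sketch below does what "hadoop dfs -cat /1.txt" does. It is illustrative only: the class name CatRepro is invented, and it assumes the Hadoop 0.18 jars on the classpath and a hadoop-site.xml pointing at the test cluster; /1.txt is the file from the log above. With every replica truncated, the read() loop neither returns data nor throws.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal stand-in for "hadoop dfs -cat /1.txt".
    public class CatRepro {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration(); // picks up hadoop-site.xml
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/1.txt"));
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) { // hangs here: see blockSeekTo() above
          System.out.write(buf, 0, n);
        }
        System.out.flush();
        in.close();
        fs.close();
      }
    }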