[ https://issues.apache.org/jira/browse/HADOOP-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634852#action_12634852 ]

ZhuGuanyin commented on HADOOP-4291:
------------------------------------

It seems that after trying all the datanodes, the client clears the dead-node
list and retries, entering an infinite loop.

We added some debug code as follows:

 In DFSInputStream.blockSeekTo(): 
    
private synchronized DatanodeInfo blockSeekTo(long target) throws IOException {
  // ... setup elided: targetBlock, chosenNode and the socket s are
  // computed/declared before this loop ...
  while (s == null) {
    LOG.info("blockSeekTo step 1");
    DNAddrPair retval = chooseDataNode(targetBlock);
    chosenNode = retval.info;
    LOG.info("blockSeekTo step 2");
    try {
      blockReader = BlockReader.newBlockReader(/* arguments elided */);
      return chosenNode;
    } catch (IOException ex) {
      LOG.info("blockSeekTo step 3");
      // connection failed: dead-list this node and try the next replica
      addToDeadNodes(chosenNode);
      if (s != null) {
        try {
          s.close();
        } catch (IOException iex) {
          LOG.info("blockSeekTo step 4");
        }
      }
      s = null;
      LOG.info("blockSeekTo step 5");
    }
    LOG.info("blockSeekTo step 6");
  }
  return chosenNode;
}



In DFSInputStream.chooseDataNode():
private DNAddrPair chooseDataNode(LocatedBlock block)
    throws IOException {
  LOG.info("chooseDataNode() step 1");
  while (true) {
    LOG.info("chooseDataNode() step 2");
    DatanodeInfo[] nodes = block.getLocations();
    try {
      LOG.info("chooseDataNode() step 3, failures = " + failures);
      DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
      LOG.info("chooseDataNode() step 4");
      InetSocketAddress targetAddr =
          DataNode.createSocketAddr(chosenNode.getName());
      LOG.info("chooseDataNode() step 5");
      return new DNAddrPair(chosenNode, targetAddr);
    } catch (IOException ie) {
      // bestNode() throws once every replica is on the dead list
      String blockInfo = block.getBlock() + " file=" + src;
      LOG.info("chooseDataNode() step 6, failures = " + failures);
      if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
        throw new IOException("Could not obtain block: " + blockInfo);
      }

      if (nodes == null || nodes.length == 0) {
        LOG.info("No node available for block: " + blockInfo);
      }
      LOG.info("Could not obtain block " + block.getBlock()
          + " from any node:  " + ie);
      try {
        Thread.sleep(3000);
      } catch (InterruptedException iex) {
      }
      LOG.info("chooseDataNode() step 7, failures = " + failures);
      deadNodes.clear(); // 2nd option is to remove only nodes[blockId]
      openInfo();
      failures++;
      LOG.info("chooseDataNode() step 8, failures = " + failures);
      continue;
    }
  }
}
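
To make the interaction of the two loops easier to follow, here is a minimal,
self-contained model of the same control flow (all names in it are invented
for illustration; this is not the real DFSClient code). With every replica
corrupt, blockSeekTo() dead-lists each chosen node in turn, chooseDataNode()
clears the dead list once all replicas are on it, and, because the log below
shows failures back at 0 on every round, the MAX_BLOCK_ACQUIRE_FAILURES guard
never fires:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Toy model of the control flow above; all names are invented.
public class RetryLoopDemo {
  static final String[] REPLICAS = { "dn1", "dn2", "dn3" };
  static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;
  static final Set<String> deadNodes = new HashSet<String>();
  static int failures = 0;

  // Stand-in for chooseDataNode()/bestNode(): return the first replica
  // not on the dead list; once all are dead-listed, clear and retry.
  static String chooseDataNode() throws IOException {
    while (true) {
      for (String node : REPLICAS) {
        if (!deadNodes.contains(node)) {
          return node;
        }
      }
      if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
        throw new IOException("Could not obtain block");
      }
      System.out.println("all replicas dead, clearing list, failures = " + failures);
      deadNodes.clear();   // mirrors deadNodes.clear() above
      failures++;
    }
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for the while (s == null) loop in blockSeekTo(), where
    // every connection fails because every replica is truncated.
    for (int i = 0; i < 12; i++) {   // capped here; the real loop never exits
      failures = 0;                  // models the reset the log implies; we
                                     // have not pinned down where it happens
      String node = chooseDataNode();
      System.out.println("trying " + node + " ... connect fails");
      deadNodes.add(node);           // addToDeadNodes(chosenNode)
    }
    System.out.println("still looping; the failure guard never fired");
  }
}

The failures = 0 at the top of the loop only reproduces what the debug output
below shows (failures is back at 0 on each new chooseDataNode() call); the
sketch makes no claim about where the real reset happens.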

After we run ./hadoop dfs -cat /1.txt, we get the following stdout:

[EMAIL PROTECTED] baidu.com ~]$ ./hadoop fs -cat /1.txt
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 6, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: Could not obtain block blk_1225 from any node:  java.io.IOException: No live nodes contain current block

08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 7, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 8, failures = 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 6, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: Could not obtain block blk_1225 from any node:  java.io.IOException: No live nodes contain current block
......................................................................... (the same sequence repeats indefinitely)
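
The toy model also shows where a termination guard would have to live: in a
counter that survives between chooseDataNode() calls and that nothing resets.
Purely as an illustration (not a proposed patch; deadListClears is an invented
name), a bounded variant of the toy chooser looks like this:

  // Bounded variant of the toy chooseDataNode() (illustrative only):
  // deadListClears is never reset, so repeated rounds accumulate and the
  // loop eventually throws instead of clearing the dead list forever.
  static int deadListClears = 0;

  static String chooseDataNodeBounded() throws IOException {
    while (true) {
      for (String node : REPLICAS) {
        if (!deadNodes.contains(node)) {
          return node;
        }
      }
      if (++deadListClears > MAX_BLOCK_ACQUIRE_FAILURES) {
        throw new IOException("Could not obtain block: all replicas corrupt");
      }
      deadNodes.clear();   // still allow a bounded re-probe of all replicas
    }
  }

With the demo's 12-iteration cap removed, substituting chooseDataNodeBounded()
into the toy's main loop ends the run with an IOException on the fourth
attempt to clear the dead list instead of looping forever; in real code such a
counter would presumably have to be reset after a successful read. This is
only meant to illustrate the shape of a fix, like the "2nd option" comment in
the code above.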





> MapReduce Streaming job hangs when all replications of the input file have 
> been corrupted!
> -----------------------------------------------------------------------------------
>
>                 Key: HADOOP-4291
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4291
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.1
>            Reporter: ZhuGuanyin
>            Priority: Critical
>
> In some special cases, all replications of a given file have been truncated 
> to zero but the namenode still holds the original size (we don't know why). 
> The mapreduce streaming job will hang if we don't specify mapred.task.timeout 
> when the input files contain such a corrupted file; even the dfs shell "cat" 
> will hang when fetching data from this corrupted file.
> We found that the job hangs at DFSInputStream.blockSeekTo() when choosing a 
> datanode. The following test shows it:
> 1)    Copy a small file to hdfs. 
> 2)    Get the file's blocks, log in to those datanodes, and truncate the 
> blocks to zero.
> 3)    Cat this file through the dfs shell "cat".
> 4)    The cat command will enter a dead loop.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
