On a busy cluster, it is possible for the client to believe it cannot fetch a
block when the client or datanodes are running slowly
-----------------------------------------------------------------------------------------------------------------------------------
Key: HADOOP-6038
URL: https://issues.apache.org/jira/browse/HADOOP-6038
Project: Hadoop Core
Issue Type: Improvement
Components: dfs
Affects Versions: 0.20.0, 0.19.1, 0.19.0
Environment: 100 node cluster, fedora, 1TB disk per machine available
for HDFS (two spindles) 16GB RAM, 8 cores
running datanode, TaskTracker, HBaseRegionServer and the task being executed by
the TaskTracker.
Reporter: Jim Kellerman
Fix For: 0.19.2, 0.20.1, 0.21.0
On a heavily loaded node, communication between DFSClient and a datanode can
time out or fail, leading DFSClient to believe the datanode is non-responsive
even though the datanode is, in fact, healthy. DFSClient may then run through
all the retries for that datanode and mark it "dead".
This can continue as DFSClient iterates through the other datanodes holding the
block, until DFSClient declares that it cannot find any servers for that block,
even though all n datanodes (where n = replication factor) are healthy (but
slow) and have valid copies of the block.
It is also possible that the process running DFSClient is itself too slow and
misses (or times out waiting for) responses from the datanode, again leading
DFSClient to believe that the datanode is dead.
Another possibility is that the block has been moved from one or more datanodes
since DFSClient$DFSInputStream.chooseDataNode() found the locations of the
block.
When the retries for each datanode and all datanodes are exhausted,
DFSClient$DFSInputStream.chooseDataNode() issues the warning:
{code}
if (nodes == null || nodes.length == 0) {
  LOG.info("No node available for block: " + blockInfo);
}
LOG.info("Could not obtain block " + block.getBlock()
    + " from any node: " + ie);
{code}
It would be an improvement, with no performance impact under normal conditions,
if, when DFSClient decides that it cannot find the block anywhere, it retried
locating the block by calling
{code}
private static LocatedBlocks callGetBlockLocations()
{code}
*once*, to attempt to recover from machine(s) being too busy, or from the block
having been relocated since the initial call to callGetBlockLocations(). If the
second attempt to find the block based on what the namenode told DFSClient also
fails, then issue the messages and give up by throwing the exception it does
today.
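The proposed control flow could look something like the following. This is only
a hedged sketch of the idea, not the actual DFSClient code: the interfaces
Locator and Reader, and the method names fetchBlockLocations() and readFrom(),
are hypothetical stand-ins for DFSClient$DFSInputStream.chooseDataNode() and
callGetBlockLocations().
{code}
import java.io.IOException;
import java.util.List;

public class RetryOnceSketch {
  // Hypothetical stand-in for callGetBlockLocations(): asks the namenode
  // where the block currently lives.
  interface Locator { List<String> fetchBlockLocations() throws IOException; }
  // Hypothetical stand-in for the existing per-datanode retry loop: tries
  // each listed datanode and throws only after all retries are exhausted.
  interface Reader { String readFrom(List<String> nodes) throws IOException; }

  static String readBlock(Locator locator, Reader reader) throws IOException {
    List<String> nodes = locator.fetchBlockLocations();
    try {
      // Normal path: all existing per-node retries happen inside readFrom().
      return reader.readFrom(nodes);
    } catch (IOException firstFailure) {
      // Proposed improvement: refresh the locations from the namenode *once*
      // before giving up, in case the nodes were merely busy or the block
      // was relocated since the initial lookup.
      nodes = locator.fetchBlockLocations();
      try {
        return reader.readFrom(nodes);
      } catch (IOException secondFailure) {
        // Second attempt also failed: behave as today and give up.
        throw new IOException(
            "Could not obtain block from any node", secondFailure);
      }
    }
  }
}
{code}
Because the extra namenode call happens only after every datanode and every
retry has already failed, the normal read path pays no additional cost.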
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.