DFSClient "Could not obtain block:..."
--------------------------------------

                 Key: HADOOP-5903
                 URL: https://issues.apache.org/jira/browse/HADOOP-5903
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.20.0, 0.19.1, 0.19.0, 0.18.3
            Reporter: stack


We see this frequently in our application, HBase, where DFSClients are held 
open for long periods of time. It seems that any hiccup fetching a block 
becomes a permanent black mark: even after the serving datanode comes out of 
a temporary slowness or outage, the DFSClient never picks up on that fact.  
Our perception is that the client is too sensitive to the vagaries of cluster 
comings and goings and gives up too easily, especially given that a fresh 
DFSClient has no problem fetching the designated block.

Chatting with Raghu and Hairong yesterday, Hairong pointed out that the 
DFSClient frequently updates its list of block locations -- if a block has 
moved or a datanode is dead, the DFSClient should keep up with the changing 
state of the cluster (I see this happening in DFSClient#chooseDatanode on 
failure) -- but Raghu seems to have put his finger on our problem by noticing 
that the failures count is only ever incremented, never decremented.  ANY 
three failures, no matter how many blocks are in the file, and even if a block 
that failed once can now be read, are enough for the DFSClient to start 
throwing "Could not obtain block:...".
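
Here is a minimal sketch -- not the actual DFSClient code, just the counting 
pattern as we understand it: a single counter per open stream that only ever 
goes up and is checked against a fixed ceiling (three, per the above), so 
three transient failures anywhere over the life of a long-held client are 
terminal.

{code:java}
import java.io.IOException;

// A minimal sketch, NOT the actual DFSClient code, of the counting pattern as
// we understand it: one counter per open stream, bumped on any block-fetch
// failure, never reset, and checked against a fixed ceiling.
public class StreamFailureCounterSketch {
  private static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;
  private int failures = 0; // shared across every block this stream ever reads

  /** Called when a fetch of any block from its chosen datanode fails. */
  void noteBlockFetchFailure() {
    failures++; // only ever goes up, even if a later retry on that block succeeds
  }

  /** Gives up once three failures accumulate, regardless of which blocks failed. */
  void checkCanRetry(String blockId) throws IOException {
    if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
      throw new IOException("Could not obtain block: " + blockId);
    }
  }
}
{code}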

The failures counter needs to be a little smarter.  Would a patch that adds a 
map of blocks to failure counts be the right way to go?  Each failure should 
also note the datanode it was seen against, so that if that datanode came back 
online (on retry), we could decrement the mark that had been made against the 
block.
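
Something along these lines is what I have in mind -- a rough sketch only, 
with hypothetical names (PerBlockFailureTracker, noteFailure, noteSuccess) and 
nothing taken from the existing DFSClient:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the proposed bookkeeping; every name here is hypothetical
// and none of it is existing DFSClient API.  Failures are keyed by block and
// by the datanode they were seen against, so a successful retry can clear the
// mark instead of letting it count against the block forever.
public class PerBlockFailureTracker {
  private final int maxFailuresPerBlock;
  // blockId -> (datanode -> failure count)
  private final Map<String, Map<String, Integer>> failures =
      new HashMap<String, Map<String, Integer>>();

  public PerBlockFailureTracker(int maxFailuresPerBlock) {
    this.maxFailuresPerBlock = maxFailuresPerBlock;
  }

  /** Record one failed attempt to read blockId from the given datanode. */
  public synchronized void noteFailure(String blockId, String datanode) {
    Map<String, Integer> perNode = failures.get(blockId);
    if (perNode == null) {
      perNode = new HashMap<String, Integer>();
      failures.put(blockId, perNode);
    }
    Integer count = perNode.get(datanode);
    perNode.put(datanode, count == null ? 1 : count + 1);
  }

  /** A retry against the datanode succeeded: drop the marks made against it. */
  public synchronized void noteSuccess(String blockId, String datanode) {
    Map<String, Integer> perNode = failures.get(blockId);
    if (perNode != null) {
      perNode.remove(datanode);
      if (perNode.isEmpty()) {
        failures.remove(blockId);
      }
    }
  }

  /** Give up only when this particular block has failed too many times. */
  public synchronized boolean shouldGiveUp(String blockId) {
    Map<String, Integer> perNode = failures.get(blockId);
    if (perNode == null) {
      return false;
    }
    int total = 0;
    for (Integer count : perNode.values()) {
      total += count;
    }
    return total >= maxFailuresPerBlock;
  }
}
{code}

With per-block, per-datanode marks, the client would only give up on a block 
that itself keeps failing, instead of because the stream happened to see three 
hiccups elsewhere over its lifetime.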

What do folks think?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.