DFSClient "Could not obtain block:..."
--------------------------------------
Key: HADOOP-5903
URL: https://issues.apache.org/jira/browse/HADOOP-5903
Project: Hadoop Core
Issue Type: Bug
Components: dfs
Affects Versions: 0.20.0, 0.19.1, 0.19.0, 0.18.3
Reporter: stack
We see this frequently in our application, hbase, where dfsclients are held
open over long periods of time. It would seem that any hiccup fetching a
block becomes a permanent black mark: even after the serving datanode passes
out of a temporary slowness or outage, the dfsclient never seems to pick up on
that fact. Our perception is that the dfsclient is too sensitive to the vagaries
of cluster comings and goings and gives up too easily, especially given that a
fresh dfsclient has no problem fetching the designated block.
Chatting with Raghu and Hairong yesterday, Hairong pointed out that the
dfsclient frequently updates its list of block locations -- if a block has
moved or a datanode is dead, the dfsclient should keep up with the changing
state of the cluster (I see this happening in DFSClient#chooseDatanode on
failure) -- but Raghu looks to have put his finger on our problem by noticing
that the failures count is only ever incremented, never decremented. ANY three
failures, no matter how many blocks are in the file, and regardless of whether
a block that failed once now reads fine, are enough for the DFSClient to start
throwing "Could not obtain block:...".
The failures counter needs to be a little smarter. Would a patch that adds a
map of blocks to failure counts be the right way to go? Each failure should
also note the datanode it was gotten against, so that if that datanode came
back online (on retry), we could decrement the mark that had been made against
the block.
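A rough sketch of what I have in mind -- the class and method names below are
illustrative only, not a real patch against DFSClient:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-block, per-datanode failure bookkeeping as
// proposed above; names are illustrative, not actual DFSClient API.
class BlockFailureTracker {
    private static final int MAX_FAILURES_PER_BLOCK = 3; // assumed threshold

    // block id -> (datanode id -> failure count)
    private final Map<String, Map<String, Integer>> failuresByBlock = new HashMap<>();

    void recordFailure(String blockId, String datanodeId) {
        failuresByBlock
            .computeIfAbsent(blockId, b -> new HashMap<>())
            .merge(datanodeId, 1, Integer::sum);
    }

    // If a datanode that previously failed serves the block again on retry,
    // forget the black mark so a transient outage does not count forever.
    void recordSuccess(String blockId, String datanodeId) {
        Map<String, Integer> perNode = failuresByBlock.get(blockId);
        if (perNode != null) {
            perNode.remove(datanodeId);
            if (perNode.isEmpty()) {
                failuresByBlock.remove(blockId);
            }
        }
    }

    boolean shouldGiveUp(String blockId) {
        Map<String, Integer> perNode = failuresByBlock.get(blockId);
        if (perNode == null) {
            return false;
        }
        int total = perNode.values().stream().mapToInt(Integer::intValue).sum();
        return total >= MAX_FAILURES_PER_BLOCK;
    }
}
{code}

The idea being that a transient outage on one datanode no longer poisons reads
of unrelated blocks, and a block that reads fine on retry gets its slate wiped
clean.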
What do folks think?