Michael Rose created HDFS-10597:
-----------------------------------
Summary: DFSClient hangs if using hedged reads and all but one
eligible replica is down
Key: HDFS-10597
URL: https://issues.apache.org/jira/browse/HDFS-10597
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs-client
Affects Versions: 2.7.0, 2.6.0
Reporter: Michael Rose
If hedged reads are enabled and only a single datanode is available, the
hedged read loop respects the ignored nodes list and never sends more than one
request, yet it still retries for quite some time trying to choose a datanode.
This is unfortunate: within a single request the ignored nodes list is only
ever added to, never removed from, so a single failed read either fails the
entire request *or* delays the response.
There's actually a secondary undesirable behavior here too. To set the stage,
let's say 10ms is the hedged read timeout and we only have a single replica
available. If a hedged read can't find a datanode, it will delay a successful
response considerably.
1. [0ms] `DFSInputStream#hedgedFetchBlockByteRange` The first (non-hedged) read
is sent to DN1; the read takes 50ms to succeed. ignoredNodes=[DN1]
2. [+10ms] `DFSInputStream#chooseDataNode` is called. As ignoredNodes includes
DN1, we re-query the NameNode for block locations and sleep, trying again.
3. [+3000ms] `DFSInputStream#chooseDataNode` is called. As ignoredNodes
includes DN1, we re-query the NameNode for block locations and sleep, trying
again.
4. [+3000ms+6000ms] `DFSInputStream#chooseDataNode` is called. As ignoredNodes
includes DN1, we re-query the NameNode for block locations and sleep, trying
again.
5. [+6000ms+9000ms] `DFSInputStream#chooseDataNode` is called. As ignoredNodes
includes DN1, we re-query the NameNode for block locations and sleep, trying
again.
6. [27010ms] Control flow is restored to
`DFSInputStream#hedgedFetchBlockByteRange`, the completion service is polled,
and the read that succeeded at [50ms] is returned successfully, but with up to
27000ms of extra latency (worst case, expected value would be half).
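To make the arithmetic in the timeline above concrete, here is a minimal
sketch of the backoff it implies. It assumes a 3000ms base window and a sleep
of `window * failures` plus a random share of `window * (failures + 1)`; both
are inferred from the per-step delays listed above rather than copied from
`DFSInputStream`, and the class and method names are hypothetical.

```java
import java.util.Random;

/** Hypothetical sketch of the retry backoff implied by the timeline; not Hadoop code. */
final class HedgedReadBackoffSketch {
    // Assumed base retry window, inferred from the +3000ms step above.
    static final long TIME_WINDOW_MS = 3000L;

    /** Largest possible sleep after {@code failures} earlier failed attempts. */
    static long maxWaitMillis(int failures) {
        // Fixed grace period plus the full width of the expanding random window.
        return TIME_WINDOW_MS * failures + TIME_WINDOW_MS * (failures + 1);
    }

    /** One randomly drawn sleep for the given failure count. */
    static long sampleWaitMillis(int failures, Random rng) {
        return TIME_WINDOW_MS * failures
            + (long) (TIME_WINDOW_MS * (failures + 1) * rng.nextDouble());
    }

    /** Sum of the worst-case sleeps across {@code retries} failed lookups. */
    static long worstCaseTotalMillis(int retries) {
        long total = 0;
        for (int failures = 0; failures < retries; failures++) {
            total += maxWaitMillis(failures);
        }
        return total;
    }
}
```

Under these assumptions, `HedgedReadBackoffSketch.worstCaseTotalMillis(3)`
adds up 3000 + 9000 + 15000 = 27000ms, which matches the extra latency in the
last step.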
This is only one scenario (a happy scenario). Supposing that the first read
eventually fails, the DFSClient will still retry inside of
`DFSInputStream#hedgedFetchBlockByteRange` for the same number of retries
before failing.
I've identified one way to fix the behavior, but I'd be interested in thoughts.
In `DFSInputStream#getBestNodeDNAddrPair`, there's a check to see if a node is
in the ignored list before allowing it to be returned. Amending this check to
short-circuit if there's only a single available node avoids the regrettably
useless retries, that is:
`nodes.length == 1 || ignoredNodes == null || !ignoredNodes.contains(nodes[i])`
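A minimal, self-contained sketch of that amended check follows. It uses plain
strings in place of datanode descriptors, and the class and method names are
hypothetical; only the boolean expression itself comes from this report.

```java
import java.util.Collection;

/** Hypothetical stand-in for the selection loop; datanodes are plain strings here. */
final class BestNodeCheckSketch {
    /** The amended eligibility check: a lone replica bypasses the ignored list. */
    static boolean isEligible(String[] nodes, int i, Collection<String> ignoredNodes) {
        return nodes.length == 1
            || ignoredNodes == null
            || !ignoredNodes.contains(nodes[i]);
    }

    /** Returns the first eligible node, or null when every node is ignored. */
    static String chooseNode(String[] nodes, Collection<String> ignoredNodes) {
        for (int i = 0; i < nodes.length; i++) {
            if (isEligible(nodes, i, ignoredNodes)) {
                return nodes[i];
            }
        }
        return null;
    }
}
```

With a single replica `DN1` that is already in the ignored list, `chooseNode`
still returns `DN1`; with two replicas the ignored list is honored as before.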
However, with this change, if there's only one DN available, it'll send the
hedged request to it as well. Better behavior would be to fail hedged requests
quickly *or* push the waiting work into the hedge pool so that successful, fast
reads aren't blocked by this issue.
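One way to picture the second option (pushing the waiting work into the hedge
pool) is the sketch below. It is not a patch against `DFSInputStream`: the
reads and the slow datanode lookup are simulated with sleeps, and every name
in it is hypothetical. The point is only that the caller thread keeps polling
the completion service while the slow lookup runs on a pool thread.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: the slow hedge lookup runs in the pool, not on the caller. */
final class HedgePoolSketch {
    static String read(long firstReadMs, long hedgeThresholdMs, long hedgeLookupMs)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletionService<String> reads = new ExecutorCompletionService<>(pool);
            // First (non-hedged) read, simulated by a sleep.
            reads.submit(() -> { Thread.sleep(firstReadMs); return "primary"; });
            // Wait up to the hedged read threshold for the first read.
            Future<String> done = reads.poll(hedgeThresholdMs, TimeUnit.MILLISECONDS);
            if (done != null) {
                return done.get();
            }
            // Hedge: the (possibly very slow) node lookup happens inside the task,
            // so it cannot delay a first read that finishes in the meantime.
            reads.submit(() -> { Thread.sleep(hedgeLookupMs); return "hedged"; });
            return reads.take().get(); // whichever read completes first wins
        } finally {
            pool.shutdownNow(); // interrupt the losing task
        }
    }
}
```

For example, `HedgePoolSketch.read(50, 10, 2000)` returns `"primary"` as soon
as the simulated 50ms read finishes, instead of waiting out the 2000ms lookup.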
In our situation, we run an HBase cluster with HDFS RF=2 and hedged reads
enabled; stopping a single datanode brings the cluster to a grinding halt.
You can observe this behavior yourself by editing
TestPread#testMaxOutHedgedReadPool's MiniDFSCluster to have a single datanode.