[ https://issues.apache.org/jira/browse/HDFS-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Rose updated HDFS-10597:
--------------------------------
Affects Version/s: 2.4.0, 2.5.0
> DFSClient hangs if using hedged reads and all but one eligible replica is
> down
> -------------------------------------------------------------------------------
>
> Key: HDFS-10597
> URL: https://issues.apache.org/jira/browse/HDFS-10597
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Affects Versions: 2.4.0, 2.5.0, 2.6.0, 2.7.0
> Reporter: Michael Rose
>
> If hedged reads are enabled and only a single datanode is available, the
> hedged read loop respects the ignored nodes list and never sends more than
> one request, yet it keeps retrying for quite some time to choose a datanode.
> This is unfortunate: the ignored nodes list is only ever added to, and never
> removed from, within the scope of a single request, so a single failed read
> fails the entire request *or* delays responses.
> There's a secondary undesirable behavior here too. If a hedged read can't
> find a datanode, it considerably delays an otherwise successful response. To
> set the stage, let's say the hedged read timeout is 10ms and only a single
> replica is available, that is, nodes=[DN1].
> 1. [0ms] {{DFSInputStream#hedgedFetchBlockByteRange}} The first (non-hedged)
> read is sent to DN1. This read will take 50ms to succeed.
> ignoredNodes=[DN1]
> 2. [10ms] Poll timeout. Send a hedged request.
> 3. [10ms] {{DFSInputStream#chooseDataNode}} is called to find a node for the
> hedged request. As ignoredNodes includes DN1, there are no nodes available
> and we re-query the NameNode for block locations and sleep, trying again.
> 4. [+3000ms] {{DFSInputStream#chooseDataNode}} is called. As ignoredNodes
> includes DN1, we re-query the NameNode for block locations and sleep, trying
> again.
> 5. [+3000+6000ms] {{DFSInputStream#chooseDataNode}} is called. As
> ignoredNodes includes DN1, we re-query the NameNode for block locations and
> sleep, trying again.
> 6. [+6000+9000ms] {{DFSInputStream#chooseDataNode}} is called. As
> ignoredNodes includes DN1, we re-query the NameNode for block locations and
> sleep, trying again.
> 7. [27010ms] Control flow returns to
> {{DFSInputStream#hedgedFetchBlockByteRange}}, the completion service is
> polled, and the read that succeeded at [50ms] is finally returned, 27000ms
> later than necessary (worst case, expected value would be half).
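
The delays in steps 4 to 6 can be reproduced with a small calculation. This is a sketch only: it assumes the client sleeps for a fixed grace period of `timeWindow * failures` plus a random slice of `timeWindow * (failures + 1)` before each retry, with a 3000ms window and three attempts. Those values are assumptions based on the usual defaults for `dfs.client.retry.window.base` and `dfs.client.max.block.acquire.failures`, not figures taken from this report, and the class and method names are hypothetical.

```java
import java.util.function.DoubleSupplier;

public class HedgedRetryDelay {
    // Assumed defaults (not stated in the report).
    public static final int TIME_WINDOW_MS = 3000;
    public static final int MAX_FAILURES = 3;

    // Sleep before retry number `failures` (0-based): a fixed grace period
    // plus a random share (rand in [0, 1]) of a window that widens each time.
    public static double waitMs(int failures, double rand) {
        return TIME_WINDOW_MS * failures
                + TIME_WINDOW_MS * (failures + 1) * rand;
    }

    // Total time spent sleeping across all retries for a given random source.
    public static double totalWaitMs(DoubleSupplier rand) {
        double total = 0;
        for (int failures = 0; failures < MAX_FAILURES; failures++) {
            total += waitMs(failures, rand.getAsDouble());
        }
        return total;
    }

    public static void main(String[] args) {
        for (int failures = 0; failures < MAX_FAILURES; failures++) {
            System.out.println("retry " + failures
                    + ": fixed " + waitMs(failures, 0.0)
                    + " ms, upper bound " + waitMs(failures, 1.0) + " ms");
        }
        System.out.println("upper bound on total sleep: "
                + totalWaitMs(() -> 1.0) + " ms");
    }
}
```

Under these assumptions the three sleeps are bounded by 3000ms, 9000ms and 15000ms, which sums to the 27000ms quoted in step 7.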
> This is only one scenario, and the happy one. If the first read eventually
> fails, the DFSClient will still go through the same retries inside
> {{DFSInputStream#hedgedFetchBlockByteRange}} before failing.
> I've identified one way to fix the behavior, but I'd be interested in
> thoughts:
> In {{DFSInputStream#getBestNodeDNAddrPair}}, there's a check to see if a node
> is in the ignored list before allowing it to be returned. Amending this check
> to short-circuit when there's only a single available node avoids the
> useless retries, that is:
> {{nodes.length == 1 || ignoredNodes == null ||
> !ignoredNodes.contains(nodes[i])}}
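
To make the effect of the amended check concrete, here is a minimal sketch that compares the two predicates. It is not the HDFS code: datanodes are modelled as plain strings rather than real datanode objects, and `IgnoredNodeCheck`, `eligibleOriginal`, `eligibleProposed` and `pick` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class IgnoredNodeCheck {
    // Check as described in the report: skip any node in the ignored list.
    public static boolean eligibleOriginal(String[] nodes, int i,
                                           Collection<String> ignoredNodes) {
        return ignoredNodes == null || !ignoredNodes.contains(nodes[i]);
    }

    // Proposed check: a lone replica is always eligible, ignored or not.
    public static boolean eligibleProposed(String[] nodes, int i,
                                           Collection<String> ignoredNodes) {
        return nodes.length == 1 || ignoredNodes == null
                || !ignoredNodes.contains(nodes[i]);
    }

    // Returns the first eligible node, or null if none qualifies
    // (null is the case that triggers the re-query and sleep loop).
    public static String pick(String[] nodes, Collection<String> ignoredNodes,
                              boolean useProposed) {
        for (int i = 0; i < nodes.length; i++) {
            boolean ok = useProposed
                    ? eligibleProposed(nodes, i, ignoredNodes)
                    : eligibleOriginal(nodes, i, ignoredNodes);
            if (ok) {
                return nodes[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] nodes = {"DN1"};
        List<String> ignored = Arrays.asList("DN1");
        System.out.println("original: " + pick(nodes, ignored, false));
        System.out.println("proposed: " + pick(nodes, ignored, true));
    }
}
```

With a single replica that is already ignored, the original predicate finds nothing, while the amended one returns DN1 again.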
> However, with this change, if there's only one DN available, it'll send the
> hedged request to it as well. Better behavior would be to fail hedged
> requests quickly *or* push the waiting work into the hedge pool so that
> successful, fast reads aren't blocked by this issue.
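
One way to picture the "fail hedged requests quickly" alternative is a loop that checks for a hedge candidate before doing anything expensive, and otherwise goes straight back to polling the completion service. The sketch below is a standalone illustration under that assumption, not a patch for DFSInputStream: `FailFastHedge`, `hasHedgeCandidate`, `fakeRead` and `read` are hypothetical names, datanodes are plain strings, and reads are simulated with a sleep. Error handling for failed reads is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class FailFastHedge {
    // True if at least one replica has not been tried yet.
    public static boolean hasHedgeCandidate(Collection<String> nodes,
                                            Collection<String> ignored) {
        for (String node : nodes) {
            if (!ignored.contains(node)) {
                return true;
            }
        }
        return false;
    }

    // Simulated read: sleeps for latencyMs, then returns a marker string.
    static Callable<String> fakeRead(String node, long latencyMs) {
        return () -> {
            Thread.sleep(latencyMs);
            return "data from " + node;
        };
    }

    public static String read(List<String> nodes, long hedgeTimeoutMs,
                              long latencyMs) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            CompletionService<String> cs = new ExecutorCompletionService<>(pool);
            List<String> ignored = new ArrayList<>();
            // First (non-hedged) read goes to the first replica.
            ignored.add(nodes.get(0));
            cs.submit(fakeRead(nodes.get(0), latencyMs));
            while (true) {
                Future<String> done = cs.poll(hedgeTimeoutMs, TimeUnit.MILLISECONDS);
                if (done != null) {
                    return done.get();
                }
                if (hasHedgeCandidate(nodes, ignored)) {
                    // Hedge to the next untried replica.
                    for (String node : nodes) {
                        if (!ignored.contains(node)) {
                            ignored.add(node);
                            cs.submit(fakeRead(node, latencyMs));
                            break;
                        }
                    }
                }
                // No candidate: skip the retry-and-sleep path and poll again.
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> nodes = new ArrayList<>();
        nodes.add("DN1");
        System.out.println(read(nodes, 10, 50));
    }
}
```

In the single-replica case from the timeline, this loop never sleeps waiting for a second datanode, so the 50ms read is returned as soon as it completes.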
> In our situation, we run an HBase cluster with HDFS RF=2 and hedged reads
> enabled; stopping a single datanode brings the cluster to a grinding halt.
> You can observe this behavior yourself by editing
> {{TestPread#testMaxOutHedgedReadPool}}'s MiniDFSCluster to have a single
> datanode.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)