[ https://issues.apache.org/jira/browse/HDFS-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Rose updated HDFS-10597:
--------------------------------
    Affects Version/s: 2.4.0
                       2.5.0

> DFSClient hangs if using hedged reads and all but one eligible replica is 
> down 
> -------------------------------------------------------------------------------
>
>                 Key: HDFS-10597
>                 URL: https://issues.apache.org/jira/browse/HDFS-10597
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.4.0, 2.5.0, 2.6.0, 2.7.0
>            Reporter: Michael Rose
>
> If hedged reads are enabled, then even when only a single datanode is
> available, the hedged read loop respects the ignored-nodes list and never
> sends more than one request, yet it keeps retrying to choose a datanode for
> quite some time. This is unfortunate: within the scope of a single request
> the ignored-nodes list is only ever added to, never removed from, so a
> single failed read either fails the entire request *or* delays the response.
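> For reference, hedged reads would be enabled on the client roughly like
> this (a sketch only; the property names are what I believe ship in 2.4+,
> and the 10ms threshold matches the scenario described below):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> // Sketch: enable hedged reads with a 10ms hedge threshold.
> // Property names are assumptions based on the 2.4+ hdfs-client.
> public class HedgedReadConfSketch {
>   public static Configuration hedgedReadConf() {
>     Configuration conf = new Configuration();
>     // A pool size > 0 turns hedged reads on for the DFSClient.
>     conf.setInt("dfs.client.hedged.read.threadpool.size", 5);
>     // Wait 10ms on the first read before issuing a hedged read.
>     conf.setLong("dfs.client.hedged.read.threshold.millis", 10);
>     return conf;
>   }
> }
> {code}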
> There's a secondary undesirable behavior here too: if a hedged read can't
> find a datanode, it considerably delays an otherwise successful response.
> To set the stage, let's say the hedged read timeout is 10ms and only a
> single replica is available, i.e. nodes=[DN1].
> 1. [0ms] {{DFSInputStream#hedgedFetchBlockByteRange}}: the first
> (non-hedged) read is sent to DN1; its future will take 50ms to succeed.
> ignoredNodes=[DN1]
> 2. [10ms] The poll times out, so a hedged request should be sent.
> 3. [10ms] {{DFSInputStream#chooseDataNode}} is called to find a node for
> the hedged request. Since ignoredNodes includes DN1, no node is available,
> so we re-query the NameNode for block locations, sleep, and try again.
> 4. [+3000ms] {{DFSInputStream#chooseDataNode}} is called. Since
> ignoredNodes includes DN1, we re-query the NameNode for block locations,
> sleep, and try again.
> 5. [+3000ms+6000ms] {{DFSInputStream#chooseDataNode}} is called. Since
> ignoredNodes includes DN1, we re-query the NameNode for block locations,
> sleep, and try again.
> 6. [+6000ms+9000ms] {{DFSInputStream#chooseDataNode}} is called. Since
> ignoredNodes includes DN1, we re-query the NameNode for block locations,
> sleep, and try again.
> 7. [27010ms] Control flow returns to
> {{DFSInputStream#hedgedFetchBlockByteRange}}, the completion service is
> polled, and the read that succeeded back at [50ms] is finally returned,
> some 27000ms late in the worst case (the expected delay would be about
> half that).
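> The arithmetic behind those waits, as I read the time-window backoff in
> {{DFSInputStream#chooseDataNode}} (worst case, i.e. the random factor at
> ~1.0; treat this as a sketch, not a quote of the code):
> {code:java}
> // Sketch of the retry backoff that yields the +3000 / +9000 / +15000ms
> // waits above, assuming waitTime = timeWindow * failures
> //                                + timeWindow * (failures + 1) * random
> // with timeWindow = 3000ms.
> public class ChooseDataNodeBackoffSketch {
>   public static void main(String[] args) {
>     final long timeWindow = 3000;
>     long total = 10; // the 10ms hedge threshold has already elapsed
>     for (int failures = 0; failures < 3; failures++) {
>       long worstCaseWait = timeWindow * failures + timeWindow * (failures + 1);
>       total += worstCaseWait;
>       System.out.println("retry #" + (failures + 1) + ": sleep up to "
>           + worstCaseWait + "ms, cumulative ~" + total + "ms");
>     }
>     // Prints sleeps of ~3000, ~9000, ~15000ms; cumulative ~27010ms before
>     // control returns to hedgedFetchBlockByteRange.
>   }
> }
> {code}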
> This is only one (relatively happy) scenario. If the first read eventually
> fails instead, the DFSClient still goes through the same retry cycle inside
> {{DFSInputStream#hedgedFetchBlockByteRange}} before failing.
> I've identified one way to fix the behavior, but I'd be interested in
> thoughts: in {{DFSInputStream#getBestNodeDNAddrPair}} there's a check that
> a node is not in the ignored list before it is allowed to be returned.
> Amending this check to short-circuit when only a single node is available
> avoids the regrettably useless retries, i.e.:
> {{nodes.length == 1 || ignoredNodes == null ||
> !ignoredNodes.contains(nodes[i])}}
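> In standalone form, the proposed eligibility check looks like this
> (illustration only, with hypothetical names; not the actual
> {{getBestNodeDNAddrPair}} code, which also consults deadNodes):
> {code:java}
> import java.util.Collection;
>
> // Illustration of the proposed short-circuit: with a single located
> // replica, the ignoredNodes list no longer starves node selection.
> public class BestNodeCheckSketch {
>   static <T> T chooseFirstEligible(T[] nodes, Collection<T> ignoredNodes) {
>     for (T node : nodes) {
>       if (nodes.length == 1                 // proposed short-circuit
>           || ignoredNodes == null
>           || !ignoredNodes.contains(node)) {
>         return node;
>       }
>     }
>     return null; // caller would refetch block locations and retry
>   }
> }
> {code}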
> However, with this change, if there's only one DN available, the hedged
> request will be sent to that same DN as well. Better behavior would be to
> fail hedged requests quickly *or* to push the waiting work into the hedge
> pool so that successful, fast reads aren't blocked by this issue.
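> As a toy illustration of the second option (plain Java, not DFSInputStream
> code): if the slow node-selection wait ran inside the submitted hedge task,
> the polling loop would still see the fast first read as soon as it
> completes:
> {code:java}
> import java.util.concurrent.CompletionService;
> import java.util.concurrent.ExecutorCompletionService;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> public class HedgePoolSketch {
>   public static void main(String[] args) throws Exception {
>     ExecutorService pool = Executors.newFixedThreadPool(2);
>     CompletionService<String> reads = new ExecutorCompletionService<>(pool);
>
>     // First read against DN1 succeeds at ~50ms.
>     reads.submit(() -> { Thread.sleep(50); return "first read (DN1)"; });
>     // Hedged read is stuck "choosing a datanode" for ~27s inside its task.
>     reads.submit(() -> { Thread.sleep(27000); return "hedged read"; });
>
>     // The poll loop is not blocked by the stuck hedge; it gets the fast
>     // result at ~50ms instead of waiting out the retries.
>     System.out.println(reads.take().get());
>     pool.shutdownNow();
>   }
> }
> {code}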
> In our situation we run an HBase cluster with HDFS RF=2 and hedged reads
> enabled; stopping a single datanode brings the whole cluster to a grinding
> halt.
> You can observe this behavior yourself by editing 
> {{TestPread#testMaxOutHedgedReadPool}}'s MiniDFSCluster to have a single 
> datanode.
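> The cluster tweak for that reproduction is roughly (a sketch; class and
> method names as in the hdfs test framework):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hdfs.MiniDFSCluster;
>
> // Sketch: build the test cluster with a single datanode so that
> // ignoredNodes=[DN1] starves hedged node selection.
> public class SingleDatanodeReproSketch {
>   public static MiniDFSCluster singleDatanodeCluster(Configuration conf)
>       throws Exception {
>     return new MiniDFSCluster.Builder(conf)
>         .numDataNodes(1)
>         .build();
>   }
> }
> {code}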



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
