[
https://issues.apache.org/jira/browse/HDFS-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689437#comment-17689437
]
ASF GitHub Bot commented on HDFS-16896:
---------------------------------------
mccormickt12 commented on code in PR #5322:
URL: https://github.com/apache/hadoop/pull/5322#discussion_r1107919545
##########
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java:
##########
@@ -1337,7 +1352,11 @@ private void hedgedFetchBlockByteRange(LocatedBlock block, long start,
} catch (InterruptedException ie) {
// Ignore and retry
}
- if (refetch) {
+ // if refetch is true then all nodes are in deadlist or ignorelist
+ // we should loop through all futures and remove them so we do not
Review Comment:
Fixed the comments. deadlist is actually deadNodes (I fixed that comment as
well).
When connections fail (in both the hedged and non-hedged code paths), nodes
are added to the deadNodes collection so that other nodes get tried. Once
`getBestNodeDNAddrPair` returns `null`, `chooseDataNode` calls
`refetchLocations`, which clears the deadNodes via `clearLocalDeadNodes()`
and now, with my change, also clears the ignore list.
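To make that interplay concrete, here is a simplified, self-contained sketch
of the flow (hypothetical names, nodes reduced to string IDs; not the actual
DFSInputStream code):
```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified sketch of the selection flow described above: failed nodes
// accumulate in deadNodes until no candidate is left, at which point
// refetchLocations() resets the bookkeeping, including (with this change)
// the ignore list.
public class NodeSelectionSketch {
  private final Set<String> deadNodes = new HashSet<>();
  private final Set<String> ignoredNodes = new HashSet<>();

  // Stand-in for getBestNodeDNAddrPair: the first replica that is
  // neither dead nor ignored, or null when every replica is excluded.
  private String getBestNode(List<String> replicas) {
    for (String node : replicas) {
      if (!deadNodes.contains(node) && !ignoredNodes.contains(node)) {
        return node;
      }
    }
    return null;
  }

  // Stand-in for chooseDataNode: when selection returns null, fall back
  // to refetchLocations() so every replica becomes a candidate again.
  String chooseNode(List<String> replicas) {
    String best = getBestNode(replicas);
    if (best == null) {
      refetchLocations();
      best = getBestNode(replicas);
    }
    return best;
  }

  // clearLocalDeadNodes() in the real code clears deadNodes; the change
  // discussed here additionally clears the ignore list.
  private void refetchLocations() {
    deadNodes.clear();
    ignoredNodes.clear();
  }
}
```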
Note that we have added an assumption to this method, `refetchLocations`.
This is the comment I added to `refetchLocations`:
```
/**
* RefetchLocations should only be called when there are no active requests
* to datanodes. In the hedged read case this means futures should be empty
*/
```
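To illustrate that assumption, a minimal sketch (again with hypothetical
names, not the actual patch) of a hedged path honoring the precondition
before calling `refetchLocations`:
```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

// Sketch of the precondition stated in the comment above: every
// outstanding hedged request is cancelled and removed before
// refetchLocations() runs, so no in-flight future can still be talking
// to a node whose ignore-list entry is about to be cleared.
final class HedgedRefetchSketch {
  private final List<Future<ByteBuffer>> futures = new ArrayList<>();

  void refetchWithNoActiveRequests() {
    // Drain all in-flight hedged reads first.
    for (Future<ByteBuffer> future : futures) {
      future.cancel(true); // interrupt the worker if it is still running
    }
    futures.clear();

    // The precondition now holds: no active requests to datanodes.
    refetchLocations();
  }

  private void refetchLocations() {
    // In the real code this clears deadNodes and, with this change,
    // the ignore list as well.
  }
}
```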
> HDFS Client hedged read has increased failure rate than without hedged read
> ---------------------------------------------------------------------------
>
> Key: HDFS-16896
> URL: https://issues.apache.org/jira/browse/HDFS-16896
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Reporter: Tom McCormick
> Assignee: Tom McCormick
> Priority: Major
> Labels: pull-request-available
>
> When hedged read is enabled by the HDFS client, we see an increased failure rate on reads.
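>
> For reference, a minimal sketch of the client settings that turn hedged reads on (the values here are illustrative, not recommendations):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> public class EnableHedgedReads {
>   public static Configuration hedgedConf() {
>     Configuration conf = new Configuration();
>     // A nonzero thread pool size is what enables hedged reads in the client.
>     conf.setInt("dfs.client.hedged.read.threadpool.size", 5);
>     // How long to wait on the first replica before hedging to another.
>     conf.setLong("dfs.client.hedged.read.threshold.millis", 500);
>     return conf;
>   }
> }
> {code}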
> *stacktrace*
>
> {code:java}
> Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1183972111-10.197.192.88-1590025572374:blk_17114848218_16043459722 file=/data/tracking/streaming/AdImpressionEvent/daily/2022/07/18/compaction_1/part-r-1914862.1658217125623.1362294472.orc
>   at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1077)
>   at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1060)
>   at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1039)
>   at org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1365)
>   at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1572)
>   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1535)
>   at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
>   at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
>   at org.apache.hadoop.fs.RetryingInputStream.lambda$readFully$3(RetryingInputStream.java:172)
>   at org.apache.hadoop.fs.RetryPolicy.lambda$run$0(RetryPolicy.java:137)
>   at org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36)
>   at org.apache.hadoop.fs.RetryPolicy.run(RetryPolicy.java:136)
>   at org.apache.hadoop.fs.RetryingInputStream.readFully(RetryingInputStream.java:168)
>   at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
>   at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
>   at io.trino.plugin.hive.orc.HdfsOrcDataSource.readInternal(HdfsOrcDataSource.java:76)
>   ... 46 more
> {code}
>