[
https://issues.apache.org/jira/browse/HDFS-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680116#comment-17680116
]
Tom McCormick commented on HDFS-16896:
--------------------------------------
Our current theory:
With hedged read enabled, reads take a different code path from the default
one (whether or not a hedged read is ever actually triggered). That path keeps
an ignoreList of the nodes already tried so that a hedged read does not go back
to the same node, but the list is never cleared. The default code path retries
up to 3 times after a failure, and each retry tries all 3 replicas of the block
again, for a total of 12 block read attempts (the initial pass plus 3 retries,
3 replicas each). In the hedged read case, all nodes end up in the ignoreList,
which is never cleared, so the retries have nothing left to try and the total
is 3 block read attempts.
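
To make the arithmetic in the theory above concrete, here is a minimal,
self-contained sketch. It is not the DFSInputStream code: the class, the
method names, the replica list and the constant MAX_FAILURES are all
hypothetical stand-ins, and every read is assumed to fail. It only models the
one difference described above, i.e. whether the ignore list is cleared before
each retry pass.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the retry behaviour described above; not Hadoop code.
class HedgedReadRetrySketch {
    // Assumption: 3 replicas and a retry budget of 3, as in the comment.
    static final List<String> REPLICAS = List.of("dn1", "dn2", "dn3");
    static final int MAX_FAILURES = 3;

    // Returns the first replica not in the ignore set, or null if all are ignored.
    static String chooseNode(Set<String> ignored) {
        for (String dn : REPLICAS) {
            if (!ignored.contains(dn)) {
                return dn;
            }
        }
        return null;
    }

    // Counts block read attempts when every read fails.
    // clearIgnoredOnRetry = true models the default path,
    // clearIgnoredOnRetry = false models the hedged path as theorized.
    static int countAttempts(boolean clearIgnoredOnRetry) {
        Set<String> ignored = new HashSet<>();
        int failures = 0;
        int attempts = 0;
        while (true) {
            String dn = chooseNode(ignored);
            if (dn == null) {
                // No candidate left: give up once the retry budget is spent.
                if (failures >= MAX_FAILURES) {
                    return attempts;
                }
                failures++;
                if (clearIgnoredOnRetry) {
                    ignored.clear();
                }
                continue;
            }
            attempts++;       // one (failing) read against this node
            ignored.add(dn);  // never try it again during this pass
        }
    }

    public static void main(String[] args) {
        System.out.println("default path: " + countAttempts(true));   // prints 12
        System.out.println("hedged path:  " + countAttempts(false));  // prints 3
    }
}
```

In this model the hedged path still burns through its retry budget, but each
retry finds every node already ignored, so no further reads are issued.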
> HDFS Client hedged read has increased failure rate than without hedged read
> ---------------------------------------------------------------------------
>
> Key: HDFS-16896
> URL: https://issues.apache.org/jira/browse/HDFS-16896
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Reporter: Tom McCormick
> Assignee: Tom McCormick
> Priority: Major
>
> When hedged read is enabled in the HDFS client, we see an increased failure
> rate on reads.
> *stacktrace*
>
> {code:java}
> Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1183972111-10.197.192.88-1590025572374:blk_17114848218_16043459722 file=/data/tracking/streaming/AdImpressionEvent/daily/2022/07/18/compaction_1/part-r-1914862.1658217125623.1362294472.orc
> at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1077)
> at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1060)
> at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1039)
> at org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1365)
> at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1572)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1535)
> at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
> at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
> at org.apache.hadoop.fs.RetryingInputStream.lambda$readFully$3(RetryingInputStream.java:172)
> at org.apache.hadoop.fs.RetryPolicy.lambda$run$0(RetryPolicy.java:137)
> at org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36)
> at org.apache.hadoop.fs.RetryPolicy.run(RetryPolicy.java:136)
> at org.apache.hadoop.fs.RetryingInputStream.readFully(RetryingInputStream.java:168)
> at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
> at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
> at io.trino.plugin.hive.orc.HdfsOrcDataSource.readInternal(HdfsOrcDataSource.java:76)
> ... 46 more
> {code}
>