[
https://issues.apache.org/jira/browse/HDFS-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Nauroth updated HDFS-6231:
--------------------------------
Attachment: HDFS-6231.1.patch
I found this problem from observing runs of {{TestPread}} that were hanging.
It turns out that on most fast machines, {{TestPread}} doesn't actually end up
triggering a hedged read. The initial read completes before the hedged read
threshold, so we don't bother. On one of my slower VMs, I was seeing the test
hang. I was then able to repro even on my fast machines by aggressively
down-tuning the hedged read threshold.
Here is a patch to fix the bug.
# {{DFSInputStream#getFromOneDataNode}}: This was the main problem. The
returned {{Callable}} needs to release a {{CountDownLatch}}, but it was doing
so only in the success case, not in the failure case. I changed it to release
the latch inside a finally clause.
# {{DFSInputStream#hedgedFetchBlockByteRange}}: After I applied the first
change, it exposed another problem here. If all datanodes die, then we need to
refetch block locations from the NameNode. That wasn't happening, because this
code used the condition {{futures == null}} to decide whether or not to refetch
block locations via a call to {{chooseDataNode}}. After a hedged read has been
issued, {{futures}} is always non-null, so this wasn't sufficient. I changed
the code to check for empty {{futures}}. The reason this works is that
{{getFirstToComplete}} removes failed futures from the list. This means that
if all datanodes die, then {{futures}} drops back to an empty list, and then we
go into {{chooseDataNode}} to refetch block locations.
# {{TestPread}}: I lowered the hedged read threshold substantially so that this
test really does issue hedged reads even on fast machines. That ought to help
us catch regressions in the future. Now that hedged reads are really happening
during the test runs, I found that I needed to reset the metrics counts in
order to satisfy some assertions. This is required because the metrics
instance is static/global.
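For illustration, here is a minimal standalone sketch of the pattern behind the first change: counting down the latch in a finally clause so that a waiter is released whether the read succeeds or fails. This is not code from the patch. The class {{LatchReleaseSketch}} and the methods {{readTask}} and {{latchReleased}} are hypothetical names, and the simulated failure stands in for a datanode read error.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LatchReleaseSketch {

    // Hypothetical stand-in for the Callable returned by getFromOneDataNode.
    static Callable<Void> readTask(final CountDownLatch latch, final boolean fail) {
        return new Callable<Void>() {
            @Override
            public Void call() throws Exception {
                try {
                    if (fail) {
                        // Simulated read failure (assumption for this sketch).
                        throw new IOException("simulated datanode failure");
                    }
                    return null;
                } finally {
                    // Runs on success and on failure, so the waiter is never stranded.
                    latch.countDown();
                }
            }
        };
    }

    // Returns true if the latch was released within the timeout.
    static boolean latchReleased(boolean fail) {
        CountDownLatch latch = new CountDownLatch(1);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            pool.submit(readTask(latch, fail));
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("released on success: " + latchReleased(false));
        System.out.println("released on failure: " + latchReleased(true));
    }
}
```

If the {{countDown}} call were moved out of the finally clause and placed only after a successful read, the failure case would leave the caller blocked in {{await}} until the timeout, which mirrors the hang described here.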
I've had multiple successful test runs of {{TestPread}} with this patch on both
my fast Mac and my slow Windows VM.
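To make the second change concrete, here is a small standalone sketch contrasting the old {{futures == null}} condition with a check for an empty list. It is an illustration under assumptions, not the patch: the class {{RefetchConditionSketch}} and its methods are hypothetical, {{CompletableFuture}} stands in for the real read futures, {{dropFailed}} only imitates the removal of failed futures described for {{getFirstToComplete}}, and the combined null-or-empty test is my guess at a safe form of the condition.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class RefetchConditionSketch {

    // Old condition: true only before any read has been issued.
    static boolean needsNewNodeOld(List<CompletableFuture<Integer>> futures) {
        return futures == null;
    }

    // New condition (assumed form): also true once every failed future is gone.
    static boolean needsNewNodeNew(List<CompletableFuture<Integer>> futures) {
        return futures == null || futures.isEmpty();
    }

    // Hypothetical imitation of failed-future removal.
    static void dropFailed(List<CompletableFuture<Integer>> futures) {
        futures.removeIf(CompletableFuture::isCompletedExceptionally);
    }

    // Builds a list in which every outstanding read has failed.
    static List<CompletableFuture<Integer>> allDeadScenario() {
        List<CompletableFuture<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            CompletableFuture<Integer> f = new CompletableFuture<>();
            f.completeExceptionally(new IOException("simulated dead datanode " + i));
            futures.add(f);
        }
        return futures;
    }

    public static void main(String[] args) {
        List<CompletableFuture<Integer>> futures = allDeadScenario();
        dropFailed(futures);
        System.out.println("old condition picks a new node: " + needsNewNodeOld(futures));
        System.out.println("new condition picks a new node: " + needsNewNodeNew(futures));
    }
}
```

In this scenario the list is non-null but empty after the failed futures are dropped, so the old condition reports false and the new one reports true.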
> DFSClient hangs infinitely if using hedged reads and all eligible datanodes
> die.
> --------------------------------------------------------------------------------
>
> Key: HDFS-6231
> URL: https://issues.apache.org/jira/browse/HDFS-6231
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Affects Versions: 3.0.0, 2.4.0
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Attachments: HDFS-6231.1.patch
>
>
> When using hedged reads, and all eligible datanodes for the read get flagged
> as dead or ignored, then the client is supposed to refetch block locations
> from the NameNode to retry the read. Instead, we've seen that the client can
> hang indefinitely.