[
https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046193#comment-14046193
]
Chris Nauroth commented on HDFS-6591:
-------------------------------------
I agree with Andrew. The problem with publishing a metric is that technically
it becomes part of our contract. If tools start querying the metric (even
accidentally), then removing it in a later version is backwards-incompatible.
I'm holding off on committing anyway, because I may have found another problem.
I wanted to suggest removing the {{future.get()}} from here:
{code}
if (future != null) {
future.get();
return;
}
{code}
{{CompletionService#poll}} guarantees that the returned task (if not null) has
completed. We don't need the result of the task, so the {{get}} should be
unnecessary. However, when I remove that line, I start getting a test failure
in {{TestPread#testPreadDFS}}:
{code}
testPreadDFS(org.apache.hadoop.hdfs.TestPread) Time elapsed: 2.127 sec <<<
FAILURE!
java.lang.AssertionError: Pread Datanode Restart Test byte 0 differs. expected
47 actual 0 expected:<0> but was:<47>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.apache.hadoop.hdfs.TestPread.checkAndEraseData(TestPread.java:92)
at
org.apache.hadoop.hdfs.TestPread.datanodeRestartTest(TestPread.java:225)
at org.apache.hadoop.hdfs.TestPread.dfsPreadTest(TestPread.java:439)
at org.apache.hadoop.hdfs.TestPread.testPreadDFS(TestPread.java:250)
{code}
I've isolated the problem to 2 tests running in sequence:
{{testMaxOutHedgedReadPool}} followed by {{testPreadDFS}}. Removing
{{future.get()}} shouldn't make a difference, so this might indicate a race
condition that had been masked by the {{future.get()}} taking up a few extra
cycles. Something about the way {{testMaxOutHedgedReadPool}} fills the thread
pool seems to set up the timing conditions just right to trigger the test
failure.
I'm not yet certain what's causing this, but I have a theory. The return value
of {{getFromOneDataNode}} gets submitted as a {{Future}}. As you pointed out,
that code mutates the buffer that was passed in. If we've already returned to
the caller, and then a background task lands late and starts mutating the
buffer, then we could see unexpected results. We do cancel the unstarted
tasks, but we don't interrupt them if they're already running. Even if we did
interrupt them, I don't think we could guarantee interruption before it mutates
the buffer.
Let me know your thoughts on this. Thanks again for working on this tricky
patch!
> while loop is executed tens of thousands of times in Hedged Read
> ------------------------------------------------------------------
>
> Key: HDFS-6591
> URL: https://issues.apache.org/jira/browse/HDFS-6591
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Affects Versions: 2.4.0
> Reporter: LiuLei
> Assignee: Liang Xie
> Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591.txt,
> LoopTooManyTimesTestCase.patch
>
>
> I download hadoop-2.4.1-rc1 code from
> http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged
> Read. I find the while loop in hedgedFetchBlockByteRange method is executed
> tens of thousands of times.
--
This message was sent by Atlassian JIRA
(v6.2#6252)