[ 
https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046193#comment-14046193
 ] 

Chris Nauroth commented on HDFS-6591:
-------------------------------------

I agree with Andrew.  The problem with publishing a metric is that technically 
it becomes part of our contract.  If tools start querying the metric (even 
accidentally), then removing it in a later version is backwards-incompatible.

I'm holding off on committing anyway, because I may have found another problem. 
 I wanted to suggest removing the {{future.get()}} from here:

{code}
          if (future != null) {
            future.get();
            return;
          }
{code}

{{CompletionService#poll}} guarantees that the returned task (if not null) has 
completed.  We don't need the result of the task, so the {{get}} should be 
unnecessary.  However, when I remove that line, I start getting a test failure 
in {{TestPread#testPreadDFS}}:

{code}
testPreadDFS(org.apache.hadoop.hdfs.TestPread)  Time elapsed: 2.127 sec  <<< 
FAILURE!
java.lang.AssertionError: Pread Datanode Restart Test byte 0 differs. expected 
47 actual 0 expected:<0> but was:<47>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:743)
        at org.junit.Assert.assertEquals(Assert.java:118)
        at org.junit.Assert.assertEquals(Assert.java:555)
        at org.apache.hadoop.hdfs.TestPread.checkAndEraseData(TestPread.java:92)
        at 
org.apache.hadoop.hdfs.TestPread.datanodeRestartTest(TestPread.java:225)
        at org.apache.hadoop.hdfs.TestPread.dfsPreadTest(TestPread.java:439)
        at org.apache.hadoop.hdfs.TestPread.testPreadDFS(TestPread.java:250)
{code}

I've isolated the problem to 2 tests running in sequence: 
{{testMaxOutHedgedReadPool}} followed by {{testPreadDFS}}.  Removing 
{{future.get()}} shouldn't make a difference, so this might indicate a race 
condition that had been masked by the {{future.get()}} taking up a few extra 
cycles.  Something about the way {{testMaxOutHedgedReadPool}} fills the thread 
pool seems to set up the timing conditions just right to trigger the test 
failure.

I'm not yet certain what's causing this, but I have a theory.  The return value 
of {{getFromOneDataNode}} gets submitted as a {{Future}}.  As you pointed out, 
that code mutates the buffer that was passed in.  If we've already returned to 
the caller, and then a background task lands late and starts mutating the 
buffer, then we could see unexpected results.  We do cancel the unstarted 
tasks, but we don't interrupt them if they're already running.  Even if we did 
interrupt them, I don't think we could guarantee interruption before it mutates 
the buffer.

Let me know your thoughts on this.  Thanks again for working on this tricky 
patch!

> while loop is executed tens of thousands of times  in Hedged  Read
> ------------------------------------------------------------------
>
>                 Key: HDFS-6591
>                 URL: https://issues.apache.org/jira/browse/HDFS-6591
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.4.0
>            Reporter: LiuLei
>            Assignee: Liang Xie
>         Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591.txt, 
> LoopTooManyTimesTestCase.patch
>
>
> I download hadoop-2.4.1-rc1 code from 
> http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/,  I test the  Hedged  
> Read. I find the while loop in hedgedFetchBlockByteRange method is executed 
> tens of thousands of times.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to