[ 
https://issues.apache.org/jira/browse/HDFS-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Walter Su updated HDFS-8729:
----------------------------
    Attachment: HDFS-8729.02.patch

bq. In my test looks like the reason of the timeout is a race scenario in the 
block recovery process: the second dn sends block report after the block 
truncation is finished thus its replica is marked as corrupted. However the 
replication monitor cannot schedule an extra replica because there are only 3 
datanodes in the test. 

You are right. I knew one replica was corrupted, but didn't know it was the second one. Thank you for the thorough analysis!
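To make the race concrete, here is a toy sketch (plain Java, not HDFS source; the class and method names are made up for illustration) of the generation-stamp comparison behind the corruption flag: truncate recovery bumps the block's generation stamp on the NameNode, so a block report carrying the pre-truncate stamp no longer matches and the replica is treated as corrupt.

```java
// Toy sketch, NOT HDFS source: only illustrates the genstamp mismatch.
public class GenStampCheck {
    // The NN treats a reported replica with a stale generation stamp as corrupt.
    static boolean isCorrupt(long reportedGenStamp, long blockGenStamp) {
        return reportedGenStamp != blockGenStamp;
    }

    public static void main(String[] args) {
        long oldGS = 1001L;     // stamp before truncate recovery
        long newGS = oldGS + 1; // recovery bumps the stamp by 1, as the test asserts
        // dn1 restarted before recovery finished, so its report still carries oldGS:
        System.out.println(isCorrupt(oldGS, newGS)); // true  -> replica marked corrupt
        System.out.println(isCorrupt(newGS, newGS)); // false -> healthy replica
    }
}
```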

What the 01 patch does is trigger the *second* block report, so the corrupted block can get deleted on dn1. Then the ReplicationMonitor can schedule copying the block back to dn1.

bq. To trigger block report or not before restarting DataNodes...

That's not what I do. In the 01 patch, checkBlockRecovery(p) makes sure truncation is completed; triggerBlockReports() is for the *second* block report.
{code}
    cluster.waitActive();
    checkBlockRecovery(p);
    ...
    assertEquals(newBlock.getBlock().getGenerationStamp(),
        oldBlock.getBlock().getGenerationStamp() + 1);

+    cluster.triggerBlockReports();
     // Wait replicas come to 3
     DFSTestUtil.waitReplication(fs, p, REPLICATION);
{code}

bq. it is very possible that the truncate can be done before the restarting.
That's very unlikely, because {{fs.truncate(p, newLength)}} is non-blocking.
{code}
    boolean isReady = fs.truncate(p, newLength); // non-blocking
    assertFalse(isReady);

    cluster.restartDataNode(dn0, true, true); // shutdown, restart and send registration
    cluster.restartDataNode(dn1, true, true); // shutdown, restart and send registration
    cluster.waitActive(); // wait until dn0 and dn1 get responses from NN about the registration
    // dn0 or dn1 gets the DNA_RECOVERY command
{code}

bq. So maybe a quick fix is to change the total number of DN from 3 to 4.
It would work too, but I prefer my approach, even though it makes the time spent in DFSTestUtil.waitReplication(..) 4-6 seconds longer (waiting for the deletion and the copy).
It's worth it, because the purpose of the test case is to schedule block recovery on dn0/dn1, which got restarted. Increasing the number of DNs would lower that chance.

Uploaded the 02 patch. It adds {{Thread.sleep(2000)}} to make sure it's the second block report.
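The intent of that sleep can be sketched as a toy ordering model (plain Java, not MiniDFSCluster code; all names here are made up): the restarted DN sends its first block report asynchronously at registration, so a forced report is only guaranteed to be the *second* one if the first has already landed. The sketch uses a latch where the patch uses a fixed sleep, but the ordering argument is the same.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Toy model, NOT HDFS code: shows why we must wait for the
// registration-triggered block report before forcing another one.
public class BlockReportOrdering {
    static final List<String> reportsSeen = new CopyOnWriteArrayList<>();
    static final CountDownLatch firstReportLanded = new CountDownLatch(1);

    static void datanodeRegisters() {
        // The registration-triggered report arrives on its own thread.
        new Thread(() -> {
            reportsSeen.add("registration-BR");
            firstReportLanded.countDown();
        }).start();
    }

    static void triggerBlockReport() {
        reportsSeen.add("forced-BR");
    }

    public static void main(String[] args) throws InterruptedException {
        datanodeRegisters();
        // Stands in for Thread.sleep(2000): wait until the first report
        // has definitely been processed...
        firstReportLanded.await();
        triggerBlockReport(); // ...so this one is guaranteed to be second.
        System.out.println(reportsSeen); // prints [registration-BR, forced-BR]
    }
}
```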

> Fix testTruncateWithDataNodesRestartImmediately occasionally failed
> -------------------------------------------------------------------
>
>                 Key: HDFS-8729
>                 URL: https://issues.apache.org/jira/browse/HDFS-8729
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Walter Su
>            Assignee: Walter Su
>            Priority: Minor
>         Attachments: HDFS-8729.01.patch, HDFS-8729.02.patch
>
>
> https://builds.apache.org/job/PreCommit-HDFS-Build/11449/testReport/
> https://builds.apache.org/job/PreCommit-HDFS-Build/11593/testReport/
> https://builds.apache.org/job/PreCommit-HDFS-Build/11596/testReport/
> https://builds.apache.org/job/PreCommit-HDFS-Build/11599/testReport/
> {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for 
> /test/testTruncateWithDataNodesRestartImmediately to reach 3 replicas
>       at 
> org.apache.hadoop.hdfs.DFSTestUtil.waitReplication(DFSTestUtil.java:761)
>       at 
> org.apache.hadoop.hdfs.server.namenode.TestFileTruncate.testTruncateWithDataNodesRestartImmediately(TestFileTruncate.java:814)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
