[
https://issues.apache.org/jira/browse/HDFS-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Walter Su updated HDFS-8729:
----------------------------
Attachment: HDFS-8729.02.patch
bq. In my test it looks like the reason for the timeout is a race in the block recovery process: the second DN sends its block report after the block truncation has finished, so its replica is marked as corrupted. However, the replication monitor cannot schedule an extra replica because there are only 3 datanodes in the test.
You are right. I knew one replica was corrupted; I didn't know it was the second one. Thank you for the thorough analysis!
What I'm doing in the 01 patch is to trigger the *second* block report so the corrupted replica can get deleted on dn1. Then the ReplicationMonitor can schedule copying the block back to dn1.
bq. To trigger block report or not before restarting DataNodes...
That's not what I do. In the 01 patch, {{checkBlockRecovery(p)}} makes sure the truncation has completed; {{triggerBlockReports()}} is what triggers the *second* block report.
{code}
cluster.waitActive();
checkBlockRecovery(p);
...
assertEquals(newBlock.getBlock().getGenerationStamp(),
    oldBlock.getBlock().getGenerationStamp() + 1);
+ cluster.triggerBlockReports();
// Wait for replicas to come back to 3
DFSTestUtil.waitReplication(fs, p, REPLICATION);
{code}
bq. it is very possible that the truncate completes before the restart.
That's very unlikely, because {{fs.truncate(p, newLength)}} is non-blocking:
{code}
boolean isReady = fs.truncate(p, newLength); // non-blocking
assertFalse(isReady);
cluster.restartDataNode(dn0, true, true); // shut down, restart, and send registration
cluster.restartDataNode(dn1, true, true); // shut down, restart, and send registration
cluster.waitActive(); // wait until dn0 and dn1 get the NN's response to their registrations
// dn0 or dn1 receives a DNA_RECOVERY command
{code}
bq. So maybe a quick fix is to change the total number of DN from 3 to 4.
That works too, but I prefer my approach, even though with it the time spent in {{DFSTestUtil.waitReplication(..)}} is 4-6 seconds longer (waiting for the deletion and the copy). It's worth it, because the purpose of the test case is to schedule block recovery to dn0/dn1, the DataNodes that were restarted; increasing the number of DNs would lower the chance of that happening.
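For reference, the 3-to-4 change would look roughly like this (a minimal sketch; I'm assuming the standard {{MiniDFSCluster.Builder}} setup rather than quoting the actual test code):
{code}
// Sketch of the alternative fix: with a 4th DN the ReplicationMonitor has a
// spare target to copy to while the corrupted replica still sits on dn1.
Configuration conf = new HdfsConfiguration();
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
    .numDataNodes(4) // was 3
    .build();
cluster.waitActive();
{code}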
Uploaded the 02 patch. It adds {{Thread.sleep(2000)}} to make sure the triggered report really is the second block report.
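In context, the change would sit roughly like this (a sketch only; the exact placement in the actual 02 patch may differ):
{code}
// Assumption about placement: sleep long enough that dn1's first block report
// (sent at registration) has already arrived, so the triggered report is the
// *second* one, which lets the corrupted replica get deleted on dn1 as
// described above.
Thread.sleep(2000);
cluster.triggerBlockReports();
// Wait for replicas to come back to 3
DFSTestUtil.waitReplication(fs, p, REPLICATION);
{code}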
> Fix testTruncateWithDataNodesRestartImmediately occasionally failed
> -------------------------------------------------------------------
>
> Key: HDFS-8729
> URL: https://issues.apache.org/jira/browse/HDFS-8729
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Walter Su
> Assignee: Walter Su
> Priority: Minor
> Attachments: HDFS-8729.01.patch, HDFS-8729.02.patch
>
>
> https://builds.apache.org/job/PreCommit-HDFS-Build/11449/testReport/
> https://builds.apache.org/job/PreCommit-HDFS-Build/11593/testReport/
> https://builds.apache.org/job/PreCommit-HDFS-Build/11596/testReport/
> https://builds.apache.org/job/PreCommit-HDFS-Build/11599/testReport/
> {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for /test/testTruncateWithDataNodesRestartImmediately to reach 3 replicas
>     at org.apache.hadoop.hdfs.DFSTestUtil.waitReplication(DFSTestUtil.java:761)
>     at org.apache.hadoop.hdfs.server.namenode.TestFileTruncate.testTruncateWithDataNodesRestartImmediately(TestFileTruncate.java:814)
> {noformat}