[ 
https://issues.apache.org/jira/browse/HDFS-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619035#comment-14619035
 ] 

Jing Zhao commented on HDFS-8729:
---------------------------------

Whether or not we trigger block reports before restarting the DataNodes 
determines which code path is tested: if the DNs send their reports to the NN 
before restarting, the truncate will very likely complete before the restart. 
Otherwise the recovery process may happen after the DNs restart. In these two 
scenarios the block replicas reported by the DNs, and the block info stored in 
the NN, can be in different states when the restarted DNs send their first 
block reports to the NN.

In my test, the timeout appears to be caused by a race in the block recovery 
process: the second DN sends its block report after the block truncation has 
finished, so its replica is marked as corrupt. However, the replication 
monitor cannot schedule an extra replica because there are only 3 DataNodes in 
the test. So maybe a quick fix is to change the total number of DNs from 3 to 
4. What do you think, Walter?
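
To make the reasoning about 3 vs. 4 DataNodes concrete, here is a toy model 
(not HDFS code). It assumes that a node already holding any replica of the 
block, live or corrupt, is not eligible as a target for a new replica; the 
class and method names are hypothetical and only illustrate the counting 
argument.

```java
/** Toy model of the scenario above; not part of Hadoop. */
public class ReplicaRaceSketch {

    /**
     * Returns true if some DataNode is free to receive a new replica.
     * Assumption: nodes that already hold a live or corrupt replica
     * of the block are excluded as targets.
     */
    static boolean canScheduleExtraReplica(int totalDataNodes,
                                           int nodesWithLiveReplica,
                                           int nodesWithCorruptReplica) {
        int eligibleTargets =
            totalDataNodes - nodesWithLiveReplica - nodesWithCorruptReplica;
        return eligibleTargets > 0;
    }

    public static void main(String[] args) {
        // After the race: 2 live replicas, 1 replica marked corrupt.
        System.out.println("3 DNs: " + canScheduleExtraReplica(3, 2, 1)); // false
        System.out.println("4 DNs: " + canScheduleExtraReplica(4, 2, 1)); // true
    }
}
```

Under that assumption, the file can never get back to 3 live replicas in a 
3-node cluster, which would match the timeout in waitReplication; a fourth 
node gives the replication monitor a target.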

> Fix testTruncateWithDataNodesRestartImmediately occasionally failed
> -------------------------------------------------------------------
>
>                 Key: HDFS-8729
>                 URL: https://issues.apache.org/jira/browse/HDFS-8729
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Walter Su
>            Assignee: Walter Su
>            Priority: Minor
>         Attachments: HDFS-8729.01.patch
>
>
> https://builds.apache.org/job/PreCommit-HDFS-Build/11449/testReport/
> https://builds.apache.org/job/PreCommit-HDFS-Build/11593/testReport/
> https://builds.apache.org/job/PreCommit-HDFS-Build/11596/testReport/
> https://builds.apache.org/job/PreCommit-HDFS-Build/11599/testReport/
> {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for /test/testTruncateWithDataNodesRestartImmediately to reach 3 replicas
>       at org.apache.hadoop.hdfs.DFSTestUtil.waitReplication(DFSTestUtil.java:761)
>       at org.apache.hadoop.hdfs.server.namenode.TestFileTruncate.testTruncateWithDataNodesRestartImmediately(TestFileTruncate.java:814)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
