[
https://issues.apache.org/jira/browse/HDFS-11353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833002#comment-15833002
]
Yiqun Lin commented on HDFS-11353:
----------------------------------
Thanks [~manojg] for the comments. Yes, you are right. Actually, creating a
file will also trigger the {{handleDiskError}} logic. So I found another way
to improve the logic of the test
{{TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures}}.
In the original logic, the test fails the 2nd volume on the 3rd datanode and
then immediately waits for the datanode to be dead. If everything goes well,
that condition is satisfied, but in the failing runs the wait times out. So
one improvement I am thinking of is to check the failed volumes first: only
when all the volumes have failed should we wait for the datanode to shut down.
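To make this concrete, here is a rough sketch of the failed-volume check I
have in mind (a sketch only, not the attached patch; it assumes the test can
use {{GenericTestUtils#waitFor}} and {{FsDatasetSpi#getNumFailedVolumes}}):
{code}
// Sketch: first wait until the DataNode itself has registered both volume
// failures, and only then wait for it to be declared dead.
DataNodeTestUtils.injectDataDirFailure(dn3Vol2);
final DataNode dn3 = dns.get(2);
GenericTestUtils.waitFor(
    () -> dn3.getFSDataset().getNumFailedVolumes() == 2, 100, 30000);
// The DN should now consider itself dead since all of its volumes are gone.
DFSTestUtil.waitForDatanodeDeath(dn3);
{code}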
In addition, in the latest patch I keep the change for {{checkDiskErrorSync}},
since it can be reused in the future and will be useful for testing.
> Improve the unit tests relevant to DataNode volume failure testing
> ------------------------------------------------------------------
>
> Key: HDFS-11353
> URL: https://issues.apache.org/jira/browse/HDFS-11353
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 3.0.0-alpha2
> Reporter: Yiqun Lin
> Assignee: Yiqun Lin
> Attachments: HDFS-11353.001.patch
>
>
> Currently, many tests whose names start with
> {{TestDataNodeVolumeFailure*}} frequently time out or fail. I found one such
> failed test in a recent Jenkins build. The stack info:
> {code}
> org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures
> java.util.concurrent.TimeoutException: Timed out waiting for DN to die
> at
> org.apache.hadoop.hdfs.DFSTestUtil.waitForDatanodeDeath(DFSTestUtil.java:702)
> at
> org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures(TestDataNodeVolumeFailureReporting.java:208)
> {code}
> The related code:
> {code}
> /*
> * Now fail the 2nd volume on the 3rd datanode. All its volumes
> * are now failed and so it should report two volume failures
> * and that it's no longer up. Only wait for two replicas since
> * we'll never get a third.
> */
> DataNodeTestUtils.injectDataDirFailure(dn3Vol2);
> Path file3 = new Path("/test3");
> DFSTestUtil.createFile(fs, file3, 1024, (short)3, 1L);
> DFSTestUtil.waitReplication(fs, file3, (short)2);
> // The DN should consider itself dead
> DFSTestUtil.waitForDatanodeDeath(dns.get(2));
> {code}
> Here the code waits for the datanode to fail all of its volumes and then
> become dead, but this times out. It would be better to first check that all
> the volumes have failed and only then wait for the datanode to be dead.
> In addition, we can use the method {{checkDiskErrorSync}} to do the disk
> error check instead of creating files. In this JIRA, I would like to extract
> this logic and define it in {{DataNodeTestUtils}}, so that we can reuse this
> method for datanode volume failure testing in the future.
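> For example (illustrative only; the exact name and signature would follow
> whatever helper ends up in {{DataNodeTestUtils}}), the test could then
> trigger the disk check directly instead of creating a file:
> {code}
> // Hypothetical usage of an extracted synchronous disk-check helper; the
> // helper name below is a placeholder for whatever the patch adds.
> DataNodeTestUtils.injectDataDirFailure(dn3Vol2);
> DataNodeTestUtils.checkDiskErrorSync(dns.get(2));
> DFSTestUtil.waitForDatanodeDeath(dns.get(2));
> {code}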