[
https://issues.apache.org/jira/browse/HDFS-11353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yiqun Lin updated HDFS-11353:
-----------------------------
Description:
Currently, many tests whose names start with {{TestDataNodeVolumeFailure*}}
frequently time out or fail. I found one such failure in a recent Jenkins
build. The stack trace:
{code}
org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures
java.util.concurrent.TimeoutException: Timed out waiting for DN to die
    at org.apache.hadoop.hdfs.DFSTestUtil.waitForDatanodeDeath(DFSTestUtil.java:702)
    at org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures(TestDataNodeVolumeFailureReporting.java:208)
{code}
The related code:
{code}
/*
* Now fail the 2nd volume on the 3rd datanode. All its volumes
* are now failed and so it should report two volume failures
* and that it's no longer up. Only wait for two replicas since
* we'll never get a third.
*/
DataNodeTestUtils.injectDataDirFailure(dn3Vol2);
Path file3 = new Path("/test3");
DFSTestUtil.createFile(fs, file3, 1024, (short)3, 1L);
DFSTestUtil.waitReplication(fs, file3, (short)2);
// The DN should consider itself dead
DFSTestUtil.waitForDatanodeDeath(dns.get(2));
{code}
Here the test waits for all volumes of the datanode to fail and for the
datanode to then become dead, but the wait timed out. It would be better to
first check that all the volumes have failed and only then wait for the
datanode to die.
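The two-stage wait suggested above could look roughly like the following. This is only a minimal, self-contained sketch and not the actual patch: {{FakeDataNode}}, {{waitFor}} and {{waitForDeathAfterAllVolumesFail}} are hypothetical names invented for illustration and are not Hadoop APIs. A real test would work against the DataNode from the mini cluster and the existing {{DFSTestUtil}} helpers.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

public class VolumeFailureWaitSketch {

  /** Hypothetical stand-in for a DataNode; NOT a Hadoop class. */
  static class FakeDataNode {
    private final int totalVolumes;
    private int failedVolumes;
    private boolean alive = true;

    FakeDataNode(int totalVolumes) {
      this.totalVolumes = totalVolumes;
    }

    /** Marks one more volume as failed (never more than totalVolumes). */
    synchronized void failVolume() {
      if (failedVolumes < totalVolumes) {
        failedVolumes++;
      }
    }

    /** The node only goes down once it notices that no volume is left. */
    synchronized void shutdownIfNoVolumes() {
      if (failedVolumes == totalVolumes) {
        alive = false;
      }
    }

    synchronized boolean allVolumesFailed() {
      return failedVolumes == totalVolumes;
    }

    synchronized boolean isAlive() {
      return alive;
    }
  }

  /** Polls the condition until it holds or the timeout expires. */
  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs,
      String what) throws TimeoutException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!check.getAsBoolean()) {
      if (System.currentTimeMillis() >= deadline) {
        throw new TimeoutException("Timed out waiting for " + what);
      }
      Thread.sleep(intervalMs);
    }
  }

  /** Stage 1: all volumes failed. Stage 2: the node is dead. */
  static void waitForDeathAfterAllVolumesFail(FakeDataNode dn, long timeoutMs)
      throws TimeoutException, InterruptedException {
    waitFor(dn::allVolumesFailed, 10, timeoutMs, "all volumes to fail");
    waitFor(() -> !dn.isAlive(), 10, timeoutMs, "DN to die");
  }

  public static void main(String[] args) throws Exception {
    FakeDataNode dn = new FakeDataNode(2);
    dn.failVolume();
    // The second failure and the shutdown happen asynchronously.
    Thread background = new Thread(() -> {
      try {
        Thread.sleep(50);
        dn.failVolume();
        Thread.sleep(50);
        dn.shutdownIfNoVolumes();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    background.start();
    waitForDeathAfterAllVolumesFail(dn, 2000);
    background.join();
    System.out.println("alive=" + dn.isAlive());
  }
}
```

Splitting the wait into two stages means a timeout message says which stage was not reached, so a run where the volume failure was never detected can be told apart from one where the node failed to shut down.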
was:
Currently, many tests whose names start with {{TestDataNodeVolumeFailure*}}
frequently time out or fail. I found one such failure in a recent Jenkins
build. The stack trace:
{code}
org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures
java.util.concurrent.TimeoutException: Timed out waiting for DN to die
    at org.apache.hadoop.hdfs.DFSTestUtil.waitForDatanodeDeath(DFSTestUtil.java:702)
    at org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures(TestDataNodeVolumeFailureReporting.java:208)
{code}
The related code:
{code}
/*
* Now fail the 2nd volume on the 3rd datanode. All its volumes
* are now failed and so it should report two volume failures
* and that it's no longer up. Only wait for two replicas since
* we'll never get a third.
*/
DataNodeTestUtils.injectDataDirFailure(dn3Vol2);
Path file3 = new Path("/test3");
DFSTestUtil.createFile(fs, file3, 1024, (short)3, 1L);
DFSTestUtil.waitReplication(fs, file3, (short)2);
// The DN should consider itself dead
DFSTestUtil.waitForDatanodeDeath(dns.get(2));
{code}
Here the test waits for all volumes of the datanode to fail and for the
datanode to then become dead, but the wait timed out. We can additionally call
{{DataNodeTestUtils.checkDiskErrorSync}} here to speed up the error check. This
is already done in many similar places after calling
{{DataNodeTestUtils.injectDataDirFailure}} in the test
{{TestDataNodeVolumeFailure}}.
I suppose the other recent {{TestDataNodeVolumeFailure*}} test failures can be
addressed in the same way.
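To illustrate why an explicit check helps, here is a minimal sketch under stated assumptions. {{FakeVolumeChecker}} and its methods are hypothetical stand-ins written for this example, not Hadoop classes, and the real {{DataNodeTestUtils}} signatures are not reproduced here. The assumption being illustrated is that injecting a failure only breaks the directory, and the failure becomes visible to the node once a disk check actually runs.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SyncDiskCheckSketch {

  /** Hypothetical stand-in for a volume plus its checker; NOT a Hadoop class. */
  static class FakeVolumeChecker {
    private final AtomicBoolean diskBroken = new AtomicBoolean(false);
    private final AtomicBoolean failureDetected = new AtomicBoolean(false);

    /** Breaks the disk; by itself this does not update the detected state. */
    void injectFailure() {
      diskBroken.set(true);
    }

    /** Runs the disk check in the caller's thread and records the result. */
    void checkDiskErrorNow() {
      if (diskBroken.get()) {
        failureDetected.set(true);
      }
    }

    boolean isFailureDetected() {
      return failureDetected.get();
    }
  }

  public static void main(String[] args) {
    FakeVolumeChecker checker = new FakeVolumeChecker();
    checker.injectFailure();
    // No check has run yet, so the failure is still unnoticed.
    System.out.println("before check: " + checker.isFailureDetected());
    // Trigger the check explicitly instead of waiting for a later one.
    checker.checkDiskErrorNow();
    System.out.println("after check: " + checker.isFailureDetected());
  }
}
```

With the explicit call, the test no longer depends on when the next background check happens to run, which is the kind of timing dependency that tends to produce intermittent timeouts.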
> Improve the unit tests relevant to DataNode volume failure testing
> ------------------------------------------------------------------
>
> Key: HDFS-11353
> URL: https://issues.apache.org/jira/browse/HDFS-11353
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 3.0.0-alpha2
> Reporter: Yiqun Lin
> Assignee: Yiqun Lin
> Attachments: HDFS-11353.001.patch
>
>