[ https://issues.apache.org/jira/browse/HDFS-11398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yiqun Lin updated HDFS-11398:
-----------------------------
Description:
The test {{TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure}}
still fails intermittently in trunk after HDFS-11316. The stack trace:
{code}
java.util.concurrent.TimeoutException: Timed out waiting for DN to die
	at org.apache.hadoop.hdfs.DFSTestUtil.waitForDatanodeDeath(DFSTestUtil.java:702)
	at org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures(TestDataNodeVolumeFailureReporting.java:218)
{code}
I looked into this and found there is a chance that the value
{{UnderReplicatedBlocksCount}} is no longer > 0 by the time the test checks it.
The following is my analysis:
In {{TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure}}, creating a
file is used to trigger the disk error checking. The related code:
{code}
// Create an initial file and wait until it is fully replicated.
Path file1 = new Path("/test1");
DFSTestUtil.createFile(fs, file1, 1024, (short)3, 1L);
DFSTestUtil.waitReplication(fs, file1, (short)3);

// Fail the first volume on both datanodes
File dn1Vol1 = new File(dataDir, "data"+(2*0+1));
File dn2Vol1 = new File(dataDir, "data"+(2*1+1));
DataNodeTestUtils.injectDataDirFailure(dn1Vol1, dn2Vol1);

// Create a second file to trigger the disk error check, then wait for its
// replication to reach the desired value.
Path file2 = new Path("/test2");
DFSTestUtil.createFile(fs, file2, 1024, (short)3, 1L);
DFSTestUtil.waitReplication(fs, file2, (short)3);
{code}
This leads to one problem: if the cluster is busy, it can take a long time for
the replication of file2 to reach the desired value. During this time, the
under-replicated blocks of file1 can also be re-replicated in the cluster. Once
that happens, the condition {{underReplicatedBlocks > 0}} will never be
satisfied. This can be reproduced in my local environment.
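For context, the check that times out is roughly of the following shape (a
paraphrased sketch rather than the exact test code; how the under-replicated
count is obtained here is an assumption):
{code}
// Rough shape of the wait in the test (assumed, not copied verbatim):
// poll the under-replicated block count and succeed once it becomes positive.
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    // Assumption: the count is read from the NameNode's namesystem metrics.
    return cluster.getNamesystem().getUnderReplicatedBlocks() > 0;
  }
}, 1000, 60000);
{code}
If re-replication of file1 finishes while the test is still waiting on file2,
this supplier never returns true and the wait times out.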
Actually, we can use an easier way, {{DataNodeTestUtils.waitForDiskError}}, to
replace this; it runs faster and is more reliable. A rough sketch of that
change is shown below.
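The following is only a sketch of the intended change; the exact
{{waitForDiskError}} signature and the {{getVolume}} helper used here are
assumptions, not copied from the actual patch:
{code}
// Fail the first volume on both datanodes, as before.
DataNodeTestUtils.injectDataDirFailure(dn1Vol1, dn2Vol1);

// Instead of creating file2 and waiting for its replication, wait directly for
// the DataNode disk checker to mark the injected volumes as failed.
// Assumption: waitForDiskError(DataNode, FsVolumeSpi) and getVolume(DataNode, File)
// are the helpers used; the exact signatures are not taken from the patch.
DataNode dn0 = cluster.getDataNodes().get(0);
DataNode dn1 = cluster.getDataNodes().get(1);
DataNodeTestUtils.waitForDiskError(dn0, DataNodeTestUtils.getVolume(dn0, dn1Vol1));
DataNodeTestUtils.waitForDiskError(dn1, DataNodeTestUtils.getVolume(dn1, dn2Vol1));
{code}
Waiting directly on the disk checker avoids the second {{waitReplication}}, so
the check on {{UnderReplicatedBlocksCount}} can run before the NameNode has had
time to re-replicate file1's blocks.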
> TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure still fails
> intermittently
> ----------------------------------------------------------------------------------------
>
> Key: HDFS-11398
> URL: https://issues.apache.org/jira/browse/HDFS-11398
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.0.0-alpha2
> Reporter: Yiqun Lin
> Assignee: Yiqun Lin
> Attachments: failure.log, HDFS-11398.001.patch,
> HDFS-11398-reproduce.patch
>
>