[ 
https://issues.apache.org/jira/browse/HDFS-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Yiqun updated HDFS-9950:
----------------------------
    Resolution: Duplicate
        Status: Resolved  (was: Patch Available)

> TestDecommissioningStatus fails intermittently in trunk
> -------------------------------------------------------
>
>                 Key: HDFS-9950
>                 URL: https://issues.apache.org/jira/browse/HDFS-9950
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>            Reporter: Lin Yiqun
>            Assignee: Lin Yiqun
>         Attachments: HDFS-9950.001.patch
>
>
> I have often seen the testcase {{TestDecommissioningStatus}} fail 
> intermittently. Looking at the failed test reports, they always show this 
> error:
> {code}
> testDecommissionStatus(org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus)  Time elapsed: 0.462 sec  <<< FAILURE!
> java.lang.AssertionError: Unexpected num under-replicated blocks expected:<3> but was:<4>
> 	at org.junit.Assert.fail(Assert.java:88)
> 	at org.junit.Assert.failNotEquals(Assert.java:743)
> 	at org.junit.Assert.assertEquals(Assert.java:118)
> 	at org.junit.Assert.assertEquals(Assert.java:555)
> 	at org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.checkDecommissionStatus(TestDecommissioningStatus.java:196)
> 	at org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.testDecommissionStatus(TestDecommissioningStatus.java:291)
> {code}
> The reason is that the under-replicated block count checked by 
> {{checkDecommissionStatus}} in 
> {{TestDecommissioningStatus#testDecommissionStatus}} is not the expected value. 
> In this testcase, each datanode should have 4 blocks (2 for decommission.dat, 
> 2 for decommission.dat1). The expected count of 3 on the first node comes from 
> the fact that the last block of the under-construction blockCollection cannot 
> be re-replicated as long as its live replica count is at least the 
> blockManager's minReplication (1 in this case), and before the second datanode 
> is decommissioned, that node still holds one live replica of the 
> under-construction blockCollection's last block. 
> So in the failed case, the first node's under-replicated count of 4 indicates 
> that the live replica count of the under-construction blockCollection's last 
> block was already 0 before the second datanode was decommissioned. I think 
> there are two possibilities that could lead to this: 
> * The second datanode was decommissioned before the first one.
> * Creating the file decommission.dat1 failed, so the second datanode never 
> had this block.
> Reading the code, it already checks the decommission-in-progress nodes here:
> {code}
> if (iteration == 0) {
>   assertEquals(decommissioningNodes.size(), 1);
>   DatanodeDescriptor decommNode = decommissioningNodes.get(0);
>   checkDecommissionStatus(decommNode, 3, 0, 1);
>   checkDFSAdminDecommissionStatus(decommissioningNodes.subList(0, 1),
>       fileSys, admin);
> }
> {code}
> So the second possibility seems more likely to be the reason. In addition, 
> the test does not verify the block count after the file creation finishes. 
> We could add a check here and retry the operation if the block count is not 
> as expected.
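A minimal sketch of the suggested check-and-retry idea (the helper name `waitForBlockCount` and the `IntSupplier` standing in for the actual block-count query are hypothetical, not code from the patch or the test):

```java
import java.util.function.IntSupplier;

/** Hypothetical helper: poll until the observed block count matches the expected one. */
public class BlockCountWaiter {

  /**
   * Polls {@code observed} up to {@code attempts} times and returns true as soon
   * as it reports {@code expected}. A real test helper would sleep between polls.
   */
  public static boolean waitForBlockCount(IntSupplier observed, int expected, int attempts) {
    for (int i = 0; i < attempts; i++) {
      if (observed.getAsInt() == expected) {
        return true; // file is fully created and replicated as expected
      }
      // In the real test: Thread.sleep(...) here before re-checking.
    }
    return false; // caller fails the test with a clear message
  }

  public static void main(String[] args) {
    // Simulated block count that reaches the expected value on the second poll.
    int[] count = {0};
    IntSupplier observed = () -> ++count[0];
    System.out.println(waitForBlockCount(observed, 2, 5)); // prints "true"
  }
}
```

In the actual test, the supplier would query the cluster for the number of blocks of decommission.dat1 on the second datanode before starting decommission, failing fast with a descriptive assertion instead of the later confusing count mismatch.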



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
