[
https://issues.apache.org/jira/browse/HDFS-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213344#comment-15213344
]
Lin Yiqun commented on HDFS-9950:
---------------------------------
Duplicate of HDFS-9599, which contains a thorough analysis. See the latest
patch there.
> TestDecommissioningStatus fails intermittently in trunk
> -------------------------------------------------------
>
> Key: HDFS-9950
> URL: https://issues.apache.org/jira/browse/HDFS-9950
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Lin Yiqun
> Assignee: Lin Yiqun
> Attachments: HDFS-9950.001.patch
>
>
> The test case {{TestDecommissioningStatus}} fails intermittently. Each time
> I checked the failure report, it showed the same error:
> {code}
> testDecommissionStatus(org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus)  Time elapsed: 0.462 sec  <<< FAILURE!
> java.lang.AssertionError: Unexpected num under-replicated blocks expected:<3> but was:<4>
>     at org.junit.Assert.fail(Assert.java:88)
>     at org.junit.Assert.failNotEquals(Assert.java:743)
>     at org.junit.Assert.assertEquals(Assert.java:118)
>     at org.junit.Assert.assertEquals(Assert.java:555)
>     at org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.checkDecommissionStatus(TestDecommissioningStatus.java:196)
>     at org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.testDecommissionStatus(TestDecommissioningStatus.java:291)
> {code}
> The cause is that the under-replicated block count is not correct in the
> method checkDecommissionStatus of
> {{TestDecommissioningStatus#testDecommissionStatus}}.
> In this test case, each datanode should have 4 blocks (2 for
> decommission.dat, 2 for decommission.dat1). The expected count of 3 on the
> first node arises because the lastBlock of an under-construction (uc)
> blockCollection cannot be replicated if its live replica count is more than
> blockManager's minReplication (1 in this case). Before the second datanode
> is decommissioned, it already holds one live replica of the uc
> blockCollection's lastBlock.
> So in the failed case, the first node's under-replicated count of 4
> indicates that the live replica count of the uc blockCollection's lastBlock
> was already 0 before the second datanode was decommissioned. I think two
> possibilities could lead to this:
> * The second datanode was already decommissioned before the first node.
> * Creating the file decommission.dat1 failed, so the second datanode does
> not have this block.
> I read the code, and it already checks the decommission-in-progress nodes here:
> {code}
> if (iteration == 0) {
>   assertEquals(decommissioningNodes.size(), 1);
>   DatanodeDescriptor decommNode = decommissioningNodes.get(0);
>   checkDecommissionStatus(decommNode, 3, 0, 1);
>   checkDFSAdminDecommissionStatus(decommissioningNodes.subList(0, 1),
>       fileSys, admin);
> }
> {code}
> So the second possibility seems the more likely cause. In addition, the
> test does not check the block count after it finishes creating the file.
> We could add such a check there and retry the operation if the block count
> is not as expected.
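
The check-and-retry idea at the end of the description could look roughly
like the sketch below. It is a minimal, self-contained illustration and not
the HDFS-9950 patch: the class BlockCountRetry, the method waitForBlockCount
and its parameters are hypothetical names, and the IntSupplier merely stands
in for whatever call the test would use to read a datanode's block count.

```java
import java.util.function.IntSupplier;

// Hypothetical helper, not part of Hadoop: polls a block count until it
// matches the expected value or the attempts run out.
class BlockCountRetry {

    static boolean waitForBlockCount(IntSupplier blockCount, int expected,
                                     int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (blockCount.getAsInt() == expected) {
                return true; // count is as expected, safe to continue the test
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis); // give the cluster time to catch up
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
            }
        }
        return false; // caller can fail fast or recreate the file
    }

    public static void main(String[] args) {
        // Stand-in for a datanode that reports 3 blocks twice, then 4.
        int[] calls = {0};
        IntSupplier fakeCount = () -> calls[0]++ < 2 ? 3 : 4;

        boolean ok = waitForBlockCount(fakeCount, 4, 5, 10L);
        System.out.println("block count reached expected value: " + ok);
    }
}
```

In a real test the supplier would query the cluster, and a false result
could trigger recreating the file or failing with a clearer message than the
later under-replicated assertion.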
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)