[ https://issues.apache.org/jira/browse/HDFS-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lin Yiqun updated HDFS-9950:
----------------------------
    Resolution: Duplicate
        Status: Resolved  (was: Patch Available)

> TestDecommissioningStatus fails intermittently in trunk
> -------------------------------------------------------
>
>                 Key: HDFS-9950
>                 URL: https://issues.apache.org/jira/browse/HDFS-9950
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>            Reporter: Lin Yiqun
>            Assignee: Lin Yiqun
>         Attachments: HDFS-9950.001.patch
>
>
> The test case {{TestDecommissioningStatus}} fails intermittently. Each time I
> looked at the failure report, it showed the same error:
> {code}
> testDecommissionStatus(org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus)  Time elapsed: 0.462 sec  <<< FAILURE!
> java.lang.AssertionError: Unexpected num under-replicated blocks expected:<3> but was:<4>
> 	at org.junit.Assert.fail(Assert.java:88)
> 	at org.junit.Assert.failNotEquals(Assert.java:743)
> 	at org.junit.Assert.assertEquals(Assert.java:118)
> 	at org.junit.Assert.assertEquals(Assert.java:555)
> 	at org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.checkDecommissionStatus(TestDecommissioningStatus.java:196)
> 	at org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.testDecommissionStatus(TestDecommissioningStatus.java:291)
> {code}
> The failure is that the under-replicated block count asserted in
> {{checkDecommissionStatus}}, called from
> {{TestDecommissioningStatus#testDecommissionStatus}}, does not match.
> In this test case, each datanode should hold 4 blocks (2 for
> decommission.dat, 2 for decommission.dat1). The expected count on the first
> node is 3 because the lastBlock of an under-construction (uc) blockCollection
> is not replicated when its number of live replicas is more than the
> blockManager minReplication (1 in this case). Before the second datanode is
> decommissioned, that node already holds one live replica of the uc
> blockCollection's lastBlock.
> So in the failing case, the first node's under-replicated count becoming 4
> indicates that the number of live replicas of the uc blockCollection's
> lastBlock was already 0 before the second datanode was decommissioned. I see
> two possibilities that could lead to this:
> * The second datanode was decommissioned before the first one.
> * Creating the file decommission.dat1 failed, so the second datanode does
> not hold this block.
> I read the code, and it already checks the decommission-in-progress nodes
> here:
> {code}
>         if (iteration == 0) {
>           assertEquals(decommissioningNodes.size(), 1);
>           DatanodeDescriptor decommNode = decommissioningNodes.get(0);
>           checkDecommissionStatus(decommNode, 3, 0, 1);
>           checkDFSAdminDecommissionStatus(decommissioningNodes.subList(0, 1),
>               fileSys, admin);
>         }
> {code}
> So the second possibility seems the more likely cause. In addition, the test
> does not check the block count after it finishes creating the file. We could
> add such a check there and retry the operation if the block count is not as
> expected.
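
The check-and-retry idea proposed in the description could look roughly like the sketch below. It is only an illustration under stated assumptions, not the content of HDFS-9950.001.patch: `BlockCountRetry` and `waitForBlockCount` are hypothetical names, no Hadoop API is used, and the `IntSupplier` stands in for whatever call the test would use to read the block count on a datanode.

```java
import java.util.function.IntSupplier;

// Hypothetical helper, not part of Hadoop: polls a block count until it
// matches the expected value or the attempts run out.
class BlockCountRetry {

    /**
     * Returns true as soon as countSupplier reports the expected count,
     * false if it never does within maxAttempts polls.
     */
    static boolean waitForBlockCount(IntSupplier countSupplier, int expected,
                                     int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (countSupplier.getAsInt() == expected) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt status and stop retrying.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumption: a fake source that reports 0, then 1, then 2 blocks,
        // standing in for a datanode that receives its blocks late.
        int[] reported = {0, 1, 2};
        int[] calls = {0};
        IntSupplier fakeCount =
            () -> reported[Math.min(calls[0]++, reported.length - 1)];

        boolean ok = waitForBlockCount(fakeCount, 2, 5, 10L);
        System.out.println("block count reached: " + ok);
    }
}
```

In the test, a `false` result after file creation would be the point at which to recreate the file or fail early with a clear message, rather than letting the later under-replicated assertion fail.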