[ https://issues.apache.org/jira/browse/HDFS-9358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006392#comment-15006392 ]
Walter Su commented on HDFS-9358:
---------------------------------
1. We can set the heartbeat interval to 1s to shorten the running time.
2. I think the 001 patch can solve the posted issue. First of all, thanks for that.
However, doesn't the race condition still exist?
{code}
125 cluster.restartDataNode(dnprop);
126 cluster.waitActive();
127
128 // check if excessive replica is detected (transient)
129 initializeTimeout(TIMEOUT);
130 while (countNodes(block.getLocalBlock(), namesystem).excessReplicas() != 2) {
131 checkTimeout("excess replica count not equal to 2");
132 }
{code}
The old code expects 2 excess replicas. The 001 patch expects 1 excess replica.
No matter how many excess replicas we expect, as you can see from the comment, the
state is "transient". What if the state vanishes before line 130? It's unlikely,
I know, but the Jenkins machine is under heavy load, so who knows?
So I think we can disable block invalidation by setting a large delay to make the
state non-transient; then the test is more stable. Check
{{InvalidateBlocks.getInvalidationDelay()}}. That would solve the issue, and the
test logic change in the 001 patch would not be required. What do you think?
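To illustrate the race, here is a minimal sketch with a simulated clock. It is not HDFS code: the class {{TransientStateSketch}}, the method {{observesExcess}} and all parameters are hypothetical, and it assumes the excess-replica state is visible only until the invalidation delay elapses. In the real test the delay would presumably be set through the NameNode configuration, which I have not verified here.

```java
// Hypothetical, simplified model of the polling loop at lines 130-132; not HDFS code.
final class TransientStateSketch {

    /**
     * All times are in milliseconds on a simulated clock that starts at 0.
     * Assumption: the excess-replica state is visible during [0, invalidationDelayMs).
     * The poller starts at pollStartMs and checks every pollIntervalMs until
     * timeoutMs has elapsed. Returns true if it ever sees the excess state.
     */
    static boolean observesExcess(long invalidationDelayMs, long pollStartMs,
                                  long pollIntervalMs, long timeoutMs) {
        if (pollIntervalMs <= 0) {
            throw new IllegalArgumentException("pollIntervalMs must be positive");
        }
        for (long now = pollStartMs; now <= pollStartMs + timeoutMs; now += pollIntervalMs) {
            boolean excessVisible = now < invalidationDelayMs;
            if (excessVisible) {
                return true;  // the while loop in the test would exit here
            }
        }
        return false;         // the test would fail in checkTimeout()
    }

    public static void main(String[] args) {
        // Short delay, poller delayed by a loaded machine: the state is already gone.
        System.out.println(observesExcess(3_000, 5_000, 100, 20_000));      // false
        // Large delay: the state outlives any realistic scheduling hiccup.
        System.out.println(observesExcess(3_600_000, 5_000, 100, 20_000));  // true
    }
}
```

With a short delay the result depends on when the poller gets scheduled; with a large delay it does not, which is the point of making the state non-transient.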
> TestNodeCount#testNodeCount timed out
> -------------------------------------
>
> Key: HDFS-9358
> URL: https://issues.apache.org/jira/browse/HDFS-9358
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Wei-Chiu Chuang
> Assignee: Masatake Iwasaki
> Attachments: HDFS-9358.001.patch
>
>
> I have seen this test failure occur a few times in trunk:
> Error Message
> Timeout: excess replica count not equal to 2 for block blk_1073741825_1001
> after 20000 msec. Last counts: live = 2, excess = 0, corrupt = 0
> Stacktrace
> java.util.concurrent.TimeoutException: Timeout: excess replica count not
> equal to 2 for block blk_1073741825_1001 after 20000 msec. Last counts: live
> = 2, excess = 0, corrupt = 0
> at org.apache.hadoop.hdfs.server.blockmanagement.TestNodeCount.checkTimeout(TestNodeCount.java:152)
> at org.apache.hadoop.hdfs.server.blockmanagement.TestNodeCount.checkTimeout(TestNodeCount.java:146)
> at org.apache.hadoop.hdfs.server.blockmanagement.TestNodeCount.__CLR4_0_39bdgm666uf(TestNodeCount.java:130)
> at org.apache.hadoop.hdfs.server.blockmanagement.TestNodeCount.testNodeCount(TestNodeCount.java:54)