[ 
https://issues.apache.org/jira/browse/HDFS-9358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006392#comment-15006392
 ] 

Walter Su commented on HDFS-9358:
---------------------------------

1. We can set the heartbeat interval to 1s to shorten the running time.

2. I think the 001 patch can solve the posted issue. First of all, thanks for that. 
However, I think the race condition may still exist.
{code}
125       cluster.restartDataNode(dnprop);
126       cluster.waitActive();
127 
128       // check if excessive replica is detected (transient)
129       initializeTimeout(TIMEOUT);
130       while (countNodes(block.getLocalBlock(), namesystem).excessReplicas() != 2) {
131         checkTimeout("excess replica count not equal to 2");
132       }
{code}

The old code expects 2 excessReplicas. The 001 patch expects 1 excessReplicas. 
No matter how many excessReplicas we expect, the state is "transient", as the 
comment says. What if the state vanishes before line 130? It's unlikely, I 
know, but the Jenkins machine is under heavy load, so who knows?
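
To illustrate the concern, here is a toy model in plain Java. It uses no Hadoop classes, and the class name, method name, and all timings are made up for illustration. It only shows that a polling loop can miss a state that is visible for a short window if the first poll happens late.

```java
/** Toy model of the polling race; hypothetical names, no Hadoop classes involved. */
class TransientStatePoll {
    /**
     * Returns true if a poller sampling at firstPollMs, firstPollMs + pollEveryMs, ...
     * (up to and including timeoutMs) ever observes a state that is visible only
     * during [stateStartMs, stateEndMs).
     */
    static boolean pollerSeesState(long stateStartMs, long stateEndMs,
                                   long firstPollMs, long pollEveryMs, long timeoutMs) {
        for (long t = firstPollMs; t <= timeoutMs; t += pollEveryMs) {
            if (t >= stateStartMs && t < stateEndMs) {
                return true; // the while loop in the test would exit here
            }
        }
        return false; // corresponds to checkTimeout(...) throwing in the test
    }

    public static void main(String[] args) {
        // State visible from 100 ms to 150 ms, first poll at 120 ms: observed.
        System.out.println(pollerSeesState(100, 150, 120, 10, 20_000));
        // Same window, but a loaded machine delays the first poll to 200 ms: missed.
        System.out.println(pollerSeesState(100, 150, 200, 10, 20_000));
        // State never goes away (invalidation delayed): observed even when polling late.
        System.out.println(pollerSeesState(100, Long.MAX_VALUE, 200, 10, 20_000));
    }
}
```

The second call is the case I am worried about: the excess state has already vanished by the time the loop starts polling, so the test can only time out.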

So I think we can disable block invalidation by setting a large delay, which 
makes the state non-transient and the test more stable. See 
{{InvalidateBlocks.getInvalidationDelay()}}. That would solve the issue, and the 
test logic changes in the 001 patch would not be required. What do you think?
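
A rough sketch of the idea, again in plain Java without Hadoop on the classpath. The two key names are what I recall from DFSConfigKeys and should be verified; the delay arithmetic is my assumption about how {{getInvalidationDelay()}} behaves (configured period minus time since startup), not a copy of the real code. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; a plain Map stands in for the Hadoop Configuration object. */
class InvalidationDelaySketch {
    // Key names as recalled from DFSConfigKeys; please verify before use.
    static final String HEARTBEAT_INTERVAL_KEY = "dfs.heartbeat.interval";
    static final String DELETION_DELAY_KEY = "dfs.namenode.startup.delay.block.deletion.sec";

    /** Settings the test could apply before starting the mini cluster. */
    static Map<String, String> testSettings() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(HEARTBEAT_INTERVAL_KEY, "1");   // 1 s heartbeat to shorten the run
        conf.put(DELETION_DELAY_KEY, "3600");    // large delay: no invalidation during the test
        return conf;
    }

    /** Assumed shape of the remaining delay: configured period minus elapsed time. */
    static long remainingDelayMs(long pendingPeriodMs, long startupTimeMs, long nowMs) {
        return pendingPeriodMs - (nowMs - startupTimeMs);
    }

    /** While the remaining delay is positive, excess replicas would not be invalidated. */
    static boolean invalidationSuppressed(long pendingPeriodMs, long startupTimeMs, long nowMs) {
        return remainingDelayMs(pendingPeriodMs, startupTimeMs, nowMs) > 0;
    }

    public static void main(String[] args) {
        long periodMs = Long.parseLong(testSettings().get(DELETION_DELAY_KEY)) * 1000L;
        // 20 s after startup (the test timeout), invalidation is still suppressed.
        System.out.println(invalidationSuppressed(periodMs, 0L, 20_000L));
        // With no delay configured, nothing holds invalidation back.
        System.out.println(invalidationSuppressed(0L, 0L, 20_000L));
    }
}
```

With a delay much larger than the 20000 msec test timeout, the excess replica state should stay visible for the whole polling loop.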

> TestNodeCount#testNodeCount timed out
> -------------------------------------
>
>                 Key: HDFS-9358
>                 URL: https://issues.apache.org/jira/browse/HDFS-9358
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Wei-Chiu Chuang
>            Assignee: Masatake Iwasaki
>         Attachments: HDFS-9358.001.patch
>
>
> I have seen this test failure occur a few times in trunk:
> Error Message
> Timeout: excess replica count not equal to 2 for block blk_1073741825_1001 
> after 20000 msec.  Last counts: live = 2, excess = 0, corrupt = 0
> Stacktrace
> java.util.concurrent.TimeoutException: Timeout: excess replica count not 
> equal to 2 for block blk_1073741825_1001 after 20000 msec.  Last counts: live 
> = 2, excess = 0, corrupt = 0
>       at org.apache.hadoop.hdfs.server.blockmanagement.TestNodeCount.checkTimeout(TestNodeCount.java:152)
>       at org.apache.hadoop.hdfs.server.blockmanagement.TestNodeCount.checkTimeout(TestNodeCount.java:146)
>       at org.apache.hadoop.hdfs.server.blockmanagement.TestNodeCount.__CLR4_0_39bdgm666uf(TestNodeCount.java:130)
>       at org.apache.hadoop.hdfs.server.blockmanagement.TestNodeCount.testNodeCount(TestNodeCount.java:54)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
