[ https://issues.apache.org/jira/browse/HDFS-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051183#comment-14051183 ]

Liang Xie commented on HDFS-6624:
---------------------------------

It appears to be related to the chooseExcessReplicates behavior.
From the log:
{code}
2014-07-02 21:23:30,083 INFO  balancer.Balancer (Balancer.java:logNodes(969)) - 0 over-utilized: []
2014-07-02 21:23:30,083 INFO  balancer.Balancer (Balancer.java:logNodes(969)) - 1 above-average: [Source[127.0.0.1:55922, utilization=28.0]]
2014-07-02 21:23:30,083 INFO  balancer.Balancer (Balancer.java:logNodes(969)) - 1 below-average: [BalancerDatanode[127.0.0.1:57889, utilization=16.0]]
2014-07-02 21:23:30,083 INFO  balancer.Balancer (Balancer.java:logNodes(969)) - 0 underutilized: []
The cluster is balanced. Exiting...
{code}
and 
{code}
2014-07-02 21:23:35,413 INFO  hdfs.TestBalancer (TestBalancer.java:runBalancer(381)) - Rebalancing with default ctor.    <--- waitForBalancer is called immediately after this
{code}
we can see that the cluster should already have been considered balanced before this point: avgUtilization = 0.2, so |0.28 - 0.2| < BALANCE_ALLOWED_VARIANCE (which is 0.11) and |0.16 - 0.2| < BALANCE_ALLOWED_VARIANCE. But once it goes into waitForBalancer, even after retrying the DN report many times, the newly added DN still has a small nodeUtilization of 0.08, and since |0.08 - 0.2| > 0.11, the check keeps failing until the retries time out and the test fails.
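 
For illustration, here is a minimal sketch (not the actual TestBalancer code; only the 0.2 average, the 0.11 variance and the logged utilizations come from the above) of the per-node check waitForBalancer effectively applies, showing why the newly added DN at 0.08 never passes:
{code}
// Hedged sketch of the per-node variance check described above; the class and
// values here are illustrative, not the real TestBalancer#waitForBalancer.
public class VarianceCheckSketch {
  // Value quoted above for TestBalancer's BALANCE_ALLOWED_VARIANCE.
  static final double BALANCE_ALLOWED_VARIANCE = 0.11;

  public static void main(String[] args) {
    double avgUtilization = 0.2;                    // expected average utilization
    double[] nodeUtilizations = {0.28, 0.16, 0.08}; // 0.08 is the newly added DN

    for (double u : nodeUtilizations) {
      boolean withinVariance = Math.abs(u - avgUtilization) < BALANCE_ALLOWED_VARIANCE;
      System.out.printf("utilization=%.2f -> within variance: %b%n", u, withinVariance);
    }
    // 0.28 and 0.16 pass, but |0.08 - 0.2| = 0.12 > 0.11, so waitForBalancer
    // keeps polling the DN report until the 40000 msec timeout expires.
  }
}
{code}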

From these logs we can see that the node removed a couple of blocks after balancing:
{code}
2014-07-02 21:23:30,136 INFO  BlockStateChange (BlockManager.java:addToInvalidates(1074)) - BLOCK* addToInvalidates: blk_1073741840_1016 127.0.0.1:55922 127.0.0.1:57889 
2014-07-02 21:23:30,136 INFO  BlockStateChange (BlockManager.java:addToInvalidates(1074)) - BLOCK* addToInvalidates: blk_1073741841_1017 127.0.0.1:57889 
2014-07-02 21:23:30,136 INFO  BlockStateChange (BlockManager.java:addToInvalidates(1074)) - BLOCK* addToInvalidates: blk_1073741842_1018 127.0.0.1:57889 
2014-07-02 21:23:34,305 INFO  BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3262)) - BLOCK* BlockManager: ask 127.0.0.1:57889 to delete [blk_1073741840_1016, blk_1073741841_1017, blk_1073741842_1018]
{code}

So the root cause is that after balancing, the added block operations trigger the excess-replica check, the resulting removals change the used-space statistics, and that fails the test.
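 
To make the effect concrete, here is a rough back-of-the-envelope sketch (the per-DN capacity and the space freed by the three invalidated blocks are assumptions; only the 0.16 and 0.08 utilizations come from the logs) of how those deletions push 127.0.0.1:57889 out of the allowed variance:
{code}
// Illustrative arithmetic only; the capacity and freed-space numbers below are
// assumed, not taken from the test, and the class is hypothetical.
public class DeletionEffectSketch {
  public static void main(String[] args) {
    double capacity = 100.0;           // hypothetical capacity units for the DN
    double usedBeforeDeletion = 16.0;  // utilization 0.16, as logged by the balancer
    double freedByExcessCheck = 8.0;   // assumed space freed by the three invalidated blocks

    double utilizationAfter = (usedBeforeDeletion - freedByExcessCheck) / capacity;
    System.out.println("utilization after deletions = " + utilizationAfter); // 0.08

    double avgUtilization = 0.2;
    double allowedVariance = 0.11;
    boolean outsideVariance = Math.abs(utilizationAfter - avgUtilization) > allowedVariance;
    System.out.println("outside allowed variance: " + outsideVariance); // true -> waitForBalancer times out
  }
}
{code}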

Is there any quick setting for tests that could bypass that check? :)

> TestBlockTokenWithDFS#testEnd2End fails sometimes
> -------------------------------------------------
>
>                 Key: HDFS-6624
>                 URL: https://issues.apache.org/jira/browse/HDFS-6624
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Andrew Wang
>         Attachments: PreCommit-HDFS-Build #7274 test - testEnd2End [Jenkins].html
>
>
> On a recent test-patch.sh run, saw this error which did not repro locally:
> {noformat}
> Error Message
> Rebalancing expected avg utilization to become 0.2, but on datanode 127.0.0.1:57889 it remains at 0.08 after more than 40000 msec.
> Stacktrace
> java.util.concurrent.TimeoutException: Rebalancing expected avg utilization to become 0.2, but on datanode 127.0.0.1:57889 it remains at 0.08 after more than 40000 msec.
>       at org.apache.hadoop.hdfs.server.balancer.TestBalancer.waitForBalancer(TestBalancer.java:284)
>       at org.apache.hadoop.hdfs.server.balancer.TestBalancer.runBalancer(TestBalancer.java:382)
>       at org.apache.hadoop.hdfs.server.balancer.TestBalancer.doTest(TestBalancer.java:359)
>       at org.apache.hadoop.hdfs.server.balancer.TestBalancer.oneNodeTest(TestBalancer.java:403)
>       at org.apache.hadoop.hdfs.server.balancer.TestBalancer.integrationTest(TestBalancer.java:416)
>       at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFS.testEnd2End(TestBlockTokenWithDFS.java:588)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
