[
https://issues.apache.org/jira/browse/HDFS-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051183#comment-14051183
]
Liang Xie commented on HDFS-6624:
---------------------------------
It should be related to the chooseExcessReplicates behavior.
From the log:
{code}
2014-07-02 21:23:30,083 INFO balancer.Balancer (Balancer.java:logNodes(969)) -
0 over-utilized: []
2014-07-02 21:23:30,083 INFO balancer.Balancer (Balancer.java:logNodes(969)) -
1 above-average: [Source[127.0.0.1:55922, utilization=28.0]]
2014-07-02 21:23:30,083 INFO balancer.Balancer (Balancer.java:logNodes(969)) -
1 below-average: [BalancerDatanode[127.0.0.1:57889, utilization=16.0]]
2014-07-02 21:23:30,083 INFO balancer.Balancer (Balancer.java:logNodes(969)) -
0 underutilized: []
The cluster is balanced. Exiting...
{code}
and
{code}
2014-07-02 21:23:35,413 INFO hdfs.TestBalancer
(TestBalancer.java:runBalancer(381)) - Rebalancing with default ctor. <--- waitForBalancer will be called immediately after this.
{code}
we can tell that the cluster was already balanced before waitForBalancer ran:
avgUtilization = 0.2, so |0.28 - 0.2| < BALANCE_ALLOWED_VARIANCE (which is 0.11)
and |0.16 - 0.2| < BALANCE_ALLOWED_VARIANCE.
But once the test enters waitForBalancer, even after fetching a fresh DN report
many times, the newly added DN has a small nodeUtilization of 0.08; since
|0.08 - 0.2| > 0.11, the wait keeps retrying until it times out and the test fails...
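For reference, a minimal sketch of that per-datanode check (illustrative only, not the actual TestBalancer code; only the 0.2 average and the 0.11 BALANCE_ALLOWED_VARIANCE are taken from the numbers above):
{code}
// Illustrative sketch of the condition waitForBalancer applies to every datanode.
public class BalanceVarianceSketch {
  static final double AVG_UTILIZATION = 0.2;           // cluster average from the log
  static final double BALANCE_ALLOWED_VARIANCE = 0.11; // allowed deviation in the test

  static boolean withinVariance(double nodeUtilization) {
    return Math.abs(nodeUtilization - AVG_UTILIZATION) < BALANCE_ALLOWED_VARIANCE;
  }

  public static void main(String[] args) {
    System.out.println(withinVariance(0.28)); // |0.28 - 0.2| = 0.08 < 0.11 -> true
    System.out.println(withinVariance(0.16)); // |0.16 - 0.2| = 0.04 < 0.11 -> true
    // Both nodes pass, which is why the balancer prints "The cluster is balanced. Exiting..."
  }
}
{code}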
From these logs we can see that the newly added node removed a couple of blocks after balancing:
{code}
2014-07-02 21:23:30,136 INFO BlockStateChange
(BlockManager.java:addToInvalidates(1074)) - BLOCK* addToInvalidates:
blk_1073741840_1016 127.0.0.1:55922 127.0.0.1:57889
2014-07-02 21:23:30,136 INFO BlockStateChange
(BlockManager.java:addToInvalidates(1074)) - BLOCK* addToInvalidates:
blk_1073741841_1017 127.0.0.1:57889
2014-07-02 21:23:30,136 INFO BlockStateChange
(BlockManager.java:addToInvalidates(1074)) - BLOCK* addToInvalidates:
blk_1073741842_1018 127.0.0.1:57889
2014-07-02 21:23:34,305 INFO BlockStateChange
(BlockManager.java:invalidateWorkForOneNode(3262)) - BLOCK* BlockManager: ask
127.0.0.1:57889 to delete [blk_1073741840_1016, blk_1073741841_1017,
blk_1073741842_1018]
{code}
So the root cause is that after balancing, the block additions made by the
balancer trigger the excess-replica check, and the resulting replica removals
change the used-space statistics, which makes the test fail.
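To make the effect concrete (the capacity and used-space figures below are made-up round values; only the 0.16 and 0.08 utilizations come from the logs above):
{code}
// Illustrative arithmetic only, not test code.
public class ExcessReplicaEffectSketch {
  public static void main(String[] args) {
    long capacity = 100;            // pretend capacity of the new DN, 127.0.0.1:57889
    long usedAfterBalancing = 16;   // utilization 0.16 right after balancing
    long removedByExcessCheck = 8;  // space freed by the invalidated replicas
    double utilization = (double) (usedAfterBalancing - removedByExcessCheck) / capacity;
    System.out.println(utilization);                  // 0.08, as reported by the failing test
    System.out.println(Math.abs(utilization - 0.2));  // 0.12 > 0.11, so waitForBalancer
                                                      // can never see this DN as balanced
  }
}
{code}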
Is there any quick setting for testing that could bypass that check? :)
> TestBlockTokenWithDFS#testEnd2End fails sometimes
> -------------------------------------------------
>
> Key: HDFS-6624
> URL: https://issues.apache.org/jira/browse/HDFS-6624
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Andrew Wang
> Attachments: PreCommit-HDFS-Build #7274 test - testEnd2End
> [Jenkins].html
>
>
> On a recent test-patch.sh run, saw this error which did not repro locally:
> {noformat}
> Error Message
> Rebalancing expected avg utilization to become 0.2, but on datanode
> 127.0.0.1:57889 it remains at 0.08 after more than 40000 msec.
> Stacktrace
> java.util.concurrent.TimeoutException: Rebalancing expected avg utilization
> to become 0.2, but on datanode 127.0.0.1:57889 it remains at 0.08 after more
> than 40000 msec.
> at
> org.apache.hadoop.hdfs.server.balancer.TestBalancer.waitForBalancer(TestBalancer.java:284)
> at
> org.apache.hadoop.hdfs.server.balancer.TestBalancer.runBalancer(TestBalancer.java:382)
> at
> org.apache.hadoop.hdfs.server.balancer.TestBalancer.doTest(TestBalancer.java:359)
> at
> org.apache.hadoop.hdfs.server.balancer.TestBalancer.oneNodeTest(TestBalancer.java:403)
> at
> org.apache.hadoop.hdfs.server.balancer.TestBalancer.integrationTest(TestBalancer.java:416)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFS.testEnd2End(TestBlockTokenWithDFS.java:588)
> {noformat}