[
https://issues.apache.org/jira/browse/HDFS-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989495#comment-13989495
]
Binglin Chang commented on HDFS-6250:
-------------------------------------
Thanks for the analysis and patch [~airbots]. The fix makes sense; here are
some additional concerns:
bq. HDFS creates a /system/balancer.id file (30B) to track the balancer
It looks like the file contains the hostname, whose size is not fixed. I see
you increased the block size and capacity to minimize the impact of the file,
but the risk still seems to be there.
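To illustrate why the marker file's size is not a constant, here is a minimal sketch. It assumes the file content is just the balancer host's name written one byte per character; the class name, method name, and both hostnames are hypothetical and used only for illustration.

```java
import java.nio.charset.StandardCharsets;

class MarkerFileSize {
    // Hypothetical helper: size of a marker file whose only content is the
    // hostname, assuming one byte per character (ASCII hostnames).
    static int sizeInBytes(String hostname) {
        return hostname.getBytes(StandardCharsets.US_ASCII).length;
    }

    public static void main(String[] args) {
        // Illustrative hostnames, not taken from the test logs.
        String[] hosts = {"localhost", "a.example.org"};
        for (String host : hosts) {
            System.out.println(host + " -> " + sizeInBytes(host) + " bytes");
        }
    }
}
```

Under that assumption, the same test run on two build machines could add a different number of bytes to the cluster, so any assertion on exact used space would be sensitive to where the test runs.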
testBalancerWithRackLocality tests that the balancer does not perform
cross-rack block movements in the test scenario. Here are the related balancer
logs:
{code}
2014-04-15 18:29:48,649 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
0 over-utilized: []
2014-04-15 18:29:48,650 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
2 above-average: [Source[127.0.0.1:54333, utilization=30.0],
Source[127.0.0.1:46174, utilization=30.0]]
2014-04-15 18:29:48,650 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
0 below-average: []
2014-04-15 18:29:48,650 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
1 underutilized: [BalancerDatanode[127.0.0.1:48293, utilization=0.0]]
2014-04-15 18:29:51,722 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
0 over-utilized: []
2014-04-15 18:29:51,722 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
2 above-average: [Source[127.0.0.1:54333, utilization=30.166666666666668],
Source[127.0.0.1:46174, utilization=30.333333333333332]]
2014-04-15 18:29:51,722 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
0 below-average: []
2014-04-15 18:29:51,722 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
1 underutilized: [BalancerDatanode[127.0.0.1:48293,
utilization=1.8333333333333333]]
2014-04-15 18:29:54,820 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
0 over-utilized: []
2014-04-15 18:29:54,820 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
2 above-average: [Source[127.0.0.1:54333, utilization=28.5],
Source[127.0.0.1:46174, utilization=30.333333333333332]]
2014-04-15 18:29:54,820 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
0 below-average: []
2014-04-15 18:29:54,820 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
1 underutilized: [BalancerDatanode[127.0.0.1:48293, utilization=5.0]]
2014-04-15 18:29:57,898 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
0 over-utilized: []
2014-04-15 18:29:57,898 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
2 above-average: [Source[127.0.0.1:46174, utilization=30.333333333333332],
Source[127.0.0.1:54333, utilization=25.333333333333332]]
2014-04-15 18:29:57,899 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
0 below-average: []
2014-04-15 18:29:57,899 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
1 underutilized: [BalancerDatanode[127.0.0.1:48293,
utilization=7.666666666666667]]
2014-04-15 18:30:00,933 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
0 over-utilized: []
2014-04-15 18:30:00,933 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
2 above-average: [Source[127.0.0.1:54333, utilization=22.666666666666668],
Source[127.0.0.1:46174, utilization=30.333333333333332]]
2014-04-15 18:30:00,933 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
0 below-average: []
2014-04-15 18:30:00,933 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
1 underutilized: [BalancerDatanode[127.0.0.1:48293, utilization=10.5]]
2014-04-15 18:30:03,989 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
0 over-utilized: []
2014-04-15 18:30:03,989 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
1 above-average: [Source[127.0.0.1:46174, utilization=30.333333333333332]]
2014-04-15 18:30:03,989 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
2 below-average: [BalancerDatanode[127.0.0.1:54333,
utilization=19.833333333333332], BalancerDatanode[127.0.0.1:48293,
utilization=12.0]]
2014-04-15 18:30:03,989 INFO balancer.Balancer (Balancer.java:logNodes(960)) -
0 underutilized: []
{code}
I guess the test intended to keep /rack0/NODEGROUP0/dn above-average (<=30%)
but not over-utilized (>30%, given an average utilization of 20%), so that
blocks on rack0 never move to rack1. However, an extra balancer.id file may
break that assumption. So the problem is inherent in the test, not just a race
condition or a timeout. We may need to change the test (e.g. file size,
utilization rate, validation method) to prevent those corner cases.
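The boundary described above can be made concrete with a small sketch. It is not the Balancer's actual code: the class and method names are hypothetical, the threshold of 10 percentage points is inferred from the 20% average and 30% limit mentioned above, and the cluster average is taken as a plain mean, which assumes all datanodes have equal capacity.

```java
import java.util.Arrays;

class UtilizationClasses {
    // Hypothetical helper: returns the same four labels that appear in the
    // balancer log, based on distance from the cluster average.
    static String classify(double utilization, double avg, double threshold) {
        if (utilization > avg + threshold) {
            return "over-utilized";
        }
        if (utilization > avg) {
            return "above-average";
        }
        if (utilization >= avg - threshold) {
            return "below-average";
        }
        return "underutilized";
    }

    // Assumption: equal-capacity datanodes, so the average is a simple mean.
    static double average(double... utilizations) {
        return Arrays.stream(utilizations).average().orElse(0.0);
    }

    public static void main(String[] args) {
        double threshold = 10.0; // assumed, see the note above
        double[] firstSnapshot = {30.0, 30.0, 0.0}; // first iteration in the log
        double avg = average(firstSnapshot);
        for (double u : firstSnapshot) {
            System.out.println(u + " -> " + classify(u, avg, threshold));
        }
    }
}
```

With the first snapshot from the log, the two nodes at 30.0 sit exactly on the upper limit and are classified as above-average, while the empty node is underutilized. Because the rack0 nodes sit right on that limit, a few extra bytes on one of them can change how it is classified, which is the corner case the test does not account for.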
> TestBalancerWithNodeGroup.testBalancerWithRackLocality fails
> ------------------------------------------------------------
>
> Key: HDFS-6250
> URL: https://issues.apache.org/jira/browse/HDFS-6250
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Chen He
> Attachments: HDFS-6250-v2.patch, HDFS-6250.patch, test_log.txt
>
>
> It was seen in https://builds.apache.org/job/PreCommit-HDFS-Build/6669/
> {panel}
> java.lang.AssertionError: expected:<1800> but was:<1810>
> at org.junit.Assert.fail(Assert.java:93)
> at org.junit.Assert.failNotEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:128)
> at org.junit.Assert.assertEquals(Assert.java:147)
> at org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup.testBalancerWithRackLocality(TestBalancerWithNodeGroup.java:253)
> {panel}