[ 
https://issues.apache.org/jira/browse/HDFS-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989495#comment-13989495
 ] 

Binglin Chang commented on HDFS-6250:
-------------------------------------

Thanks for the analysis and patch [~airbots]. The fix makes sense; here are 
some additional concerns:

bq. HDFS creates a /system/balancer.id file (30B) to track the balancer
It looks like the file contains the hostname, whose size is not fixed. I see you 
increased the block size and capacity to minimize the impact of the file, but 
the risk still seems to be there.

testBalancerWithRackLocality checks that the balancer does not perform cross-rack 
block movements in the test scenario. Here are the related balancer logs:

{code}
2014-04-15 18:29:48,649 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
0 over-utilized: []
2014-04-15 18:29:48,650 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
2 above-average: [Source[127.0.0.1:54333, utilization=30.0], 
Source[127.0.0.1:46174, utilization=30.0]]
2014-04-15 18:29:48,650 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
0 below-average: []
2014-04-15 18:29:48,650 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
1 underutilized: [BalancerDatanode[127.0.0.1:48293, utilization=0.0]]

2014-04-15 18:29:51,722 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
0 over-utilized: []
2014-04-15 18:29:51,722 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
2 above-average: [Source[127.0.0.1:54333, utilization=30.166666666666668], 
Source[127.0.0.1:46174, utilization=30.333333333333332]]
2014-04-15 18:29:51,722 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
0 below-average: []
2014-04-15 18:29:51,722 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
1 underutilized: [BalancerDatanode[127.0.0.1:48293, 
utilization=1.8333333333333333]]

2014-04-15 18:29:54,820 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
0 over-utilized: []
2014-04-15 18:29:54,820 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
2 above-average: [Source[127.0.0.1:54333, utilization=28.5], 
Source[127.0.0.1:46174, utilization=30.333333333333332]]
2014-04-15 18:29:54,820 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
0 below-average: []
2014-04-15 18:29:54,820 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
1 underutilized: [BalancerDatanode[127.0.0.1:48293, utilization=5.0]]

2014-04-15 18:29:57,898 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
0 over-utilized: []
2014-04-15 18:29:57,898 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
2 above-average: [Source[127.0.0.1:46174, utilization=30.333333333333332], 
Source[127.0.0.1:54333, utilization=25.333333333333332]]
2014-04-15 18:29:57,899 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
0 below-average: []
2014-04-15 18:29:57,899 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
1 underutilized: [BalancerDatanode[127.0.0.1:48293, 
utilization=7.666666666666667]]

2014-04-15 18:30:00,933 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
0 over-utilized: []
2014-04-15 18:30:00,933 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
2 above-average: [Source[127.0.0.1:54333, utilization=22.666666666666668], 
Source[127.0.0.1:46174, utilization=30.333333333333332]]
2014-04-15 18:30:00,933 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
0 below-average: []
2014-04-15 18:30:00,933 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
1 underutilized: [BalancerDatanode[127.0.0.1:48293, utilization=10.5]]

2014-04-15 18:30:03,989 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
0 over-utilized: []
2014-04-15 18:30:03,989 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
1 above-average: [Source[127.0.0.1:46174, utilization=30.333333333333332]]
2014-04-15 18:30:03,989 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
2 below-average: [BalancerDatanode[127.0.0.1:54333, 
utilization=19.833333333333332], BalancerDatanode[127.0.0.1:48293, 
utilization=12.0]]
2014-04-15 18:30:03,989 INFO  balancer.Balancer (Balancer.java:logNodes(960)) - 
0 underutilized: []
{code}

I guess the test intends to keep /rack0/NODEGROUP0/dn above-average (<=30%) but 
not over-utilized (>30%, given an average utilization of 20%), so that blocks on 
rack0 never move to rack1. The extra balancer.id file may break that assumption. 
So the problem is inherent in the test, not just a race condition or a timeout. 
We may need to change the test (e.g. file size, utilization rate, validation 
method) to prevent these corner cases.
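
To make the boundary concrete, here is a minimal sketch of the classification, not the actual Balancer code. It assumes three datanodes of equal capacity and a threshold of 10 percentage points (implied by the 20% average and 30% limit above). The class name, method names, and the 30.5% figure are hypothetical and only for illustration.

```java
public class UtilizationSketch {
    public enum Level { OVER_UTILIZED, ABOVE_AVERAGE, BELOW_AVERAGE, UNDER_UTILIZED }

    // Average utilization, assuming all datanodes have the same capacity.
    public static double average(double[] utilizations) {
        double sum = 0;
        for (double u : utilizations) {
            sum += u;
        }
        return sum / utilizations.length;
    }

    // Simplified classification: a node exactly at avg + threshold is still
    // only above-average; anything beyond that is over-utilized.
    public static Level classify(double utilization, double avg, double threshold) {
        if (utilization > avg + threshold) {
            return Level.OVER_UTILIZED;
        }
        if (utilization > avg) {
            return Level.ABOVE_AVERAGE;
        }
        if (utilization >= avg - threshold) {
            return Level.BELOW_AVERAGE;
        }
        return Level.UNDER_UTILIZED;
    }

    public static void main(String[] args) {
        double threshold = 10.0; // assumed threshold, in percentage points

        // Utilizations from the first log snapshot above.
        double[] intended = {30.0, 30.0, 0.0};
        // Hypothetical: a small extra file pushes one rack0 node past 30%.
        double[] skewed = {30.5, 30.0, 0.0};

        for (double[] nodes : new double[][] {intended, skewed}) {
            double avg = average(nodes);
            for (double u : nodes) {
                System.out.println(u + " (avg " + avg + ") -> " + classify(u, avg, threshold));
            }
        }
    }
}
```

With the first snapshot, both rack0 nodes sit exactly on the 30% boundary and are classified as above-average. In the hypothetical skewed case, the 30.5% node crosses avg + threshold and becomes over-utilized, which is the situation the test does not seem to account for.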


> TestBalancerWithNodeGroup.testBalancerWithRackLocality fails
> ------------------------------------------------------------
>
>                 Key: HDFS-6250
>                 URL: https://issues.apache.org/jira/browse/HDFS-6250
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Chen He
>         Attachments: HDFS-6250-v2.patch, HDFS-6250.patch, test_log.txt
>
>
> It was seen in https://builds.apache.org/job/PreCommit-HDFS-Build/6669/
> {panel}
> java.lang.AssertionError: expected:<1800> but was:<1810>
>       at org.junit.Assert.fail(Assert.java:93)
>       at org.junit.Assert.failNotEquals(Assert.java:647)
>       at org.junit.Assert.assertEquals(Assert.java:128)
>       at org.junit.Assert.assertEquals(Assert.java:147)
>       at org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup
>  .testBalancerWithRackLocality(TestBalancerWithNodeGroup.java:253)
> {panel}



