[ 
https://issues.apache.org/jira/browse/HBASE-27054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539754#comment-17539754
 ] 

David Manning commented on HBASE-27054:
---------------------------------------

With a lower timeout, like 60 seconds, or on slower hardware, we could get 
fewer iterations. I suppose in that sense we may just get unlucky and fail to 
reach a fully balanced state under the current configuration.

50,000 regions have to move, and the {{RegionReplicaCandidateGenerator}}, which 
is chosen roughly 25% of the time, is doing most of that work. There are 
likely some missteps. Conservatively, it seems like we may need 200,000 calls 
to that generator to guarantee the work gets done. At a 25% selection rate, 
that means 800,000 iterations. Running locally, if I had set a timeout of 60 
seconds, I'd see 1.3 million iterations. That margin is small enough that we 
may see the occasional problem. The tests should ensure that even on slow 
hardware, with unlucky random choices, we are still virtually guaranteed 
success. We may not be doing that here. But a 3-minute timeout should make 
success much more likely. So I'm interested in the test message that says it 
ran 77 seconds, even though I'm sure the test could be improved to be more 
deterministic.
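
As a rough sanity check on the numbers above, here is a minimal sketch of the 
arithmetic in Java. The class and method names ({{IterationBudget}}, 
{{requiredIterations}}, {{headroom}}) are hypothetical and not part of HBase. 
The 3-minute figure assumes iteration throughput scales linearly with the 
timeout, which is an assumption rather than a measurement.

```java
// Hypothetical helper, not part of HBase: back-of-the-envelope iteration budget.
final class IterationBudget {
    private IterationBudget() {}

    /**
     * Total balancer iterations needed so that a generator selected with the
     * given probability is invoked generatorCalls times on average.
     */
    static long requiredIterations(long generatorCalls, double selectionProbability) {
        if (selectionProbability <= 0.0 || selectionProbability > 1.0) {
            throw new IllegalArgumentException("selectionProbability must be in (0, 1]");
        }
        return (long) Math.ceil(generatorCalls / selectionProbability);
    }

    /** Ratio of iterations actually available to iterations required. */
    static double headroom(long observedIterations, long requiredIterations) {
        return (double) observedIterations / requiredIterations;
    }

    public static void main(String[] args) {
        long generatorCalls = 200_000L;       // conservative estimate from the comment
        double selectionProbability = 0.25;   // generator chosen roughly 25% of the time
        long observedIn60s = 1_300_000L;      // local observation with a 60 second timeout

        long required = requiredIterations(generatorCalls, selectionProbability);
        System.out.println("required iterations: " + required);
        System.out.println("headroom at 60s: " + headroom(observedIn60s, required));

        // Assumption: throughput scales linearly, so 180 seconds gives 3x the iterations.
        long assumedIn180s = 3 * observedIn60s;
        System.out.println("headroom at 180s (assumed linear): "
            + headroom(assumedIn180s, required));
    }
}
```

A headroom value only slightly above 1 leaves little slack for slow hardware 
or unlucky random choices, which is the concern with the 60 second case.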

> TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
>  is flaky  
> -----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-27054
>                 URL: https://issues.apache.org/jira/browse/HBASE-27054
>             Project: HBase
>          Issue Type: Test
>          Components: test
>    Affects Versions: 2.5.0
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 2.5.0, 3.0.0-alpha-3
>
>
> TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
>   . Looks like we can be off by one on either side of an expected value.
> Any idea what is going on here [~dmanning]? 
> {noformat}
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
>   Time elapsed: 77.779 s  <<< FAILURE!
> java.lang.AssertionError: All servers should have load no less than 60.
> server=srv1351292323,46522,-3543799643652531264 , load=59
>       at org.junit.Assert.fail(Assert.java:89)
>       at org.junit.Assert.assertTrue(Assert.java:42)
>       at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.assertClusterAsBalanced(BalancerTestBase.java:200)
>       at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:577)
>       at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:544)
>       at org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster(TestStochasticLoadBalancerRegionReplicaLargeCluster.java:41)
> {noformat}
> {noformat}
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
>   Time elapsed: 77.781 s  <<< FAILURE!
> java.lang.AssertionError: All servers should have load no more than 60. 
> server=srv1402325691,7995,26308078476749652 , load=61
>       at org.junit.Assert.fail(Assert.java:89)
>       at org.junit.Assert.assertTrue(Assert.java:42)
>       at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.assertClusterAsBalanced(BalancerTestBase.java:198)
>       at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:577)
>       at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:544)
>       at org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster(TestStochasticLoadBalancerRegionReplicaLargeCluster.java:41)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
