[ https://issues.apache.org/jira/browse/HBASE-27054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539754#comment-17539754 ]
David Manning commented on HBASE-27054:
---------------------------------------

With a lower timeout, like 60 seconds, or on slower hardware, we could get fewer iterations. In that sense we may just be unlucky and fail to reach a fully balanced state given the current configuration.

50,000 regions have to move, and most of that work is done by the {{RegionReplicaCandidateGenerator}}, which is chosen roughly 25% of the time. There are likely some missteps. Conservatively, it seems like we may need 200,000 calls to that generator to guarantee the work gets done. At a 25% selection rate, that means about 800,000 iterations. Running locally, if I had set a timeout of 60 seconds, I'd see 1.3 million iterations. That is close enough that we may see the occasional problem.

The tests should ensure that even on slow hardware, with unlucky random choices, we are still virtually guaranteed success. We may not be doing that here. But a 3-minute timeout should make success much more likely. So I'm interested in the test message that says it ran 77 seconds, even though I'm sure the test could be improved to be more deterministic.

> TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster is flaky
> -----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-27054
>                 URL: https://issues.apache.org/jira/browse/HBASE-27054
>             Project: HBase
>          Issue Type: Test
>          Components: test
>    Affects Versions: 2.5.0
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 2.5.0, 3.0.0-alpha-3
>
>
> TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster.
> Looks like we can be off by one on either side of an expected value.
> Any idea what is going on here [~dmanning]?
> {noformat}
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
> Time elapsed: 77.779 s <<< FAILURE!
> java.lang.AssertionError: All servers should have load no less than 60.
> server=srv1351292323,46522,-3543799643652531264 , load=59
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.assertTrue(Assert.java:42)
> at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.assertClusterAsBalanced(BalancerTestBase.java:200)
> at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:577)
> at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:544)
> at org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster(TestStochasticLoadBalancerRegionReplicaLargeCluster.java:41)
> {noformat}
> {noformat}
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
> Time elapsed: 77.781 s <<< FAILURE!
> java.lang.AssertionError: All servers should have load no more than 60.
> server=srv1402325691,7995,26308078476749652 , load=61
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.assertTrue(Assert.java:42)
> at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.assertClusterAsBalanced(BalancerTestBase.java:198)
> at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:577)
> at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:544)
> at org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster(TestStochasticLoadBalancerRegionReplicaLargeCluster.java:41)
> {noformat}

--
This message was sent by Atlassian Jira (v8.20.7#820007)
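
For reference, the iteration estimate in the comment can be written out as a small calculation. This is only a sketch of the arithmetic, not code from HBase: the class name {{IterationBudget}} and both method names are hypothetical, and the inputs (200,000 generator calls, a 25% selection rate, 1.3 million observed iterations in 60 seconds) are the rough figures quoted in the comment.

```java
// Hypothetical helper, not part of HBase: reproduces the back-of-the-envelope
// estimate from the comment above.
final class IterationBudget {
    private IterationBudget() {}

    /**
     * Total balancer iterations needed so that a candidate generator chosen
     * with probability pickProbability is expected to run generatorCalls times.
     */
    static long requiredIterations(long generatorCalls, double pickProbability) {
        if (pickProbability <= 0.0 || pickProbability > 1.0) {
            throw new IllegalArgumentException("pickProbability must be in (0, 1]");
        }
        return (long) Math.ceil(generatorCalls / pickProbability);
    }

    /** Ratio of iterations actually observed to iterations required. */
    static double headroom(long observedIterations, long requiredIterations) {
        return (double) observedIterations / requiredIterations;
    }

    public static void main(String[] args) {
        // Figures from the comment: 200,000 calls at ~25% selection,
        // and ~1.3 million iterations seen locally with a 60 second timeout.
        long required = requiredIterations(200_000L, 0.25);
        double margin = headroom(1_300_000L, required);
        System.out.println("required iterations = " + required);
        System.out.println("headroom = " + margin);
    }
}
```

With those inputs the requirement comes to 800,000 iterations and the headroom to about 1.6x, which is thin once slower hardware is taken into account. If the iteration count scaled roughly linearly with the timeout (an assumption, not something measured here), a 3-minute limit would give several times that margin.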