[ https://issues.apache.org/jira/browse/HBASE-27054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539749#comment-17539749 ]
David Manning commented on HBASE-27054:
---------------------------------------

[~apurtell] Do we know if this is a recent regression, or has it always been flaky? My initial thought is that there may be some randomness (it is a stochastic balancer, after all) which leads to this end result. I don't believe any recent changes would have made it more flaky, but I suppose it's possible. HBASE-26311 is interesting, since it changes the calculations to use standard deviation. [~claraxiong]

Why does the error message say it failed after 77 seconds? The test takes 3 minutes to run for me locally, which is the configured timeout for the balancer in {{StochasticBalancerTestBase2}}. Is there a link to a test failure with full logs that I can inspect? (Note: the 3-minute timeout was set in HBASE-25873; the previous value was 90 seconds.)

With region replicas involved, the {{RegionReplicaCandidateGenerator}} will just move a colocated replica to a random server, without considering how many regions that target server is hosting. The cost functions will allow the move in basically every case, since resolving colocated replicas is heavily prioritized. So maybe by the time all the colocated replicas have been resolved, the number of moves is already pushing the limits of one balancer run, and one regionserver has been randomly overloaded.

A situation that the balancer will have a difficult time getting out of: one regionserver hosts 61 replicas of 61 regions, and another regionserver hosts 59 regions which are replicas of those same 61 regions. The {{LoadCandidateGenerator}} will keep trying to take a region from the server with 61 and give it to the server with 59, but because the target already hosts a matching replica, the move will be too expensive. Still, as long as we can process enough iterations, probabilistically speaking we should eventually hit one of the 2 safe regions to move...
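To make the 61/59 scenario concrete, here is an illustrative sketch (not HBase's actual API; class and method names are made up, and it assumes the 59 regions on the second server are replicas of 59 of the 61 regions on the first) showing how few candidate moves avoid colocating replicas:

```java
// Hypothetical model of the stuck state described above:
// server A hosts one replica each of regions 0..60 (61 regions);
// server B hosts replicas of regions 0..58 (59 regions).
// Moving region r from A to B is "safe" only if B has no replica of r.
public class StuckBalancerSketch {
    public static int countSafeMoves(int regionsOnA, int replicasOnB) {
        int safe = 0;
        for (int r = 0; r < regionsOnA; r++) {
            boolean colocated = r < replicasOnB; // B already hosts a replica of region r
            if (!colocated) {
                safe++;
            }
        }
        return safe;
    }

    public static void main(String[] args) {
        // Only regions 59 and 60 can move without creating a colocated replica.
        System.out.println("safe moves: " + countSafeMoves(61, 59) + " of 61");
    }
}
```

So a uniform random pick from the overloaded server is safe only 2 times in 61, which is why the balancer needs many {{LoadCandidateGenerator}} iterations to escape this state.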
When I run this test locally I see nearly 4 million iterations, and with 1/4 of those using the {{LoadCandidateGenerator}}, it seems like we should generally find a solution that moves them all.

{code}
Finished computing new moving plan. Computation took 180001 ms to try 3975554 different iterations. Found a solution that moves 50006 regions; Going from a computed imbalance of 0.9026309610781538 to a new imbalance of 5.252006025578701E-5. funtionCost=RegionCountSkewCostFunction : (multiplier=500.0, imbalance=0.0); PrimaryRegionCountSkewCostFunction : (multiplier=500.0, imbalance=0.0); MoveCostFunction : (multiplier=7.0, imbalance=0.8334333333333334, need balance); RackLocalityCostFunction : (multiplier=15.0, imbalance=0.0); TableSkewCostFunction : (multiplier=35.0, imbalance=0.0); RegionReplicaHostCostFunction : (multiplier=100000.0, imbalance=0.0); RegionReplicaRackCostFunction : (multiplier=10000.0, imbalance=0.0); ReadRequestCostFunction : (multiplier=5.0, imbalance=0.0); CPRequestCostFunction : (multiplier=5.0, imbalance=0.0); WriteRequestCostFunction : (multiplier=5.0, imbalance=0.0); MemStoreSizeCostFunction : (multiplier=5.0, imbalance=0.0); StoreFileCostFunction : (multiplier=5.0, imbalance=0.0);
{code}

Since the test case also uses 100 tables, and there is a {{TableSkewCostFunction}} involved, it's also possible that the balancer is happy with a slightly uneven region count: if, for every candidate region, the target regionserver already hosts too many regions of that table, then balancing the last region would push toward table imbalance. I don't know if the math would support this, though. If it does, it's possible that out of the last 61 regions, moving any of them to the server with 59 would cause either table skew or colocated replicas, and so the balancer cannot fully balance based on the simple {{LoadCandidateGenerator}} alone. This is all hypothetical, without yet trying to debug.
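As a back-of-envelope check on the "enough iterations" argument: if roughly 1/4 of the ~4M iterations use {{LoadCandidateGenerator}}, and we (optimistically) treat each pick as an independent uniform choice with 2 of 61 regions being safe to move, the chance of never selecting a safe region is vanishingly small. This is only a sketch; the real generator's selection is not independent, and a selected move must also be accepted by the cost functions:

```java
// Rough odds model, not the balancer's actual selection logic.
public class IterationOddsSketch {
    // log10 of the probability that `picks` independent picks all miss,
    // when each pick is safe with probability safe/total
    public static double log10MissAll(long picks, int safe, int total) {
        double pUnsafe = (total - safe) / (double) total;
        return picks * Math.log10(pUnsafe); // use logs to avoid underflow
    }

    public static void main(String[] args) {
        // ~1M LoadCandidateGenerator iterations, 2 safe regions out of 61
        double log10Miss = log10MissAll(4_000_000L / 4, 2, 61);
        // comes out around -14000, i.e. P ~ 10^-14000: effectively impossible to miss
        System.out.println("log10 P(never picking a safe region) = " + log10Miss);
    }
}
```

Under these (admittedly idealized) assumptions, failing to even try a safe move is not a plausible explanation; cost-function rejection of the move seems more likely.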
Given the large size of the test, the number of balancer iterations, and the flakiness, it may be difficult to debug. I have run it 10+ times locally so far, and it passes each time. So, some ideas to explore:
# Don't assert that the cluster is fully balanced in this test case; just assert that there are no colocated replicas. Arguably this is the purpose of the test, and the test framework already appears to allow for this.
# Change the cost function weights for everything other than region counts and replica counts to 0, so that nothing prevents the balancer from optimizing for the variables the test is expecting to validate. Specifically, set the TableSkew and MoveCost multipliers to 0.
# Use fewer than 100 tables, if table skew is a contributing factor.

> TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster is flaky
> -----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-27054
>                 URL: https://issues.apache.org/jira/browse/HBASE-27054
>             Project: HBase
>          Issue Type: Test
>          Components: test
>    Affects Versions: 2.5.0
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 2.5.0, 3.0.0-alpha-3
>
>
> TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster. Looks like we can be off by one on either side of an expected value.
> Any idea what is going on here [~dmanning]?
> {noformat}
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster  Time elapsed: 77.779 s  <<< FAILURE!
> java.lang.AssertionError: All servers should have load no less than 60.
> server=srv1351292323,46522,-3543799643652531264 , load=59
> 	at org.junit.Assert.fail(Assert.java:89)
> 	at org.junit.Assert.assertTrue(Assert.java:42)
> 	at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.assertClusterAsBalanced(BalancerTestBase.java:200)
> 	at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:577)
> 	at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:544)
> 	at org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster(TestStochasticLoadBalancerRegionReplicaLargeCluster.java:41)
> {noformat}
> {noformat}
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster  Time elapsed: 77.781 s  <<< FAILURE!
> java.lang.AssertionError: All servers should have load no more than 60.
> server=srv1402325691,7995,26308078476749652 , load=61
> 	at org.junit.Assert.fail(Assert.java:89)
> 	at org.junit.Assert.assertTrue(Assert.java:42)
> 	at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.assertClusterAsBalanced(BalancerTestBase.java:198)
> 	at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:577)
> 	at org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:544)
> 	at org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster(TestStochasticLoadBalancerRegionReplicaLargeCluster.java:41)
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)