[
https://issues.apache.org/jira/browse/HBASE-27054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539749#comment-17539749
]
David Manning commented on HBASE-27054:
---------------------------------------
[~apurtell] Do we know if this is a recent regression, or has it always been
flaky? My initial thought is that there may be some randomness (it is a
stochastic balancer, after all) that leads to this end result. I don't believe
any recent changes would have caused this to become more flaky, but I suppose
it's possible. HBASE-26311 is interesting, since it changed the calculations to
use standard deviation. [~claraxiong]
Why does the error message say it failed after 77 seconds? The test takes 3
minutes to run for me locally, which is the configured balancer timeout in
{{StochasticBalancerTestBase2}}. Is there a link to a test failure with full
logs that I can inspect? (Note: the 3 minute timeout was set in HBASE-25873;
the previous value was 90 seconds.)
With region replicas involved, the {{RegionReplicaCandidateGenerator}} will
just move a colocated replica to a random server, without considering how many
regions the target server is already hosting. The cost functions will allow
this in basically every case, since the balancer heavily prioritizes resolving
colocated replicas. So by the time all the colocated replicas have been
resolved, the number of moves may already be pushing the limits of one balancer
iteration, and one regionserver may have been randomly overloaded along the
way.
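A toy simulation can illustrate the effect (the server and region counts below are made up for illustration and are not the test's actual sizes; this only models the "move to a uniformly random target, ignoring its load" behavior described above, not HBase's actual generator code):

```python
import random

random.seed(42)

NUM_SERVERS = 100          # hypothetical cluster size
REGIONS_PER_SERVER = 60    # hypothetical balanced load

# Start perfectly balanced, then "fix" 500 colocated replicas by moving
# each to a uniformly random server, ignoring the target's current load.
load = [REGIONS_PER_SERVER] * NUM_SERVERS
for _ in range(500):
    src = random.randrange(NUM_SERVERS)
    dst = random.randrange(NUM_SERVERS)
    if load[src] > 0:
        load[src] -= 1
        load[dst] += 1

# Random targeting leaves some servers above and some below the target.
print(max(load) - min(load))
```

Even starting from a perfectly balanced cluster, random target selection produces a spread of several regions, which the {{LoadCandidateGenerator}} then has to undo within the same balancer run.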
A situation that the balancer will have a difficult time getting out of: one
regionserver is hosting 61 replicas of 61 regions, and another regionserver is
hosting 59 regions which are replicas of those same 61 regions. The
{{LoadCandidateGenerator}} will keep trying to take a region from the server
with 61 and give it to the server with 59, but because a matching replica is
already there, the move will be too expensive. Still, as long as we can process
enough iterations, probabilistically speaking we should be able to find one of
the 2 regions that are safe to move... when I run this test locally I see
nearly 4 million iterations, and with 1/4 of those using the
{{LoadCandidateGenerator}} it seems like we should generally find a solution
that moves them all.
{code}
Finished computing new moving plan. Computation took 180001 ms to try 3975554
different iterations. Found a solution that moves 50006 regions; Going from a
computed imbalance of 0.9026309610781538 to a new imbalance of
5.252006025578701E-5. funtionCost=RegionCountSkewCostFunction :
(multiplier=500.0, imbalance=0.0); PrimaryRegionCountSkewCostFunction :
(multiplier=500.0, imbalance=0.0); MoveCostFunction : (multiplier=7.0,
imbalance=0.8334333333333334, need balance); RackLocalityCostFunction :
(multiplier=15.0, imbalance=0.0); TableSkewCostFunction : (multiplier=35.0,
imbalance=0.0); RegionReplicaHostCostFunction : (multiplier=100000.0,
imbalance=0.0); RegionReplicaRackCostFunction : (multiplier=10000.0,
imbalance=0.0); ReadRequestCostFunction : (multiplier=5.0, imbalance=0.0);
CPRequestCostFunction : (multiplier=5.0, imbalance=0.0);
WriteRequestCostFunction : (multiplier=5.0, imbalance=0.0);
MemStoreSizeCostFunction : (multiplier=5.0, imbalance=0.0);
StoreFileCostFunction : (multiplier=5.0, imbalance=0.0);
{code}
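A back-of-the-envelope check of that probabilistic claim (assuming, as a simplification, that each {{LoadCandidateGenerator}} attempt picks a source region uniformly at random from the overloaded server's 61 regions, of which only 2 can move without colocating replicas):

```python
# ~4M total iterations, ~1/4 of which use LoadCandidateGenerator
attempts = 4_000_000 // 4
p_miss_once = 59 / 61        # one pick misses both safe regions
p_miss_all = p_miss_once ** attempts

# Probability that a million independent picks all miss the 2 safe
# regions -- so small it underflows to zero in double precision.
print(p_miss_all)
```

So under that simple model, failing to find a safe move in a full balancer run would be astronomically unlikely, which suggests the stuck state, if it exists, involves more than just pick probability.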
Since the test case also uses 100 tables, and there is a
{{TableSkewCostFunction}} involved, it's also possible that the balancer is
happy with a slightly uneven region count, because moving the last region would
push towards table imbalance if, for every candidate region, the target
regionserver already has too many regions of that table. I don't know if the
math would support this, though. If it does, it's possible that out of the last
61 regions, moving any one of them to the server with 59 would either cause
table skew or colocate replicas, so the balancer cannot fully balance using the
simple {{LoadCandidateGenerator}} alone.
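A sketch of the trade-off in question, using the multipliers from the log above (the per-function cost values here are invented for illustration; HBase's actual cost normalization is more involved, and the balancer's acceptance rule is modeled simply as "take the move only if total weighted cost drops"):

```python
def total_cost(region_count_cost, table_skew_cost):
    # Multipliers taken from the logged funtionCost line:
    # RegionCountSkewCostFunction=500.0, TableSkewCostFunction=35.0
    return 500.0 * region_count_cost + 35.0 * table_skew_cost

# Hypothetical scaled costs before/after moving the final region:
before = total_cost(0.002, 0.000)  # counts slightly skewed, tables even
after = total_cost(0.000, 0.040)   # counts fixed, but table skew introduced

# If the table-skew delta is large enough relative to the multipliers,
# the move raises total cost and is rejected.
print(before, after, after > before)
```

So even with the region count multiplier at 500 versus 35 for table skew, a sufficiently large table-skew delta could make the balancer prefer leaving the last region unbalanced.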
This is all hypothetical, without yet trying to debug. Given the large size of
the test, the number of balancer iterations, and the flakiness, it may be
difficult to debug. I ran it 10+ times locally so far, and it passes each time.
So, some ideas to explore:
# Don't assert that the cluster is fully balanced in this test case; just
assert that there are no colocated replicas. Arguably that is the purpose of
the test, and the test framework already appears to allow for this.
# Set the cost function weights for everything other than region counts and
replica counts to 0, so that nothing prevents the balancer from optimizing the
variables the test is meant to validate. Specifically, set the TableSkew and
MoveCost multipliers to 0.
# Use fewer than 100 tables, if table skew is a contributing factor.
> TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
> is flaky
> -----------------------------------------------------------------------------------------------
>
> Key: HBASE-27054
> URL: https://issues.apache.org/jira/browse/HBASE-27054
> Project: HBase
> Issue Type: Test
> Components: test
> Affects Versions: 2.5.0
> Reporter: Andrew Kyle Purtell
> Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3
>
>
> TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
> . Looks like we can be off by one on either side of an expected value.
> Any idea what is going on here [~dmanning]?
> {noformat}
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
> Time elapsed: 77.779 s <<< FAILURE!
> java.lang.AssertionError: All servers should have load no less than 60.
> server=srv1351292323,46522,-3543799643652531264 , load=59
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.assertTrue(Assert.java:42)
> at
> org.apache.hadoop.hbase.master.balancer.BalancerTestBase.assertClusterAsBalanced(BalancerTestBase.java:200)
> at
> org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:577)
> at
> org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:544)
> at
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster(TestStochasticLoadBalancerRegionReplicaLargeCluster.java:41)
> {noformat}
> {noformat}
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
> Time elapsed: 77.781 s <<< FAILURE!
> java.lang.AssertionError: All servers should have load no more than 60.
> server=srv1402325691,7995,26308078476749652 , load=61
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.assertTrue(Assert.java:42)
> at
> org.apache.hadoop.hbase.master.balancer.BalancerTestBase.assertClusterAsBalanced(BalancerTestBase.java:198)
> at
> org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:577)
> at
> org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:544)
> at
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster(TestStochasticLoadBalancerRegionReplicaLargeCluster.java:41)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.7#820007)