[
https://issues.apache.org/jira/browse/HBASE-27054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539749#comment-17539749
]
David Manning commented on HBASE-27054:
---------------------------------------
[~apurtell] Do we know if this is a recent regression, or has it always been
flaky? My initial thought is that there may be some randomness (it is a
stochastic balancer, after all) that leads to this end result. I don't believe
any recent changes would have caused this to become more flaky, but I suppose
it's possible. HBASE-26311 is interesting, since it changed the calculations to
use standard deviation. [~claraxiong]
Why does the error message say it failed after 77 seconds? The test takes 3
minutes to run for me locally, which is the configured balancer timeout in
{{StochasticBalancerTestBase2}}. Is there a link to a test failure with full
logs that I can inspect? (Note: the 3 minute timeout was set in HBASE-25873;
the previous value was 90 seconds.)
With region replicas involved, the {{RegionReplicaCandidateGenerator}} will
just move a colocated replica to a random server, without considering how many
regions the target server is already hosting. The cost functions will allow
this in basically every case, since the balancer heavily prioritizes resolving
colocated replicas. So by the time all the colocated replicas have been
resolved, the number of moves may already be pushing the limits of one balancer
iteration, and one regionserver may have been randomly overloaded along the
way.
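A toy simulation can illustrate the effect (the server and region counts below are made up for illustration and are not the test's actual sizes; this only models the "move to a uniformly random target, ignoring its load" behavior described above, not HBase's actual generator code):

```python
import random

random.seed(42)

NUM_SERVERS = 100          # hypothetical cluster size
REGIONS_PER_SERVER = 60    # hypothetical balanced load

# Start perfectly balanced, then "fix" 500 colocated replicas by moving
# each to a uniformly random server, ignoring the target's current load.
load = [REGIONS_PER_SERVER] * NUM_SERVERS
for _ in range(500):
    src = random.randrange(NUM_SERVERS)
    dst = random.randrange(NUM_SERVERS)
    if load[src] > 0:
        load[src] -= 1
        load[dst] += 1

# Random targeting leaves some servers above and some below the target.
print(max(load) - min(load))
```

Even starting from a perfectly balanced cluster, random target selection produces a spread of several regions, which the {{LoadCandidateGenerator}} then has to undo within the same balancer run.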
A situation that the balancer will have a difficult time getting out of: one
regionserver is hosting 61 replicas of 61 regions, and another regionserver is
hosting 59 regions which are replicas of those same 61 regions. The
{{LoadCandidateGenerator}} will keep trying to take a region from the server
with 61 and give it to the server with 59, but because a matching replica is
already there, the move will be too expensive. Still, as long as we can process
enough iterations, probabilistically speaking we should be able to find one of
the 2 regions that are safe to move... when I run this test locally I see
nearly 4 million iterations, and with 1/4 of those using the
{{LoadCandidateGenerator}} it seems like we should generally find a solution
that moves them all.
{code}
Finished computing new moving plan. Computation took 180001 ms to try 3975554
different iterations. Found a solution that moves 50006 regions; Going from a
computed imbalance of 0.9026309610781538 to a new imbalance of
5.252006025578701E-5. funtionCost=RegionCountSkewCostFunction :
(multiplier=500.0, imbalance=0.0); PrimaryRegionCountSkewCostFunction :
(multiplier=500.0, imbalance=0.0); MoveCostFunction : (multiplier=7.0,
imbalance=0.8334333333333334, need balance); RackLocalityCostFunction :
(multiplier=15.0, imbalance=0.0); TableSkewCostFunction : (multiplier=35.0,
imbalance=0.0); RegionReplicaHostCostFunction : (multiplier=100000.0,
imbalance=0.0); RegionReplicaRackCostFunction : (multiplier=10000.0,
imbalance=0.0); ReadRequestCostFunction : (multiplier=5.0, imbalance=0.0);
CPRequestCostFunction : (multiplier=5.0, imbalance=0.0);
WriteRequestCostFunction : (multiplier=5.0, imbalance=0.0);
MemStoreSizeCostFunction : (multiplier=5.0, imbalance=0.0);
StoreFileCostFunction : (multiplier=5.0, imbalance=0.0);
{code}
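A back-of-the-envelope check of that probabilistic claim (assuming, as a simplification, that each {{LoadCandidateGenerator}} attempt picks a source region uniformly at random from the overloaded server's 61 regions, of which only 2 can move without colocating replicas):

```python
# ~4M total iterations, ~1/4 of which use LoadCandidateGenerator
attempts = 4_000_000 // 4
p_miss_once = 59 / 61        # one pick misses both safe regions
p_miss_all = p_miss_once ** attempts

# Probability that a million independent picks all miss the 2 safe
# regions -- so small it underflows to zero in double precision.
print(p_miss_all)
```

So under that simple model, failing to find a safe move in a full balancer run would be astronomically unlikely, which suggests the stuck state, if it exists, involves more than just pick probability.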
Since the test case also uses 100 tables, and there is a
{{TableSkewCostFunction}} involved, it's also possible that the balancer is
happy with a slightly uneven region count, because moving the last region would
push towards table imbalance if, for every candidate region, the target
regionserver already has too many regions of that table. I don't know if the
math would support this, though. If it does, it's possible that out of the last
61 regions, moving any one of them to the server with 59 would either cause
table skew or colocate replicas, so the balancer cannot fully balance using the
simple {{LoadCandidateGenerator}} alone.
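A sketch of the trade-off in question, using the multipliers from the log above (the per-function cost values here are invented for illustration; HBase's actual cost normalization is more involved, and the balancer's acceptance rule is modeled simply as "take the move only if total weighted cost drops"):

```python
def total_cost(region_count_cost, table_skew_cost):
    # Multipliers taken from the logged funtionCost line:
    # RegionCountSkewCostFunction=500.0, TableSkewCostFunction=35.0
    return 500.0 * region_count_cost + 35.0 * table_skew_cost

# Hypothetical scaled costs before/after moving the final region:
before = total_cost(0.002, 0.000)  # counts slightly skewed, tables even
after = total_cost(0.000, 0.040)   # counts fixed, but table skew introduced

# If the table-skew delta is large enough relative to the multipliers,
# the move raises total cost and is rejected.
print(before, after, after > before)
```

So even with the region count multiplier at 500 versus 35 for table skew, a sufficiently large table-skew delta could make the balancer prefer leaving the last region unbalanced.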
This is all hypothetical, without yet trying to debug. Given the large size of
the test, the number of balancer iterations, and the flakiness, it may be
difficult to debug. I ran it 10+ times locally so far, and it passes each time.
So, some ideas to explore:
# Don't assert that the cluster is fully balanced in this test case; just
assert that there are no colocated replicas. Arguably that is the purpose of
the test, and the test framework already appears to allow for this.
# Set the cost function weights for everything other than region counts and
replica counts to 0, so that nothing prevents the balancer from optimizing the
variables the test is meant to validate. Specifically, set the TableSkew and
MoveCost multipliers to 0.
# Use fewer than 100 tables, if table skew is a contributing factor.
> TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
> is flaky
> -----------------------------------------------------------------------------------------------
>
> Key: HBASE-27054
> URL: https://issues.apache.org/jira/browse/HBASE-27054
> Project: HBase
> Issue Type: Test
> Components: test
> Affects Versions: 2.5.0
> Reporter: Andrew Kyle Purtell
> Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3
>
>
> TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
> . Looks like we can be off by one on either side of an expected value.
> Any idea what is going on here [~dmanning]?
> {noformat}
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
> Time elapsed: 77.779 s <<< FAILURE!
> java.lang.AssertionError: All servers should have load no less than 60.
> server=srv1351292323,46522,-3543799643652531264 , load=59
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.assertTrue(Assert.java:42)
> at
> org.apache.hadoop.hbase.master.balancer.BalancerTestBase.assertClusterAsBalanced(BalancerTestBase.java:200)
> at
> org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:577)
> at
> org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:544)
> at
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster(TestStochasticLoadBalancerRegionReplicaLargeCluster.java:41)
> {noformat}
> {noformat}
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster
> Time elapsed: 77.781 s <<< FAILURE!
> java.lang.AssertionError: All servers should have load no more than 60.
> server=srv1402325691,7995,26308078476749652 , load=61
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.assertTrue(Assert.java:42)
> at
> org.apache.hadoop.hbase.master.balancer.BalancerTestBase.assertClusterAsBalanced(BalancerTestBase.java:198)
> at
> org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:577)
> at
> org.apache.hadoop.hbase.master.balancer.BalancerTestBase.testWithCluster(BalancerTestBase.java:544)
> at
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaLargeCluster.testRegionReplicasOnLargeCluster(TestStochasticLoadBalancerRegionReplicaLargeCluster.java:41)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.7#820007)