[jira] [Comment Edited] (HBASE-26311) Balancer gets stuck in cohosted replica distribution

Clara Xiong (Jira) Sun, 10 Oct 2021 15:27:04 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424743#comment-17424743
 ]


Clara Xiong edited comment on HBASE-26311 at 10/10/21, 10:26 PM:
-----------------------------------------------------------------

When the balancer has to satisfy other constraints, an even region count
distribution simply cannot be guaranteed, as in the existing test case
TestStochasticLoadBalancerRegionReplicaWithRacks. Because replica distribution
has a much higher weight than region count skew, servers on racks with fewer
servers tend to get more regions than servers on racks with more servers.

In this test case, servers 0 and 1 are on the same rack while servers 2 and 3
are each on their own rack, because servers cannot be placed completely evenly.
The resulting region count distribution can be [2, 2, 4, 4] or [1, 3, 4, 4], so
that we have no replicas of the same region on the first rack. We therefore
have to have fewer regions per server on the first two servers. With the
current algorithm, the region count skew costs of the two plans are the same,
because only the linear deviation from the ideal average is considered. It can
get much more extreme when we have 5 servers for this test case:
[1, 3, 3, 3, 5] or [2, 2, 3, 3, 5], depending on the random walk. But since the
algorithm assigns them the same region count skew cost, the balancer can be
stuck at the former. The more servers we have, as long as the RS counts are not
completely even, which happens all the time, the more variation we will see in
the results depending on the random walk. But once we reach the extreme case,
the balancer is stuck because the cost function says that moving regions brings
no gain.

I am proposing to use the sum of squared deviations for the load cost
functions, in line with the replica cost functions. See
https://issues.apache.org/jira/browse/HBASE-25625
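
To make the comparison concrete, here is a minimal sketch in Python that scores the region count distributions from the test case under both measures. The function names linear_cost and squared_cost are hypothetical, and the costs are left unnormalized, so this is plain arithmetic on the numbers above rather than the balancer's actual cost function code.

```python
def linear_cost(counts):
    """Sum of absolute deviations from the mean region count (unnormalized)."""
    mean = sum(counts) / len(counts)
    return sum(abs(c - mean) for c in counts)


def squared_cost(counts):
    """Sum of squared deviations from the mean region count (unnormalized)."""
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts)


# Region count distributions quoted in the comment.
four_servers = ([2, 2, 4, 4], [1, 3, 4, 4])
five_servers = ([1, 3, 3, 3, 5], [2, 2, 3, 3, 5])

for plans in (four_servers, five_servers):
    for counts in plans:
        print(counts,
              "linear:", linear_cost(counts),
              "squared:", squared_cost(counts))
```

Under the linear measure both plans in each pair score the same, so a random walk has no incentive to leave [1, 3, 3, 3, 5]. Under the squared measure [2, 2, 3, 3, 5] scores lower than [1, 3, 3, 3, 5], so a move toward it would register as a gain.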


was (Author: claraxiong):
When the balancer has to satisfy other constraints, an even region count
distribution simply cannot be guaranteed, as in the existing test case
TestStochasticLoadBalancerRegionReplicaWithRacks. Because replica distribution
has a much higher weight than region count skew, servers on racks with fewer
servers tend to get more regions than servers on racks with more servers.

In this test case, servers 0 and 1 are on the same rack while servers 2 and 3
are each on their own rack, because servers cannot be placed completely evenly.
The resulting region count distribution can be [2, 2, 4, 4] or [1, 3, 4, 4], so
that we have no replicas of the same region on the first rack. We therefore
have to have fewer regions per server on the first two servers. With the
current algorithm, the region count skew costs of the two plans are the same,
because only the linear deviation from the ideal average is considered. It can
get much more extreme when we have 5 servers for this test case:
[1, 3, 3, 3, 5] or [2, 2, 3, 3, 5], depending on the random walk. But since the
algorithm assigns them the same region count skew cost, the balancer can be
stuck at the former. The more servers we have, as long as the RS counts are not
completely even, which happens all the time, the more variation we will see in
the results depending on the random walk. But once we reach the extreme case,
the balancer is stuck because the cost function says that moving regions brings
no gain.

I am proposing to use the sum of squared deviations for the load cost
functions, in line with the replica cost functions. We don't need the standard
deviation, so we can keep it simple and fast. See
https://issues.apache.org/jira/browse/HBASE-25625

> Balancer gets stuck in cohosted replica distribution
> ----------------------------------------------------
>
>                 Key: HBASE-26311
>                 URL: https://issues.apache.org/jira/browse/HBASE-26311
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer
>            Reporter: Clara Xiong
>            Assignee: Clara Xiong
>            Priority: Major
>
> In production, we found a corner case where the balancer cannot make
> progress when there are cohosted replicas. This is reproduced on the master
> branch using the test added in HBASE-26310. The two cost functions do not
> provide a proper evaluation, so the balancer cannot make progress.
>  
> Another observation is that the imbalance weight is not updated properly by
> the cost functions during plan generation. The subsequent run reports a much
> higher imbalance.
> {quote}2021-09-24 22:26:56,039 INFO 
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Finished 
> computing new moving plan. Computation took 2400001 ms to try 1284702 
> different iterations.  Found a solution that moves 6941 regions; Going from a 
> computed imbalance of 6389.260497305375 to a new imbalance of 
> 21.03904901349833. 
> 2021-09-24 22:33:40,961 INFO 
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Running 
> balancer because at least one server hosts replicas of the same region.
> 2021-09-24 22:33:40,961 INFO 
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Start 
> StochasticLoadBalancer.balancer, initial weighted average 
> imbalance=6726.357026325619, functionCost=RegionCountSkewCostFunction : 
> (multiplier=500.0, imbalance=0.07721156356401288); 
> PrimaryRegionCountSkewCostFunction : (multiplier=500.0, 
> imbalance=0.06298215530179263); MoveCostFunction : (multiplier=7.0, 
> imbalance=0.0, balanced); ServerLocalityCostFunction : (multiplier=25.0, 
> imbalance=0.463289517245148); RackLocalityCostFunction : (multiplier=15.0, 
> imbalance=0.25670928199727017); TableSkewCostFunction : (multiplier=500.0, 
> imbalance=0.4378048676389543); RegionReplicaHostCostFunction : 
> (multiplier=100000.0, imbalance=0.05809798270893372); 
> RegionReplicaRackCostFunction : (multiplier=10000.0, 
> imbalance=0.061018251681075886); ReadRequestCostFunction : (multiplier=5.0, 
> imbalance=0.08235908576054465); WriteRequestCostFunction : (multiplier=5.0, 
> imbalance=0.09385090828285425); MemStoreSizeCostFunction : (multiplier=5.0, 
> imbalance=0.1327376982847744); StoreFileCostFunction : (multiplier=5.0, 
> imbalance=0.07986594927573858);  computedMaxSteps=5579331200
> {quote}
>  
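
As a sanity check on the second log entry above, the multipliers and imbalances it lists can be combined by hand. The sketch below is plain Python arithmetic on the logged numbers, not HBase code, and the variable names are made up for illustration. It suggests that the reported "initial weighted average imbalance" equals the multiplier-weighted sum of the per-function imbalances, and that RegionReplicaHostCostFunction accounts for most of it.

```python
# (multiplier, imbalance) pairs copied from the 22:33:40 log entry.
logged = {
    "RegionCountSkewCostFunction": (500.0, 0.07721156356401288),
    "PrimaryRegionCountSkewCostFunction": (500.0, 0.06298215530179263),
    "MoveCostFunction": (7.0, 0.0),
    "ServerLocalityCostFunction": (25.0, 0.463289517245148),
    "RackLocalityCostFunction": (15.0, 0.25670928199727017),
    "TableSkewCostFunction": (500.0, 0.4378048676389543),
    "RegionReplicaHostCostFunction": (100000.0, 0.05809798270893372),
    "RegionReplicaRackCostFunction": (10000.0, 0.061018251681075886),
    "ReadRequestCostFunction": (5.0, 0.08235908576054465),
    "WriteRequestCostFunction": (5.0, 0.09385090828285425),
    "MemStoreSizeCostFunction": (5.0, 0.1327376982847744),
    "StoreFileCostFunction": (5.0, 0.07986594927573858),
}

# Weighted contribution of each cost function.
contributions = {name: mult * imb for name, (mult, imb) in logged.items()}
total = sum(contributions.values())
dominant = max(contributions, key=contributions.get)

print("weighted sum:", total)
print("largest contributor:", dominant, contributions[dominant])
print("region count skew contribution:",
      contributions["RegionCountSkewCostFunction"])
```

The weighted sum agrees with the logged value of 6726.357026325619 to within floating-point rounding. The replica host term contributes roughly 5810 of that, against roughly 39 for region count skew, which is consistent with the comment's point that replica distribution carries a much higher weight.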



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
