[ 
https://issues.apache.org/jira/browse/HBASE-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425839#comment-17425839
 ] 

Clara Xiong commented on HBASE-26311:
-------------------------------------

Added a writeup of the larger problem we ran into on a cloud cluster.

> Balancer gets stuck in cohosted replica distribution
> ----------------------------------------------------
>
>                 Key: HBASE-26311
>                 URL: https://issues.apache.org/jira/browse/HBASE-26311
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer
>            Reporter: Clara Xiong
>            Assignee: Clara Xiong
>            Priority: Major
>
> In production, we found a corner case where the balancer cannot make 
> progress when there are cohosted replicas. This is reproduced on the master 
> branch using the test added in HBASE-26310. The two cost functions do not 
> provide a proper evaluation that would let the balancer make progress. 
>  
> Another observation is that the imbalance weight is not updated properly by 
> the cost functions during plan generation. The subsequent run reports a much 
> higher imbalance.
> {quote}2021-09-24 22:26:56,039 INFO 
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Finished 
> computing new moving plan. Computation took 2400001 ms to try 1284702 
> different iterations.  Found a solution that moves 6941 regions; Going from a 
> computed imbalance of 6389.260497305375 to a new imbalance of 
> 21.03904901349833. 
> 2021-09-24 22:33:40,961 INFO 
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Running 
> balancer because at least one server hosts replicas of the same region.
> 2021-09-24 22:33:40,961 INFO 
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Start 
> StochasticLoadBalancer.balancer, initial weighted average 
> imbalance=6726.357026325619, functionCost=RegionCountSkewCostFunction : 
> (multiplier=500.0, imbalance=0.07721156356401288); 
> PrimaryRegionCountSkewCostFunction : (multiplier=500.0, 
> imbalance=0.06298215530179263); MoveCostFunction : (multiplier=7.0, 
> imbalance=0.0, balanced); ServerLocalityCostFunction : (multiplier=25.0, 
> imbalance=0.463289517245148); RackLocalityCostFunction : (multiplier=15.0, 
> imbalance=0.25670928199727017); TableSkewCostFunction : (multiplier=500.0, 
> imbalance=0.4378048676389543); RegionReplicaHostCostFunction : 
> (multiplier=100000.0, imbalance=0.05809798270893372); 
> RegionReplicaRackCostFunction : (multiplier=10000.0, 
> imbalance=0.061018251681075886); ReadRequestCostFunction : (multiplier=5.0, 
> imbalance=0.08235908576054465); WriteRequestCostFunction : (multiplier=5.0, 
> imbalance=0.09385090828285425); MemStoreSizeCostFunction : (multiplier=5.0, 
> imbalance=0.1327376982847744); StoreFileCostFunction : (multiplier=5.0, 
> imbalance=0.07986594927573858);  computedMaxSteps=5579331200
> {quote}
>  
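
To make the numbers in the second log entry easier to read, the sketch below recomputes the reported imbalance from the per-function multipliers and imbalance values. It is plain arithmetic on the logged figures, not HBase code: the class name WeightedImbalance and its helper methods are hypothetical, and it assumes the logged "weighted average imbalance" is the sum of multiplier * imbalance over all cost functions, which is what these particular numbers suggest.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of HBase: re-adds the values printed in the log.
public class WeightedImbalance {

    // {multiplier, imbalance} pairs copied from the log entry at 22:33:40,961.
    static Map<String, double[]> loggedCosts() {
        Map<String, double[]> costs = new LinkedHashMap<>();
        costs.put("RegionCountSkewCostFunction", new double[] {500.0, 0.07721156356401288});
        costs.put("PrimaryRegionCountSkewCostFunction", new double[] {500.0, 0.06298215530179263});
        costs.put("MoveCostFunction", new double[] {7.0, 0.0});
        costs.put("ServerLocalityCostFunction", new double[] {25.0, 0.463289517245148});
        costs.put("RackLocalityCostFunction", new double[] {15.0, 0.25670928199727017});
        costs.put("TableSkewCostFunction", new double[] {500.0, 0.4378048676389543});
        costs.put("RegionReplicaHostCostFunction", new double[] {100000.0, 0.05809798270893372});
        costs.put("RegionReplicaRackCostFunction", new double[] {10000.0, 0.061018251681075886});
        costs.put("ReadRequestCostFunction", new double[] {5.0, 0.08235908576054465});
        costs.put("WriteRequestCostFunction", new double[] {5.0, 0.09385090828285425});
        costs.put("MemStoreSizeCostFunction", new double[] {5.0, 0.1327376982847744});
        costs.put("StoreFileCostFunction", new double[] {5.0, 0.07986594927573858});
        return costs;
    }

    // Assumption: total imbalance = sum of multiplier * imbalance.
    static double weightedSum(Map<String, double[]> costs) {
        double total = 0.0;
        for (double[] c : costs.values()) {
            total += c[0] * c[1];
        }
        return total;
    }

    // Name of the cost function with the largest multiplier * imbalance term.
    static String largestContributor(Map<String, double[]> costs) {
        String name = null;
        double max = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : costs.entrySet()) {
            double contribution = e.getValue()[0] * e.getValue()[1];
            if (contribution > max) {
                max = contribution;
                name = e.getKey();
            }
        }
        return name;
    }

    public static void main(String[] args) {
        Map<String, double[]> costs = loggedCosts();
        double total = weightedSum(costs);
        System.out.printf("weighted sum = %.6f%n", total);
        // Print each function's contribution and its share of the total.
        for (Map.Entry<String, double[]> e : costs.entrySet()) {
            double contribution = e.getValue()[0] * e.getValue()[1];
            System.out.printf("%-36s %12.4f (%5.2f%%)%n",
                e.getKey(), contribution, 100.0 * contribution / total);
        }
    }
}
```

For these values the sum comes to about 6726.357, which agrees with the logged initial imbalance, and RegionReplicaHostCostFunction alone accounts for roughly 5809.8 of it (about 86%). That is consistent with the description: the replica cost dominates the total, and the total is far above the 21.04 that the previous run reported as its new imbalance.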



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
