[
https://issues.apache.org/jira/browse/HBASE-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425839#comment-17425839
]
Clara Xiong commented on HBASE-26311:
-------------------------------------
Added a writeup for the bigger problem we ran into on a cluster on cloud.
> Balancer gets stuck in cohosted replica distribution
> ----------------------------------------------------
>
> Key: HBASE-26311
> URL: https://issues.apache.org/jira/browse/HBASE-26311
> Project: HBase
> Issue Type: Bug
> Components: Balancer
> Reporter: Clara Xiong
> Assignee: Clara Xiong
> Priority: Major
>
> In production, we found a corner case where the balancer cannot make
> progress when there are cohosted replicas. This is reproduced on the master
> branch using the test added in HBASE-26310. The two cost functions do not
> provide a proper evaluation that would let the balancer make progress.
>
> Another observation is that the imbalance weight is not updated properly by
> the cost functions during plan generation. The subsequent run reports a much
> higher imbalance.
> {quote}2021-09-24 22:26:56,039 INFO
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Finished
> computing new moving plan. Computation took 2400001 ms to try 1284702
> different iterations. Found a solution that moves 6941 regions; Going from a
> computed imbalance of 6389.260497305375 to a new imbalance of
> 21.03904901349833.
> 2021-09-24 22:33:40,961 INFO
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Running
> balancer because at least one server hosts replicas of the same region.
> 2021-09-24 22:33:40,961 INFO
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Start
> StochasticLoadBalancer.balancer, initial weighted average
> imbalance=6726.357026325619, functionCost=RegionCountSkewCostFunction :
> (multiplier=500.0, imbalance=0.07721156356401288);
> PrimaryRegionCountSkewCostFunction : (multiplier=500.0,
> imbalance=0.06298215530179263); MoveCostFunction : (multiplier=7.0,
> imbalance=0.0, balanced); ServerLocalityCostFunction : (multiplier=25.0,
> imbalance=0.463289517245148); RackLocalityCostFunction : (multiplier=15.0,
> imbalance=0.25670928199727017); TableSkewCostFunction : (multiplier=500.0,
> imbalance=0.4378048676389543); RegionReplicaHostCostFunction :
> (multiplier=100000.0, imbalance=0.05809798270893372);
> RegionReplicaRackCostFunction : (multiplier=10000.0,
> imbalance=0.061018251681075886); ReadRequestCostFunction : (multiplier=5.0,
> imbalance=0.08235908576054465); WriteRequestCostFunction : (multiplier=5.0,
> imbalance=0.09385090828285425); MemStoreSizeCostFunction : (multiplier=5.0,
> imbalance=0.1327376982847744); StoreFileCostFunction : (multiplier=5.0,
> imbalance=0.07986594927573858); computedMaxSteps=5579331200
> {quote}
>
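As a sanity check on the quoted log, the sketch below recomputes the reported initial imbalance from the per-function values. It assumes the total is the plain sum of multiplier times imbalance over the listed cost functions. That assumption is inferred from the numbers in this log, not taken from the balancer source. The class and method names (ImbalanceCheck, weightedSum) are hypothetical and exist only for this illustration. Under that assumption the sum reproduces the logged 6726.357026325619, and the RegionReplicaHostCostFunction term alone (100000.0 * 0.05809798270893372, about 5809.8) accounts for most of it.

```java
public class ImbalanceCheck {
    // Cost function names, in the order they appear in the
    // "Start StochasticLoadBalancer.balancer" log line.
    public static final String[] NAMES = {
        "RegionCountSkewCostFunction", "PrimaryRegionCountSkewCostFunction",
        "MoveCostFunction", "ServerLocalityCostFunction",
        "RackLocalityCostFunction", "TableSkewCostFunction",
        "RegionReplicaHostCostFunction", "RegionReplicaRackCostFunction",
        "ReadRequestCostFunction", "WriteRequestCostFunction",
        "MemStoreSizeCostFunction", "StoreFileCostFunction"
    };

    // Multipliers copied verbatim from the log line.
    public static final double[] MULTIPLIERS = {
        500.0, 500.0, 7.0, 25.0, 15.0, 500.0,
        100000.0, 10000.0, 5.0, 5.0, 5.0, 5.0
    };

    // Per-function imbalance values copied verbatim from the log line.
    public static final double[] IMBALANCES = {
        0.07721156356401288, 0.06298215530179263, 0.0,
        0.463289517245148, 0.25670928199727017, 0.4378048676389543,
        0.05809798270893372, 0.061018251681075886, 0.08235908576054465,
        0.09385090828285425, 0.1327376982847744, 0.07986594927573858
    };

    // Assumption: total imbalance = sum over functions of multiplier * imbalance.
    public static double weightedSum(double[] multipliers, double[] imbalances) {
        double total = 0.0;
        for (int i = 0; i < multipliers.length; i++) {
            total += multipliers[i] * imbalances[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Print each function's weighted contribution, then the total.
        for (int i = 0; i < NAMES.length; i++) {
            System.out.printf("%-36s %14.6f%n", NAMES[i], MULTIPLIERS[i] * IMBALANCES[i]);
        }
        System.out.printf("%-36s %14.6f%n", "total", weightedSum(MULTIPLIERS, IMBALANCES));
    }
}
```

Running main prints one weighted contribution per cost function followed by the total, which makes it easy to see which functions dominate the value the balancer reports at the start of a run.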
--
This message was sent by Atlassian Jira
(v8.3.4#803005)