[ 
https://issues.apache.org/jira/browse/HBASE-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clara Xiong updated HBASE-26311:
--------------------------------
    Description: 
In production, we found a corner case where the balancer cannot make progress 
when there are cohosted replicas. This is reproduced on the master branch using 
the test added in HBASE-26310. The two cost functions do not provide a proper 
evaluation that would let the balancer make progress. 

 

Another observation is that the imbalance weight is not updated properly by the 
cost functions during plan generation. The subsequent run reports a much higher 
imbalance.
{quote}2021-09-24 22:26:56,039 INFO 
org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Finished 
computing new moving plan. Computation took 2400001 ms to try 1284702 different 
iterations.  Found a solution that moves 6941 regions; Going from a computed 
imbalance of 6389.260497305375 to a new imbalance of 21.03904901349833. 

2021-09-24 22:33:40,961 INFO 
org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Running 
balancer because at least one server hosts replicas of the same region.

2021-09-24 22:33:40,961 INFO 
org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Start 
StochasticLoadBalancer.balancer, initial weighted average 
imbalance=6726.357026325619, functionCost=RegionCountSkewCostFunction : 
(multiplier=500.0, imbalance=0.07721156356401288); 
PrimaryRegionCountSkewCostFunction : (multiplier=500.0, 
imbalance=0.06298215530179263); MoveCostFunction : (multiplier=7.0, 
imbalance=0.0, balanced); ServerLocalityCostFunction : (multiplier=25.0, 
imbalance=0.463289517245148); RackLocalityCostFunction : (multiplier=15.0, 
imbalance=0.25670928199727017); TableSkewCostFunction : (multiplier=500.0, 
imbalance=0.4378048676389543); RegionReplicaHostCostFunction : 
(multiplier=100000.0, imbalance=0.05809798270893372); 
RegionReplicaRackCostFunction : (multiplier=10000.0, 
imbalance=0.061018251681075886); ReadRequestCostFunction : (multiplier=5.0, 
imbalance=0.08235908576054465); WriteRequestCostFunction : (multiplier=5.0, 
imbalance=0.09385090828285425); MemStoreSizeCostFunction : (multiplier=5.0, 
imbalance=0.1327376982847744); StoreFileCostFunction : (multiplier=5.0, 
imbalance=0.07986594927573858);  computedMaxSteps=5579331200
{quote}
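
For readers checking the numbers: the per-function values in the 22:33:40,961 log line are consistent with the reported "initial weighted average imbalance" being the plain sum of multiplier x imbalance over all cost functions. The sketch below only redoes that arithmetic on the logged values. The class and method names (WeightedImbalanceSketch, loggedCosts, weightedTotal, share) are hypothetical and are not part of HBase, and this is not a claim about how StochasticLoadBalancer computes the figure internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not HBase code: re-adds the values from the log line above. */
final class WeightedImbalanceSketch {

  /** {multiplier, imbalance} pairs copied verbatim from the 22:33:40,961 log line. */
  static Map<String, double[]> loggedCosts() {
    Map<String, double[]> costs = new LinkedHashMap<>();
    costs.put("RegionCountSkewCostFunction", new double[] {500.0, 0.07721156356401288});
    costs.put("PrimaryRegionCountSkewCostFunction", new double[] {500.0, 0.06298215530179263});
    costs.put("MoveCostFunction", new double[] {7.0, 0.0});
    costs.put("ServerLocalityCostFunction", new double[] {25.0, 0.463289517245148});
    costs.put("RackLocalityCostFunction", new double[] {15.0, 0.25670928199727017});
    costs.put("TableSkewCostFunction", new double[] {500.0, 0.4378048676389543});
    costs.put("RegionReplicaHostCostFunction", new double[] {100000.0, 0.05809798270893372});
    costs.put("RegionReplicaRackCostFunction", new double[] {10000.0, 0.061018251681075886});
    costs.put("ReadRequestCostFunction", new double[] {5.0, 0.08235908576054465});
    costs.put("WriteRequestCostFunction", new double[] {5.0, 0.09385090828285425});
    costs.put("MemStoreSizeCostFunction", new double[] {5.0, 0.1327376982847744});
    costs.put("StoreFileCostFunction", new double[] {5.0, 0.07986594927573858});
    return costs;
  }

  /** Sum of multiplier * imbalance over every logged cost function. */
  static double weightedTotal(Map<String, double[]> costs) {
    double total = 0.0;
    for (double[] c : costs.values()) {
      total += c[0] * c[1];
    }
    return total;
  }

  /** Fraction of the weighted total contributed by one named cost function. */
  static double share(Map<String, double[]> costs, String name) {
    double[] c = costs.get(name);
    return (c[0] * c[1]) / weightedTotal(costs);
  }

  public static void main(String[] args) {
    Map<String, double[]> costs = loggedCosts();
    System.out.printf("weighted total = %.6f%n", weightedTotal(costs));
    for (String name : costs.keySet()) {
      System.out.printf("%-36s share = %.4f%n", name, share(costs, name));
    }
  }
}
```

With these inputs the sum agrees with the logged 6726.357026325619 to within rounding. RegionReplicaHostCostFunction alone accounts for about 86% of the total and RegionReplicaRackCostFunction for about 9%, so the two replica cost functions dominate the reported imbalance in this run.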
 

  was:In production, we found a corner case where balancer cannot make progress 
when there is cohosted replica. This is reproed on master branch using test 
added in HBASE-26310. The cost function isn't provide proper evaluation so 
balancer could make progress.


> Balancer gets stuck in cohosted replica distribution
> ----------------------------------------------------
>
>                 Key: HBASE-26311
>                 URL: https://issues.apache.org/jira/browse/HBASE-26311
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer
>            Reporter: Clara Xiong
>            Assignee: Clara Xiong
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
