[
https://issues.apache.org/jira/browse/HBASE-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Clara Xiong updated HBASE-26311:
--------------------------------
Description:
In production, we found a corner case where the balancer cannot make progress
when there are cohosted replicas. This was reproduced on the master branch
using the test added in HBASE-26310. The two cost functions do not provide a
proper evaluation, so the balancer cannot make progress.
Another observation is that the imbalance weight is not updated properly by the
cost functions during plan generation. The subsequent run reports a much higher
imbalance (6726.36) than the one the previous run finished with (21.04).
{quote}2021-09-24 22:26:56,039 INFO
org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Finished
computing new moving plan. Computation took 2400001 ms to try 1284702 different
iterations. Found a solution that moves 6941 regions; Going from a computed
imbalance of 6389.260497305375 to a new imbalance of 21.03904901349833.
2021-09-24 22:33:40,961 INFO
org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Running
balancer because at least one server hosts replicas of the same region.
2021-09-24 22:33:40,961 INFO
org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Start
StochasticLoadBalancer.balancer, initial weighted average
imbalance=6726.357026325619, functionCost=RegionCountSkewCostFunction :
(multiplier=500.0, imbalance=0.07721156356401288);
PrimaryRegionCountSkewCostFunction : (multiplier=500.0,
imbalance=0.06298215530179263); MoveCostFunction : (multiplier=7.0,
imbalance=0.0, balanced); ServerLocalityCostFunction : (multiplier=25.0,
imbalance=0.463289517245148); RackLocalityCostFunction : (multiplier=15.0,
imbalance=0.25670928199727017); TableSkewCostFunction : (multiplier=500.0,
imbalance=0.4378048676389543); RegionReplicaHostCostFunction :
(multiplier=100000.0, imbalance=0.05809798270893372);
RegionReplicaRackCostFunction : (multiplier=10000.0,
imbalance=0.061018251681075886); ReadRequestCostFunction : (multiplier=5.0,
imbalance=0.08235908576054465); WriteRequestCostFunction : (multiplier=5.0,
imbalance=0.09385090828285425); MemStoreSizeCostFunction : (multiplier=5.0,
imbalance=0.1327376982847744); StoreFileCostFunction : (multiplier=5.0,
imbalance=0.07986594927573858); computedMaxSteps=5579331200
{quote}
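As a cross-check on the second log entry, the per-function values can be recombined by hand. The sketch below assumes that the logged "weighted average imbalance" is simply the sum of multiplier * imbalance over all cost functions. That is an inference from the numbers in the log, not something taken from the balancer source, and the class name and array layout are hypothetical.

```java
// Hypothetical helper, not part of HBase: recombines the values from the log.
class WeightedImbalanceCheck {
    static final String[] NAMES = {
        "RegionCountSkewCostFunction", "PrimaryRegionCountSkewCostFunction",
        "MoveCostFunction", "ServerLocalityCostFunction",
        "RackLocalityCostFunction", "TableSkewCostFunction",
        "RegionReplicaHostCostFunction", "RegionReplicaRackCostFunction",
        "ReadRequestCostFunction", "WriteRequestCostFunction",
        "MemStoreSizeCostFunction", "StoreFileCostFunction"
    };

    // {multiplier, imbalance} pairs copied from the 22:33:40,961 log entry.
    static final double[][] COSTS = {
        {500.0, 0.07721156356401288},
        {500.0, 0.06298215530179263},
        {7.0, 0.0},
        {25.0, 0.463289517245148},
        {15.0, 0.25670928199727017},
        {500.0, 0.4378048676389543},
        {100000.0, 0.05809798270893372},
        {10000.0, 0.061018251681075886},
        {5.0, 0.08235908576054465},
        {5.0, 0.09385090828285425},
        {5.0, 0.1327376982847744},
        {5.0, 0.07986594927573858}
    };

    // Assumption: the total is the plain sum of multiplier * imbalance.
    static double weightedSum(double[][] costs) {
        double total = 0.0;
        for (double[] c : costs) {
            total += c[0] * c[1];
        }
        return total;
    }

    public static void main(String[] args) {
        double total = weightedSum(COSTS);
        // Print each function's contribution and its share of the total.
        for (int i = 0; i < COSTS.length; i++) {
            double contribution = COSTS[i][0] * COSTS[i][1];
            System.out.printf("%-36s %12.4f (%5.1f%%)%n",
                NAMES[i], contribution, 100.0 * contribution / total);
        }
        System.out.printf("total = %.6f%n", total);
    }
}
```

Under that assumption the sum agrees with the logged value of 6726.357026325619, and the RegionReplicaHostCostFunction term alone contributes about 5809.8 of it, so the replica cost functions dominate the reported imbalance.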
was: In production, we found a corner case where the balancer cannot make
progress when there are cohosted replicas. This was reproduced on the master
branch using the test added in HBASE-26310. The cost function does not provide
a proper evaluation, so the balancer cannot make progress.
> Balancer gets stuck in cohosted replica distribution
> ----------------------------------------------------
>
> Key: HBASE-26311
> URL: https://issues.apache.org/jira/browse/HBASE-26311
> Project: HBase
> Issue Type: Bug
> Components: Balancer
> Reporter: Clara Xiong
> Assignee: Clara Xiong
> Priority: Major
>
> In production, we found a corner case where the balancer cannot make progress
> when there are cohosted replicas. This was reproduced on the master branch
> using the test added in HBASE-26310. The two cost functions do not provide a
> proper evaluation, so the balancer cannot make progress.
>
> Another observation is that the imbalance weight is not updated properly by
> the cost functions during plan generation. The subsequent run reports a much
> higher imbalance (6726.36) than the one the previous run finished with (21.04).
> {quote}2021-09-24 22:26:56,039 INFO
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Finished
> computing new moving plan. Computation took 2400001 ms to try 1284702
> different iterations. Found a solution that moves 6941 regions; Going from a
> computed imbalance of 6389.260497305375 to a new imbalance of
> 21.03904901349833.
> 2021-09-24 22:33:40,961 INFO
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Running
> balancer because at least one server hosts replicas of the same region.
> 2021-09-24 22:33:40,961 INFO
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Start
> StochasticLoadBalancer.balancer, initial weighted average
> imbalance=6726.357026325619, functionCost=RegionCountSkewCostFunction :
> (multiplier=500.0, imbalance=0.07721156356401288);
> PrimaryRegionCountSkewCostFunction : (multiplier=500.0,
> imbalance=0.06298215530179263); MoveCostFunction : (multiplier=7.0,
> imbalance=0.0, balanced); ServerLocalityCostFunction : (multiplier=25.0,
> imbalance=0.463289517245148); RackLocalityCostFunction : (multiplier=15.0,
> imbalance=0.25670928199727017); TableSkewCostFunction : (multiplier=500.0,
> imbalance=0.4378048676389543); RegionReplicaHostCostFunction :
> (multiplier=100000.0, imbalance=0.05809798270893372);
> RegionReplicaRackCostFunction : (multiplier=10000.0,
> imbalance=0.061018251681075886); ReadRequestCostFunction : (multiplier=5.0,
> imbalance=0.08235908576054465); WriteRequestCostFunction : (multiplier=5.0,
> imbalance=0.09385090828285425); MemStoreSizeCostFunction : (multiplier=5.0,
> imbalance=0.1327376982847744); StoreFileCostFunction : (multiplier=5.0,
> imbalance=0.07986594927573858); computedMaxSteps=5579331200
> {quote}
>