[
https://issues.apache.org/jira/browse/HBASE-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359498#comment-17359498
]
Clara Xiong commented on HBASE-25739:
-------------------------------------
Hi [~Xiaolin Ha], bytable doesn't work in our case because it doesn't consider
other tables during a move, which causes temporary imbalance on a cluster whose
current load is very sensitive to locality and balance.
My patch aggregates the deviation as the sum of absolute deviations per node
instead of the max. It takes more calculation but brings the accuracy to the
same level as region count skew. As in my example, if half of the cluster is
taking all the regions for a table, the skew is calculated at a very low level.
If you have many tables, it probably doesn't matter that much and takes a lot
of time, but this can be turned off or run by table in that case. The balancer
looks at the max imbalance across all tables when trying to balance them, so
even if only 2 tables out of 1000 are unbalanced, it will still run until all
tables are under minCostNeedBalance.
> TableSkewCostFunction need to use aggregated deviation
> ------------------------------------------------------
>
> Key: HBASE-25739
> URL: https://issues.apache.org/jira/browse/HBASE-25739
> Project: HBase
> Issue Type: Sub-task
> Components: Balancer, master
> Reporter: Clara Xiong
> Assignee: Clara Xiong
> Priority: Major
> Attachments:
> TEST-org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.xml,
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.txt
>
> TableSkewCostFunction uses the sum, over all tables, of the max deviation of
> region count per server as the measure of unevenness. This doesn't work in a
> very common operational scenario. Say we have 100 regions on 50 nodes, two on
> each. We add 50 new nodes, which have 0 each. The mean is now 1 region per
> node, so the max deviation from the mean is 1, compared to 99 in the
> worst-case scenario of 100 regions on a single server. The normalized cost is
> 1/99 = 0.010 < the default threshold of 0.05, so the balancer wouldn't move
> anything. The proposal is to use the aggregated deviation of the count per
> region server to detect this scenario, generating a cost of 100/198 = 0.5 in
> this case.
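
The arithmetic in the description can be checked with a small sketch. This is
only an illustration for a single table, not the TableSkewCostFunction code: the
function names are hypothetical, and the normalization (dividing by the
worst case of all regions on one server) is an assumption taken from the
numbers quoted above.

```python
# Hypothetical sketch: compares the two ways of measuring skew for ONE table.
# The normalization by the "all regions on one server" worst case is an
# assumption based on the 1/99 and 100/198 figures in the description.

MIN_COST_NEED_BALANCE = 0.05  # default threshold quoted in the description


def max_deviation_cost(counts):
    """Largest per-server deviation from the mean, normalized to [0, 1]."""
    total = sum(counts)
    mean = total / len(counts)
    worst = total - mean  # one server holds every region
    if worst == 0:
        return 0.0
    return max(abs(c - mean) for c in counts) / worst


def aggregated_deviation_cost(counts):
    """Sum of absolute per-server deviations, normalized to [0, 1]."""
    n = len(counts)
    total = sum(counts)
    mean = total / n
    # Worst case: one server deviates by (total - mean), and each of the
    # other n - 1 servers deviates by mean.
    worst = (total - mean) + (n - 1) * mean
    if worst == 0:
        return 0.0
    return sum(abs(c - mean) for c in counts) / worst


if __name__ == "__main__":
    # 100 regions, two on each of 50 old nodes, none on 50 new nodes.
    counts = [2] * 50 + [0] * 50
    for name, fn in (("max", max_deviation_cost),
                     ("aggregated", aggregated_deviation_cost)):
        cost = fn(counts)
        print(name, round(cost, 3), "balance" if cost > MIN_COST_NEED_BALANCE
              else "skip")
```

With these inputs the max-based cost stays below the 0.05 threshold while the
aggregated cost is about 0.5, which matches the scenario in the description.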
--
This message was sent by Atlassian Jira
(v8.3.4#803005)