[
https://issues.apache.org/jira/browse/HBASE-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367577#comment-17367577
]
Clara Xiong commented on HBASE-25739:
-------------------------------------
There is another bug in the original tableSkew cost function, in how it
aggregates the cost per table:
If we have 10 regions, one per table, evenly distributed on 10 nodes, the cost
is scaled to 1.0.
The more tables we have, the closer the value gets to 1.0, so the cost function
becomes useless.
All the balancer tests were set up with large numbers of tables and minimal
regions per table. This artificially inflates the total cost and triggers
balancer runs. With this fix to TableSkewCostFunction, we need to overhaul the
tests too.
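{code:java}
protected double cost() {
  // Largest possible sum: each table has all of its regions on one server.
  double max = cluster.numRegions;
  // Smallest possible sum: regions spread evenly across all servers.
  double min = ((double) cluster.numRegions) / cluster.numServers;
  double value = 0;
  // Sum, over all tables, of the largest per-server region count of that table.
  for (int i = 0; i < cluster.numMaxRegionsPerTable.length; i++) {
    value += cluster.numMaxRegionsPerTable[i];
  }
  LOG.info("min = {}, max = {}, cost= {}", min, max, value);
  return scale(min, max, value);
}
{code}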
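The effect described above can be reproduced outside HBase with a minimal
sketch. The class name TableSkewAggregationDemo and its scale() helper are
hypothetical stand-ins; scale() is assumed to normalize linearly between min
and max and clamp to [0, 1], which may differ in detail from the balancer's
own implementation.

```java
import java.util.Arrays;

class TableSkewAggregationDemo {
    // Assumed stand-in for the balancer's scale(): linear normalization
    // of value between min and max, clamped to [0, 1].
    static double scale(double min, double max, double value) {
        if (max <= min || value <= min) {
            return 0;
        }
        return Math.max(0d, Math.min(1d, (value - min) / (max - min)));
    }

    // Same aggregation as the quoted cost(): sum the per-table maximum
    // region count found on any single server, then scale it.
    static double cost(int numRegions, int numServers, int[] numMaxRegionsPerTable) {
        double max = numRegions;
        double min = ((double) numRegions) / numServers;
        double value = 0;
        for (int maxForTable : numMaxRegionsPerTable) {
            value += maxForTable;
        }
        return scale(min, max, value);
    }

    public static void main(String[] args) {
        // 10 tables, one region each, spread evenly over 10 servers:
        // every table's maximum per-server count is 1.
        int[] oneRegionPerTable = new int[10];
        Arrays.fill(oneRegionPerTable, 1);
        System.out.println(cost(10, 10, oneRegionPerTable)); // prints 1.0

        // 5 tables, two regions each, still perfectly spread over 10 servers.
        int[] twoRegionsPerTable = new int[5];
        Arrays.fill(twoRegionsPerTable, 1);
        System.out.println(cost(10, 10, twoRegionsPerTable));
    }
}
```

Both layouts are perfectly balanced, yet the first is reported at the maximum
cost and the second at a clearly nonzero one, which is the inflation the tests
were relying on.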
> TableSkewCostFunction need to use aggregated deviation
> ------------------------------------------------------
>
> Key: HBASE-25739
> URL: https://issues.apache.org/jira/browse/HBASE-25739
> Project: HBase
> Issue Type: Sub-task
> Components: Balancer, master
> Reporter: Clara Xiong
> Assignee: Clara Xiong
> Priority: Major
> Attachments:
> TEST-org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.xml,
>
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.txt
>
>
> TableSkewCostFunction uses the sum, over all tables, of the max deviation in
> regions per server as the measure of unevenness. This fails in a very common
> operational scenario. Say we have 100 regions on 50 nodes, two on each. We
> add 50 new nodes, each holding 0 regions. The mean is now 1 region per
> server, so the max deviation from the mean is 1, compared to 99 in the
> worst-case scenario of 100 regions on a single server. The normalized cost
> is 1/99 = 0.0101, below the default threshold of 0.05, so the balancer
> wouldn't move any regions. The proposal is to use the aggregated deviation
> of the region count per region server to detect this scenario, generating a
> cost of 100/198 = 0.505 in this case.
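
The arithmetic in the description can be checked with a small standalone
sketch. It is not the balancer code: the class DeviationCostSketch and its
helper methods are hypothetical, and the cost is assumed to be the deviation
of the current layout divided by the deviation of the worst case (all regions
on one server).

```java
import java.util.Arrays;

class DeviationCostSketch {
    // Largest absolute distance of any server's region count from the mean.
    static double maxDeviation(int[] regionsPerServer) {
        double mean = Arrays.stream(regionsPerServer).average().orElse(0);
        return Arrays.stream(regionsPerServer)
            .mapToDouble(count -> Math.abs(count - mean))
            .max()
            .orElse(0);
    }

    // Sum of the absolute distances of every server's region count from the mean.
    static double aggregatedDeviation(int[] regionsPerServer) {
        double mean = Arrays.stream(regionsPerServer).average().orElse(0);
        return Arrays.stream(regionsPerServer)
            .mapToDouble(count -> Math.abs(count - mean))
            .sum();
    }

    // Worst case used for normalization: every region on a single server.
    static int[] allOnOneServer(int regions, int servers) {
        int[] counts = new int[servers];
        counts[0] = regions;
        return counts;
    }

    // Scenario from the description: 50 servers with 2 regions each,
    // plus 50 newly added servers with none.
    static int[] afterAddingEmptyServers() {
        int[] counts = new int[100];
        Arrays.fill(counts, 0, 50, 2);
        return counts;
    }

    public static void main(String[] args) {
        int[] current = afterAddingEmptyServers();
        int[] worst = allOnOneServer(100, 100);
        // 1/99: stays below a 0.05 threshold, so no balancing is triggered.
        System.out.println(maxDeviation(current) / maxDeviation(worst));
        // 100/198: well above a 0.05 threshold.
        System.out.println(aggregatedDeviation(current) / aggregatedDeviation(worst));
    }
}
```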
--
This message was sent by Atlassian Jira
(v8.3.4#803005)