[ 
https://issues.apache.org/jira/browse/HBASE-24139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Busbey updated HBASE-24139:
--------------------------------
    Description: 
After HBASE-15529 the StochasticLoadBalancer decides whether to run based on 
its internal cost functions rather than the simple region count skew of 
BaseLoadBalancer.

Given the default weights for those cost functions, the default minimum cost 
that indicates a need to rebalance, and a density of ~90 regions per region 
server, the balancer is not very responsive to newly added region servers on 
clusters of non-trivial size:

* For clusters of ~10 nodes, the defaults treat a single RS at 0 regions as a 
reason to balance.
* For clusters of >20 nodes, the defaults will not treat a single RS at 0 
regions as a reason to balance. 2 RS at 0 will cause it to balance.
* For clusters of ~100 nodes, having 6 RS with no regions will still not meet 
the threshold to cause a balance.

Note that this is only the decision whether to look at balancer plans at all. 
The calculation is dominated by the region count skew (it has weight 500, 
while all other weights together come to ~105), so barring a very significant 
change in all the other cost functions this condition will persist 
indefinitely.
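
To make the arithmetic concrete, here is a rough sketch of the decision in 
Python. It is a simplified model and not the balancer's actual code. It 
assumes that the skew cost is the sum of absolute deviations from the mean 
region count, normalized by the worst case where one server holds everything, 
that every other cost function reports 0, and that the default minimum cost is 
0.05. The function names are made up for illustration. Under those assumptions 
it reproduces the direction of the first two bullets, but it is too coarse to 
match borderline cases exactly.

```python
def skew_cost(region_counts):
    """Normalized region count skew in [0, 1] (assumed model, see above)."""
    n = len(region_counts)
    total = sum(region_counts)
    if n < 2 or total == 0:
        return 0.0
    mean = total / n
    # Sum of absolute deviations from the mean region count.
    deviation = sum(abs(count - mean) for count in region_counts)
    # Worst case: one server holds every region, all others hold none.
    worst = (n - 1) * mean + (total - mean)
    return deviation / worst


def needs_balance(region_counts, skew_weight=500.0, other_weights=105.0,
                  min_cost=0.05):
    """True when the weighted, normalized cost reaches min_cost.

    Assumption: all cost functions other than region count skew report 0,
    so they only add to the denominator.
    """
    weighted = skew_weight * skew_cost(region_counts)
    return weighted / (skew_weight + other_weights) >= min_cost


# ~90 regions per server, with one or two freshly added empty servers.
print(needs_balance([90] * 9 + [0]))       # 10 servers, 1 empty -> True
print(needs_balance([90] * 20 + [0]))      # 21 servers, 1 empty -> False
print(needs_balance([90] * 19 + [0, 0]))   # 21 servers, 2 empty -> True
```

In this model the normalized skew for k empty servers out of N works out to 
k / (N - 1), which is why a fixed threshold becomes harder to reach as the 
cluster grows.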

Two possible approaches:

* add a new cost function that's essentially "don't have RS with 0 regions" 
that an operator can tune
* add a short circuit condition for the {{needsBalance}} method that checks for 
empty RS similar to the check we do for colocated region replicas

For those currently hitting this, an easy workaround is to set 
{{hbase.master.balancer.stochastic.minCostNeedBalance}} to {{0.01}}. With that 
setting, a single RS having 0 regions will cause the balancer to run on 
clusters of up to ~90 region servers. This is essentially the same as the 
default slop of 0.01 used by the BaseLoadBalancer.
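
As a sketch, the workaround could look like the fragment below. Placing it in 
the master's hbase-site.xml is an assumption here; the property name and value 
are the ones given above.

```xml
<!-- Assumed location: hbase-site.xml on the HBase master -->
<property>
  <name>hbase.master.balancer.stochastic.minCostNeedBalance</name>
  <value>0.01</value>
</property>
```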

> Balancer should avoid leaving idle region servers
> -------------------------------------------------
>
>                 Key: HBASE-24139
>                 URL: https://issues.apache.org/jira/browse/HBASE-24139
>             Project: HBase
>          Issue Type: Improvement
>          Components: Balancer, Operability
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.3.4#803005)