[jira] [Updated] (HBASE-25697) StochasticBalancer improvement for large scale clusters

Clara Xiong (Jira) Thu, 09 Sep 2021 15:15:11 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-25697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Clara Xiong updated HBASE-25697:
--------------------------------
    Description: 
h2. Findings on a large scale cluster (100,000 regions on 300 nodes)
 * Balancer starts and stops before getting a plan
 * Adding new racks doesn’t trigger balancer
 * Balancer stops, leaving some racks with 50% lower region counts
 * Regions for large tables don’t get evenly distributed
 * Observability is poor
 * Too many knobs make tuning empirical, requiring many experiments

h2. Improvements made and being made
 * Cost function enhancement to capture outliers, especially table skew: 
https://issues.apache.org/jira/browse/HBASE-25625?filter=-2 
 * Explain why the balancer stops: https://issues.apache.org/jira/browse/HBASE-25666 
(will also backport https://issues.apache.org/jira/browse/HBASE-24528)

h2. More proposals
 * minCostNeedBalance for each cost function instead of weights. We want to 
trigger balancing if any factor is out of balance instead of trying to combine 
the factors with arbitrary weights. This makes operation and configuration much 
easier.
 * Simulated annealing: lower minCostNeedBalance periodically to unstick the 
balancer from a sub-optimum, then gradually increase it to keep the system 
stable. Also add the cost of moves as a countermeasure in the decision 
[https://opensourcelibs.com/lib/tempest]
 * Orchestrated scheduling of compaction, normalizer and balancer
 * PID approach [https://www.amazon.com/dp/1449361692/ref=rdr_ext_tmb]
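
To illustrate the first proposal above, here is a minimal sketch of how a per-cost-function threshold differs from a single threshold on a weighted combination. It is not HBase code: the class, the method names, the cost-function labels, and all numbers are hypothetical, and the weighted check is only a simplified stand-in for a combined check. Costs are assumed to be normalized to the range 0..1.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names are HBase APIs. */
final class PerFunctionThreshold {

    /** Simplified combined check: weighted average of all costs against one threshold. */
    static boolean weightedNeedsBalance(Map<String, Double> costs,
                                        Map<String, Double> weights,
                                        double minCostNeedBalance) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<String, Double> e : costs.entrySet()) {
            double w = weights.getOrDefault(e.getKey(), 0.0);
            weighted += w * e.getValue();
            totalWeight += w;
        }
        return totalWeight > 0 && weighted / totalWeight > minCostNeedBalance;
    }

    /**
     * Per-function check: returns the name of the first cost function whose cost
     * exceeds its own threshold, or null if every factor is within bounds.
     * Functions without an explicit threshold fall back to defaultThreshold.
     */
    static String firstOutOfBalance(Map<String, Double> costs,
                                    Map<String, Double> thresholds,
                                    double defaultThreshold) {
        for (Map.Entry<String, Double> e : costs.entrySet()) {
            double limit = thresholds.getOrDefault(e.getKey(), defaultThreshold);
            if (e.getValue() > limit) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: one badly skewed factor, two healthy ones.
        Map<String, Double> costs = new LinkedHashMap<>();
        costs.put("regionCount", 0.02);
        costs.put("tableSkew", 0.30);
        costs.put("locality", 0.01);

        // Illustrative weights; the heavily weighted healthy factor dominates the average.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("regionCount", 500.0);
        weights.put("tableSkew", 35.0);
        weights.put("locality", 25.0);

        System.out.println("weighted check triggers: "
            + weightedNeedsBalance(costs, weights, 0.05));
        System.out.println("first factor out of balance: "
            + firstOutOfBalance(costs, new LinkedHashMap<>(), 0.05));
    }
}
```

With these made-up inputs the weighted average is about 0.037, so the combined check stays below 0.05 and does not trigger, while the per-function check reports the skewed factor. Returning the name of the offending factor also gives a natural hook for explaining why the balancer did or did not run.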

  was:
h2. Findings on a large scale cluster (100,000 regions on 300 nodes)
 * Balancer starts and stops before getting a plan
 * Adding new racks doesn’t trigger balancer
 * Balancer stops leaving some racks at 50% lower region counts
 * Regions for large tables don’t get evenly distributed
 * Observability is poor
 * Too many knobs makes tuning empirical and takes many experiments

h2. Improvements made and bing made
 * Cost function enhancement to capture outliers especially table skew. 
https://issues.apache.org/jira/browse/HBASE-25625?filter=-2 
 * Explain why balancer stops https://issues.apache.org/jira/browse/HBASE-25666 
will back port too https://issues.apache.org/jira/browse/HBASE-24528

h2. More proposals
 * minCostNeedBalance for each cost function instead of weights. We want to 
trigger balancing if any factor is out of balancer instead of trying to combine 
the factors in arbitrary weights. This makes operation and configuration much 
easier.
 * Simulated annealing to lower minCostNeedBalance periodically to unstuck the 
balancer from sub-optimum then gradually increase to keep the system stable. 
Also add cost of move as a counter measure for the decision 
[https://opensourcelibs.com/lib/tempest]
 * Orchestrated scheduling of compaction, normalizer and balancer
 * PID approach [https://www.amazon.com/dp/1449361692/ref=rdr_ext_tmb]


> StochasticBalancer improvement for large scale clusters
> -------------------------------------------------------
>
>                 Key: HBASE-25697
>                 URL: https://issues.apache.org/jira/browse/HBASE-25697
>             Project: HBase
>          Issue Type: Improvement
>          Components: Balancer, master, UI
>            Reporter: Clara Xiong
>            Priority: Major
>
> h2. Findings on a large scale cluster (100,000 regions on 300 nodes)
>  * Balancer starts and stops before getting a plan
>  * Adding new racks doesn’t trigger balancer
>  * Balancer stops, leaving some racks with 50% lower region counts
>  * Regions for large tables don’t get evenly distributed
>  * Observability is poor
>  * Too many knobs make tuning empirical, requiring many experiments
> h2. Improvements made and being made
>  * Cost function enhancement to capture outliers, especially table skew: 
> https://issues.apache.org/jira/browse/HBASE-25625?filter=-2 
>  * Explain why the balancer stops: 
> https://issues.apache.org/jira/browse/HBASE-25666 (will also backport 
> https://issues.apache.org/jira/browse/HBASE-24528)
> h2. More proposals
>  * minCostNeedBalance for each cost function instead of weights. We want to 
> trigger balancing if any factor is out of balance instead of trying to 
> combine the factors with arbitrary weights. This makes operation and 
> configuration much easier.
>  * Simulated annealing: lower minCostNeedBalance periodically to unstick 
> the balancer from a sub-optimum, then gradually increase it to keep the 
> system stable. Also add the cost of moves as a countermeasure in the decision 
> [https://opensourcelibs.com/lib/tempest]
>  * Orchestrated scheduling of compaction, normalizer and balancer
>  * PID approach [https://www.amazon.com/dp/1449361692/ref=rdr_ext_tmb]
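
The annealing proposal quoted above could take many forms. The following is one possible sketch of the schedule only, not a full simulated-annealing search and not HBase code. Every name and parameter here (baseline, floor, period, rampRuns, costPerMove) is an assumption made for illustration. The threshold drops to a floor every `period` balancer runs, climbs linearly back to the baseline over `rampRuns` runs, and a separate check weighs the cost improvement of a plan against a per-move penalty.

```java
/** Hypothetical sketch; none of these names are HBase APIs. */
final class AnnealedThreshold {
    private final double baseline; // steady-state threshold
    private final double floor;    // lowered threshold used right after a periodic drop
    private final int period;      // balancer runs between two drops
    private final int rampRuns;    // runs taken to climb from floor back to baseline

    AnnealedThreshold(double baseline, double floor, int period, int rampRuns) {
        if (floor > baseline || period <= 0 || rampRuns <= 0 || rampRuns > period) {
            throw new IllegalArgumentException("inconsistent schedule parameters");
        }
        this.baseline = baseline;
        this.floor = floor;
        this.period = period;
        this.rampRuns = rampRuns;
    }

    /** Threshold to apply on the given balancer run (0-based, non-negative). */
    double thresholdAt(long run) {
        long phase = run % period;
        if (phase >= rampRuns) {
            return baseline; // stable part of the cycle
        }
        // Linear ramp: floor at phase 0, approaching baseline as phase nears rampRuns.
        return floor + (baseline - floor) * phase / (double) rampRuns;
    }

    /** Accept a plan only if the cost reduction outweighs a fixed penalty per region move. */
    static boolean worthMoving(double currentCost, double newCost, int moves, double costPerMove) {
        return currentCost - newCost > moves * costPerMove;
    }

    public static void main(String[] args) {
        AnnealedThreshold schedule = new AnnealedThreshold(0.05, 0.01, 10, 4);
        for (long run = 0; run < 12; run++) {
            System.out.println("run " + run + " threshold " + schedule.thresholdAt(run));
        }
        System.out.println("100 moves worthwhile: " + worthMoving(0.10, 0.08, 100, 0.0001));
        System.out.println("300 moves worthwhile: " + worthMoving(0.10, 0.08, 300, 0.0001));
    }
}
```

A linear ramp is chosen here only for simplicity. An exponential or stepwise recovery would fit the same interface, and the per-move penalty is the simplest way to express "cost of move" as a counterweight to a lowered threshold.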



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
