[ 
https://issues.apache.org/jira/browse/HBASE-18164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kahlil Oppenheimer updated HBASE-18164:
---------------------------------------
    Description: 
We noticed that the stochastic load balancer was not scaling well with 
cluster size. That is to say, on our smaller clusters (~17 tables, ~12 
region servers, ~5k regions) the balancer considers ~100,000 cluster 
configurations per 60s balancer run, but only ~5,000 per 60s on our bigger 
clusters (~82 tables, ~160 region servers, ~13k regions).

Because of this, our bigger clusters are not able to converge on balance as 
quickly for things like table skew, region load, etc.: the balancer simply 
does not have enough time to "think".

We have rewritten the locality cost function to be incremental, meaning it 
only recomputes cost based on the most recent region move proposed by the 
balancer, rather than recomputing the cost across all regions/servers on 
every iteration.
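
To illustrate the incremental approach, here is a rough sketch (not the 
actual patch; the class and field names below are made up for illustration). 
The idea is to keep a running cost total and adjust it only for the region 
involved in the proposed move:

{code:java}
// Rough sketch of the incremental idea, not the actual patch: keep a running
// total of locality "loss" and adjust it for only the region that the
// balancer proposes to move, instead of re-summing over every region/server.
class IncrementalLocalityCost {
  // localityCache[region][server] = fraction of the region's blocks local to that server
  private final float[][] localityCache;
  private double runningCost;

  IncrementalLocalityCost(float[][] localityCache, int[] initialAssignment) {
    this.localityCache = localityCache;
    for (int region = 0; region < initialAssignment.length; region++) {
      runningCost += 1.0 - localityCache[region][initialAssignment[region]];
    }
  }

  /** O(1) update when the balancer proposes moving one region. */
  void regionMoved(int region, int oldServer, int newServer) {
    runningCost -= 1.0 - localityCache[region][oldServer];
    runningCost += 1.0 - localityCache[region][newServer];
  }

  double cost() {
    return runningCost;
  }
}
{code}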

Further, we also cache the locality of every region on every server at the 
beginning of the balancer's execution for both the LocalityBasedCostFunction 
and the LocalityCandidateGenerator to reference. This way, they do not need 
to fetch the HDFS block locations of every region at each iteration of the 
balancer.
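
Roughly, the cache is a region-by-server table of locality ratios built once 
per balancer run. The sketch below is illustrative only; localityOf is a 
hypothetical stand-in for whatever reads a region's HDFS block distribution 
and reports the fraction hosted on a given server:

{code:java}
import java.util.List;
import java.util.function.BiFunction;

// Illustrative one-time cache build (not the actual patch): the table is
// filled in once per balancer run, then shared by the cost function and the
// candidate generator instead of re-reading block locations every iteration.
final class LocalityCacheBuilder {
  static float[][] build(List<String> regions, List<String> servers,
      BiFunction<String, String, Float> localityOf) {
    float[][] cache = new float[regions.size()][servers.size()];
    for (int r = 0; r < regions.size(); r++) {
      for (int s = 0; s < servers.size(); s++) {
        // fraction of this region's HDFS blocks stored locally on this server
        cache[r][s] = localityOf.apply(regions.get(r), servers.get(s));
      }
    }
    return cache;
  }
}
{code}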

The changes have been running in all 6 of our production clusters and all 4 
QA clusters without issue. The speed improvement is substantial: our big 
clusters now consider roughly 20x more cluster configurations per balancer 
run.

We are currently preparing a patch for submission.

  was:
We noticed that the stochastic load balancer was not scaling well with 
cluster size. That is to say, on our smaller clusters (~17 tables, ~12 
region servers, ~5k regions) the balancer considers ~100,000 cluster 
configurations per 60s balancer run, but only ~5,000 per 60s on our bigger 
clusters (~82 tables, ~160 region servers, ~13k regions).

Because of this, our bigger clusters are not able to converge on balance as 
quickly for things like table skew, region load, etc.: the balancer simply 
does not have enough time to "think".

We have rewritten the locality cost function to be incremental, meaning it 
only recomputes cost based on the most recent region move proposed by the 
balancer, rather than recomputing the cost across all regions/servers on 
every iteration.

Further, we also cache the locality of every region on every server at the 
beginning of the balancer's execution for both the LocalityBasedCostFunction 
and the LocalityCandidateGenerator to reference. This way, they do not need 
to fetch the HDFS block locations of every region at each iteration of the 
balancer.

The changes have been running in all 6 of our production clusters and all 4 
QA clusters without issue. The speed improvement is substantial: our big 
clusters now consider roughly 20x more cluster configurations per balancer 
run.

One other important design decision we made was to compute locality cost as a 
measure of how good locality currently is compared to the best it could be 
(given the current cluster state). The old cost function assumed that 100% 
locality was always possible and calculated the cost as the difference from 
that state. Instead, the new computation measures the difference from the 
best locality actually achievable across the cluster. This means that if a 
region server has 75% locality for a given region, but that region has less 
than 75% locality on all other servers, then the locality cost associated 
with that region is 0.
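
As a rough illustration of that calculation (the names below are ours, not 
the actual patch):

{code:java}
// Illustrative sketch only: the per-region cost is how far the current
// placement falls short of the best locality achievable for that region
// anywhere in the cluster, rather than the distance from 100%.
static double regionLocalityCost(float[] localityOnEachServer, int currentServer) {
  float best = 0f;
  for (float locality : localityOnEachServer) {
    best = Math.max(best, locality);
  }
  // a region already on its best possible server costs 0, even if that best
  // locality is below 100%
  return best - localityOnEachServer[currentServer];
}
{code}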


> Much faster locality cost function and candidate generator
> ----------------------------------------------------------
>
>                 Key: HBASE-18164
>                 URL: https://issues.apache.org/jira/browse/HBASE-18164
>             Project: HBase
>          Issue Type: Improvement
>          Components: Balancer
>            Reporter: Kahlil Oppenheimer
>            Assignee: Kahlil Oppenheimer
>            Priority: Critical
>         Attachments: HBASE-18164-00.patch
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
