[ 
https://issues.apache.org/jira/browse/HBASE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702704#comment-15702704
 ] 

Phil Yang commented on HBASE-17178:
-----------------------------------

If I understand correctly, we are going to limit the number of concurrent RIT
regions according to the number of plans and the time budget of this round.
But the historical average RIT duration may not reflect the current state of
the cluster. If RIT takes longer than expected, especially when we are moving
many regions, we may have to cut this round off. So the probability of being
cut off is higher than before, and we may not reach a balanced state when
there are too many plans.

How about limiting the frequency of executing assignmentManager.balance(plan)?
For example, if we are going to execute 100 plans and
hbase.balancer.max.balancing is 100 seconds, we just run the plans one by one,
one per second. We may have more concurrent RIT regions if RIT duration is
long, but the probability of being cut off will not be higher than before. And
of course the sleep time can be shorter: if there are no RIT regions we can
start a new plan immediately, so we can finish all plans earlier when there
are only a few plans.
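
A rough sketch of the pacing idea, to make the proposal concrete. This is not
HBase code: the class PacedBalanceSketch, its method names, and the two
callbacks (one standing in for assignmentManager.balance(plan), one for a
count of regions currently in transition) are all hypothetical.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.IntSupplier;

/** Hypothetical helper, not part of HBase: spreads plans over a time budget. */
final class PacedBalanceSketch {

    /** Interval between plan starts so that all plans fit in the budget. */
    static long intervalMillis(long maxBalancingMillis, int planCount) {
        if (planCount <= 0) {
            return 0L;
        }
        return maxBalancingMillis / planCount;
    }

    /** No wait when nothing is in transition, otherwise the full interval. */
    static long delayBeforeNextPlan(long intervalMillis, int regionsInTransition) {
        return regionsInTransition == 0 ? 0L : intervalMillis;
    }

    /**
     * Runs the plans one by one. executePlan stands in for
     * assignmentManager.balance(plan); regionsInTransition is an assumed
     * callback that reports the current RIT count.
     */
    static <P> void runPaced(List<P> plans, long maxBalancingMillis,
                             Consumer<P> executePlan,
                             IntSupplier regionsInTransition) {
        long interval = intervalMillis(maxBalancingMillis, plans.size());
        for (int i = 0; i < plans.size(); i++) {
            executePlan.accept(plans.get(i));
            if (i == plans.size() - 1) {
                break; // nothing to wait for after the last plan
            }
            long delay = delayBeforeNextPlan(interval, regionsInTransition.getAsInt());
            if (delay > 0) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return; // stop the round if we are interrupted
                }
            }
        }
    }
}
```

With 100 plans and a 100 second budget the interval is one second, and an idle
cluster (no RIT regions) runs the plans back to back.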

> Add region balance throttling
> -----------------------------
>
>                 Key: HBASE-17178
>                 URL: https://issues.apache.org/jira/browse/HBASE-17178
>             Project: HBase
>          Issue Type: Improvement
>          Components: Balancer
>            Reporter: Guanghao Zhang
>            Assignee: Guanghao Zhang
>         Attachments: HBASE-17178-v1.patch
>
>
> Our online cluster serves dozens of tables, and different tables serve 
> different services. If the balancer moves too many regions at the same time, 
> it decreases availability for some tables or services. So we added 
> region balance throttling on our online cluster. 
> We introduce a new config, hbase.balancer.max.balancing.regions, which is 
> the max number of regions in transition when balancing.
> If we set this to 1 and a table has 100 regions, then the table will have 
> at least 99 regions available at any time. It helps a lot for our use case 
> and it has been running for a long time on our production cluster.
> But for some use cases, we need the balancer to run faster. Suppose a 
> cluster has 100 regionservers and adds 50 new regionservers for peak 
> requests. Then it needs the balancer to run as soon as possible so that the 
> cluster reaches a balanced state quickly. Our idea is to compute the max 
> number of regions in transition from the max balancing time and the average 
> time of a region in transition.
> The balancer then uses the computed value for throttling.
> Examples:
> A cluster has 100 regionservers, each regionserver has 200 regions, the 
> average time of a region in transition is 1 second, and we configure the max 
> balancing time as 10 * 60 seconds.
> Case 1. One regionserver crashes; the cluster needs to balance at most 200 
> regions. Then 200 / (10 * 60s / 1s) < 1, which means the max number of 
> regions in transition is 1 when balancing. The balancer can then move 
> regions one by one and the cluster will have high availability while 
> balancing.
> Case 2. Another 100 regionservers are added; the cluster needs to balance at 
> most 10000 regions. Then 10000 / (10 * 60s / 1s) = 16.7, which means the max 
> number of regions in transition is 17 when balancing. The cluster can then 
> reach a balanced state within the max balancing time.
> Any suggestions are welcome.
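
For reference, the arithmetic in the two cases of the quoted description can
be sketched as below. This is not taken from the attached patch: the class
and method names are hypothetical, and rounding up with a floor of 1 is an
assumption inferred from the two cases (a value below 1 becomes 1, and 16.7
becomes 17).

```java
/** Hypothetical helper, not from the patch: derives the RIT limit for a round. */
final class MaxRitSketch {

    /**
     * regionsToMove / (maxBalancingSeconds / avgRitSeconds), rounded up,
     * and never less than 1 (assumed rounding, see the note above).
     */
    static int maxRegionsInTransition(int regionsToMove,
                                      long maxBalancingSeconds,
                                      double avgRitSeconds) {
        // How many moves fit back to back within the balancing time.
        double sequentialSlots = maxBalancingSeconds / avgRitSeconds;
        int needed = (int) Math.ceil(regionsToMove / sequentialSlots);
        return Math.max(1, needed);
    }
}
```

Case 1 corresponds to maxRegionsInTransition(200, 600, 1.0) and Case 2 to
maxRegionsInTransition(10000, 600, 1.0).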



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
