Xiaolin Ha created HBASE-21260:
----------------------------------

             Summary: The whole balancer plans might be aborted if there are 
more than one plans to move the same region 
                 Key: HBASE-21260
                 URL: https://issues.apache.org/jira/browse/HBASE-21260
             Project: HBase
          Issue Type: Bug
          Components: Balancer, master
    Affects Versions: 2.0.0, 2.1.0
            Reporter: Xiaolin Ha
            Assignee: Xiaolin Ha


In SimpleLoadBalancer, plans are first generated from the average number of 
regions per server for a table. Each server is randomly assigned either 
floor(average) or ceiling(average) regions (if the average is not an integer). 
Afterwards, however, the balanceOverall method might generate new plans for 
some regions of the table to balance server loads across the whole cluster. As 
a result, one call of balance can produce more than one plan to move the same 
region. 
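
To illustrate the conflict described above, here is a minimal sketch that 
groups a plan list by region and reports regions targeted by more than one 
plan. MovePlan and DuplicatePlanFinder are hypothetical names used only for 
this example; they are not the HBase RegionPlan class or any balancer API. 

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DuplicatePlanFinder {
    /** Hypothetical stand-in for a balancer plan; not HBase's RegionPlan. */
    static final class MovePlan {
        final String region;
        final String source;
        final String destination;

        MovePlan(String region, String source, String destination) {
            this.region = region;
            this.source = source;
            this.destination = destination;
        }
    }

    /** Groups plans by region and keeps only regions with two or more plans. */
    static Map<String, List<MovePlan>> findConflicts(List<MovePlan> plans) {
        Map<String, List<MovePlan>> byRegion = new LinkedHashMap<>();
        for (MovePlan plan : plans) {
            byRegion.computeIfAbsent(plan.region, r -> new ArrayList<>()).add(plan);
        }
        // Drop regions that are moved by exactly one plan.
        byRegion.values().removeIf(group -> group.size() < 2);
        return byRegion;
    }

    public static void main(String[] args) {
        // Region and server names are made up for the example.
        List<MovePlan> plans = Arrays.asList(
            new MovePlan("regionA", "server1", "server2"),
            new MovePlan("regionB", "server1", "server3"),
            new MovePlan("regionA", "server1", "server4"));
        for (Map.Entry<String, List<MovePlan>> e : findConflicts(plans).entrySet()) {
            System.out.println(e.getKey() + " has " + e.getValue().size() + " plans");
        }
    }
}
```

A check like this only detects the overlap; it does not decide which of the 
conflicting plans should win. 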

Currently, branch-2 uses async procedures to execute balancer plans. But 
concurrently moving the same region causes the balance method to fail, and 
once one plan encounters an exception, none of the subsequent plans are 
executed.
We have encountered this problem in practice; the logs are as follows:

{color:#205081}2018-09-26,12:12:38,224 INFO 
[master/c4-hadoop-tst-ct15:52900.Chore.1] 
org.apache.hadoop.hbase.master.HMaster: Balancer plans size is 3757, the 
balance interval is 79 ms, and the max number regions in transition is 25
2018-09-26,12:12:38,224 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1] 
org.apache.hadoop.hbase.master.HMaster: balance hri=1588230740, 
source=c4-hadoop-tst-st99.bj,52900,1537522783781, 
destination=c4-hadoop-tst-st28.bj,52900,1537520009497
2018-09-26,12:12:38,325 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1] 
org.apache.hadoop.hbase.master.HMaster: balance hri=1588230740, 
source=c4-hadoop-tst-st99.bj,52900,1537522783781, 
destination=c4-hadoop-tst-st29.bj,52900,1537522784188
2018-09-26,12:12:38,325 INFO [PEWorker-16] 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: pid=119197, 
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure 
table=hbase:meta, region=1588230740, REOPEN/MOVE checking lock on 1588230740
2018-09-26,12:12:38,325 ERROR [master/c4-hadoop-tst-ct15:52900.Chore.1] 
org.apache.hadoop.hbase.master.balancer.BalancerChore: Failed to balance.
org.apache.hadoop.hbase.HBaseIOException: rit=OPEN, 
location=c4-hadoop-tst-st99.bj,52900,1537522783781, table=hbase:meta, 
region=1588230740 is currently in transition
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.preTransitCheck(AssignmentManager.java:536)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:592)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:609)
        at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1707)
        at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1622)
        at 
org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:49)
        at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:186)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at 
org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745){color}

This is a serious problem because it often occurs when new RSs start or old 
RSs fail over. What's more, there is no effective way to bring the balance of 
the cluster back to normal.

But the solution to this problem may be simple. We can catch exceptions when 
executing a plan and then just skip it, so that a failed plan does not affect 
later plans in the whole plan list. Subsequent calls of balance can pick up 
the failed and skipped plans.
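
The idea can be sketched as a loop that catches the exception for each plan 
and moves on. This is a minimal illustration under stated assumptions, not 
the actual patch: MovePlan, PlanExecutor and executeAll are hypothetical 
names, and java.io.IOException stands in for the HBaseIOException shown in 
the log above. 

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SkipFailedPlans {
    /** Hypothetical stand-in for a balancer plan; not HBase's RegionPlan. */
    static final class MovePlan {
        final String region;
        final String destination;

        MovePlan(String region, String destination) {
            this.region = region;
            this.destination = destination;
        }
    }

    /** Hypothetical hook that submits one move and may reject it. */
    interface PlanExecutor {
        void submit(MovePlan plan) throws IOException;
    }

    /** Submits every plan; a failed plan is recorded and skipped, not fatal. */
    static List<MovePlan> executeAll(List<MovePlan> plans, PlanExecutor executor) {
        List<MovePlan> skipped = new ArrayList<>();
        for (MovePlan plan : plans) {
            try {
                executor.submit(plan);
            } catch (IOException e) {
                // Leave the plan for a later balance call instead of aborting.
                skipped.add(plan);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        Set<String> inTransition = new HashSet<>();
        // Toy executor: rejects a region that already has a move in flight.
        PlanExecutor executor = plan -> {
            if (!inTransition.add(plan.region)) {
                throw new IOException(plan.region + " is currently in transition");
            }
        };
        List<MovePlan> plans = Arrays.asList(
            new MovePlan("regionA", "server2"),
            new MovePlan("regionA", "server3"),
            new MovePlan("regionB", "server4"));
        List<MovePlan> skipped = executeAll(plans, executor);
        System.out.println("skipped plans: " + skipped.size());
    }
}
```

In this toy run the second plan for regionA is rejected, while the plan for 
regionB is still submitted. 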



 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
