[
https://issues.apache.org/jira/browse/HBASE-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650422#comment-16650422
]
Hudson commented on HBASE-21260:
--------------------------------
Results for branch branch-2.0
[build #954 on
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/954/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/954//General_Nightly_Build_Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2)
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/954//JDK8_Nightly_Build_Report_(Hadoop2)/]
(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/954//JDK8_Nightly_Build_Report_(Hadoop3)/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
> The whole balancer plans might be aborted if there are more than one plans to
> move a same region
> -------------------------------------------------------------------------------------------------
>
> Key: HBASE-21260
> URL: https://issues.apache.org/jira/browse/HBASE-21260
> Project: HBase
> Issue Type: Bug
> Components: Balancer, master
> Reporter: Xiaolin Ha
> Assignee: Xiaolin Ha
> Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.1, 2.0.3
>
> Attachments: HBASE-21260.branch-2.001.patch,
> HBASE-21260.branch-2.002.patch
>
>
> In SimpleLoadBalancer, plans are first generated from the average number of
> regions per server for a table. Each server is randomly assigned either
> floor(average) or ceiling(average) regions (if the average is not an
> integer). Afterwards, the balanceOverall method might generate new plans
> for some regions of the table to balance server load across the whole cluster.
> As a result, one call of balance can produce several plans that move the
> same region.
> Currently, branch-2 uses async procedures to implement balancer plans.
> Concurrent moves of the same region cause the balance method to fail, and
> none of the subsequent plans are executed once one plan encounters an
> exception.
> We have encountered this problem in practice; the logs are as follows:
> {color:#205081}2018-09-26,12:12:38,224 INFO
> [master/c4-hadoop-tst-ct15:52900.Chore.1]
> org.apache.hadoop.hbase.master.HMaster: Balancer plans size is 3757, the
> balance interval is 79 ms, and the max number regions in transition is 25
> 2018-09-26,12:12:38,224 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1]
> org.apache.hadoop.hbase.master.HMaster: balance hri=1588230740,
> source=c4-hadoop-tst-st99.bj,52900,1537522783781,
> destination=c4-hadoop-tst-st28.bj,52900,1537520009497
> 2018-09-26,12:12:38,325 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1]
> org.apache.hadoop.hbase.master.HMaster: balance hri=1588230740,
> source=c4-hadoop-tst-st99.bj,52900,1537522783781,
> destination=c4-hadoop-tst-st29.bj,52900,1537522784188
> 2018-09-26,12:12:38,325 INFO [PEWorker-16]
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler:
> pid=119197, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE;
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, REOPEN/MOVE
> checking lock on 1588230740
> 2018-09-26,12:12:38,325 ERROR [master/c4-hadoop-tst-ct15:52900.Chore.1]
> org.apache.hadoop.hbase.master.balancer.BalancerChore: Failed to balance.
> org.apache.hadoop.hbase.HBaseIOException: rit=OPEN,
> location=c4-hadoop-tst-st99.bj,52900,1537522783781, table=hbase:meta,
> region=1588230740 is currently in transition
> at
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.preTransitCheck(AssignmentManager.java:536)
> at
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:592)
> at
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:609)
> at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1707)
> at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1622)
> at
> org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:49)
> at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:186)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at
> org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745){color}
> This is a serious problem because it often occurs when new RSs start or old
> RSs fail over. What is more, there is no effective way to bring the balance
> of the cluster back to normal.
> The solution may be simple, though. We can catch exceptions when executing a
> plan and then just skip that plan, so that a failed plan does not affect the
> later plans in the list. New calls of balance can pick up the failed and
> skipped plans.
>
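The catch-and-skip idea proposed in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patches: MovePlan, PlanExecutor and executeAll are hypothetical names, and the in-memory set stands in for the master's "region is currently in transition" check.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; none of these types are HBase classes. */
public class SkipFailedPlans {

    /** Hypothetical stand-in for a balancer plan: move one region to a server. */
    public static final class MovePlan {
        public final String region;
        public final String destination;

        public MovePlan(String region, String destination) {
            this.region = region;
            this.destination = destination;
        }
    }

    /** Hypothetical stand-in for whatever submits the move; it may throw. */
    public interface PlanExecutor {
        void submit(MovePlan plan) throws IOException;
    }

    /**
     * Submits every plan in order. A failure is caught per plan, so one
     * failing plan does not abort the plans that come after it.
     * Returns the plans that were skipped because they failed.
     */
    public static List<MovePlan> executeAll(List<MovePlan> plans, PlanExecutor executor) {
        List<MovePlan> skipped = new ArrayList<>();
        for (MovePlan plan : plans) {
            try {
                executor.submit(plan);
            } catch (IOException e) {
                // Remember the failed plan and continue with the next one.
                skipped.add(plan);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // Assumption: a region that already has a pending move rejects a second one.
        Set<String> inTransition = new HashSet<>();
        PlanExecutor executor = plan -> {
            if (!inTransition.add(plan.region)) {
                throw new IOException(plan.region + " is currently in transition");
            }
        };

        List<MovePlan> plans = Arrays.asList(
            new MovePlan("r1", "serverA"),
            new MovePlan("r1", "serverB"),   // duplicate plan for the same region
            new MovePlan("r2", "serverC"));

        List<MovePlan> skipped = executeAll(plans, executor);
        System.out.println("skipped=" + skipped.size());
    }
}
```

In this sketch the second plan for r1 is skipped while the plan for r2 is still submitted; a later balance call would be expected to regenerate whatever is still needed.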