[
https://issues.apache.org/jira/browse/HBASE-21288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645999#comment-16645999
]
Allan Yang commented on HBASE-21288:
------------------------------------
{quote}
should we return false directly from here?
{quote}
I want to print all the abnormal regions here.
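
To illustrate the difference between returning false on the first hit and reporting every abnormal region, here is a minimal sketch. It is not the code in the attached patch: the class name, method names, and the plain "host,port,startcode" strings are hypothetical stand-ins for the real HBase types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a balancer pre-check; not actual HBase code.
final class BalancerFenceSketch {

  // Returns every region whose recorded location is not a currently online server.
  // Servers are identified by the full "host,port,startcode" string, so an
  // instance with the same address but a different startcode does not match.
  static List<String> findAbnormalRegions(Map<String, String> regionToLocation,
                                          Set<String> onlineServers) {
    List<String> abnormal = new ArrayList<>();
    for (Map.Entry<String, String> e : regionToLocation.entrySet()) {
      if (!onlineServers.contains(e.getValue())) {
        // Keep scanning instead of returning on the first hit, so that
        // all abnormal regions can be reported together.
        abnormal.add(e.getKey());
      }
    }
    return abnormal;
  }

  // Logs every abnormal region, then reports whether balancing may proceed.
  static boolean safeToBalance(Map<String, String> regionToLocation,
                               Set<String> onlineServers) {
    List<String> abnormal = findAbnormalRegions(regionToLocation, onlineServers);
    for (String region : abnormal) {
      System.err.println("Region " + region + " is OPEN on offline server "
          + regionToLocation.get(region) + "; skipping balance");
    }
    return abnormal.isEmpty();
  }
}
```

With this shape, the operator sees the full list of bad regions in one balancer run instead of fixing them one at a time.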
> HostingServer in UnassignProcedure is not accurate
> --------------------------------------------------
>
> Key: HBASE-21288
> URL: https://issues.apache.org/jira/browse/HBASE-21288
> Project: HBase
> Issue Type: Sub-task
> Components: amv2, Balancer
> Affects Versions: 2.1.0, 2.0.2
> Reporter: Allan Yang
> Assignee: Allan Yang
> Priority: Major
> Attachments: HBASE-21288.branch-2.0.001.patch
>
>
> We have a case where a region shows status OPEN on an already dead server in
> the meta table (it is hard to trace how this happened), meaning the region is
> actually not online. The balancer then scheduled a MoveRegionProcedure (MRP)
> for this region, which created a mess:
> The balancer 'thought' this region was on the server which has the same
> address (but a different startcode). So it scheduled an MRP from this online
> server to another, but the UnassignProcedure dispatched the unassign call to
> the dead server according to the region state, found the server dead, and
> scheduled a ServerCrashProcedure (SCP) for the dead server. But since the
> UnassignProcedure's hostingServer is not accurate, the SCP can't interrupt it.
> So, in the end, the SCP can't finish since the UnassignProcedure holds the
> region's lock, and the UnassignProcedure can't finish since no one wakes it,
> so both are stuck.
> Here is the log. Notice that the server of the UnassignProcedure is
> 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584' but it was
> dispatched to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964':
> {code}
> 2018-10-10 14:34:50,011 INFO [PEWorker-4]
> assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f,
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING,
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
> 2018-10-10 14:34:50,011 WARN [PEWorker-4]
> assignment.RegionTransitionProcedure(230): Remote call failed
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f,
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING,
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964;
> exception=NoServerDispatchException
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException:
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f,
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584
> // Then an SCP was scheduled
> 2018-10-10 14:34:50,012 WARN [PEWorker-4] master.ServerManager(635):
> Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but
> server not online
> 2018-10-10 14:34:50,012 INFO [PEWorker-4] master.ServerManager(615):
> Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds.
> ,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds.
> ,16000,1539088156164
> 2018-10-10 14:34:50,017 DEBUG [PEWorker-4]
> procedure2.ProcedureExecutor(1089): Stored pid=14,
> state=RUNNABLE:SERVER_CRASH_START, hasLock=false; ServerCrashProcedure
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964,
> splitWal=true, meta=false
> // The SCP did not interrupt the UnassignProcedure but scheduled a new
> // AssignProcedure for this region
> 2018-10-10 14:34:50,043 DEBUG [PEWorker-6]
> procedure.ServerCrashProcedure(250): Done splitting WALs pid=14,
> state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964,
> splitWal=true, meta=false
> 2018-10-10 14:34:50,054 INFO [PEWorker-8]
> procedure2.ProcedureExecutor(1691): Initialized subprocedures=[{pid=15,
> ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false;
> AssignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f},
> {pid=16, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false;
> AssignProcedure table=hbase:req_intercept_rule,
> region=460481706415d776b3742f428a6f579b}, {pid=17, ppid=14,
> state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure
> table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]
> {code}
> I also added a safety fence in the balancer: if such regions are found,
> balancing is skipped to be safe. It should do no harm.
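
The root of the mismatch described above is that two server instances can share a host and port while differing in startcode. The sketch below models that with a hypothetical `ServerIdSketch` class (an assumption for illustration, not the HBase `ServerName` class) and a placeholder hostname; only the two startcodes are taken from the log.

```java
import java.util.Objects;

// Hypothetical server identity: host, port, and the startcode of one process instance.
final class ServerIdSketch {
  final String host;
  final int port;
  final long startcode;

  ServerIdSketch(String host, int port, long startcode) {
    this.host = host;
    this.port = port;
    this.startcode = startcode;
  }

  // True when both ids point at the same host and port, ignoring the startcode.
  // Matching on this alone can confuse a dead instance with its restarted successor.
  boolean sameAddress(ServerIdSketch other) {
    return host.equals(other.host) && port == other.port;
  }

  // Full identity: the startcode must match as well.
  @Override
  public boolean equals(Object o) {
    if (!(o instanceof ServerIdSketch)) {
      return false;
    }
    ServerIdSketch other = (ServerIdSketch) o;
    return sameAddress(other) && startcode == other.startcode;
  }

  @Override
  public int hashCode() {
    return Objects.hash(host, port, startcode);
  }
}
```

For example, ids built with startcodes 1539076734964 (the dead instance) and 1539153278584 (the online one) on the same host and port satisfy `sameAddress` but are not `equals`, which is the distinction a check on the hosting server has to respect.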