Allan Yang created HBASE-21288:
----------------------------------
Summary: HostingServer in UnassignProcedure is not accurate
Key: HBASE-21288
URL: https://issues.apache.org/jira/browse/HBASE-21288
Project: HBase
Issue Type: Sub-task
Components: amv2, Balancer
Affects Versions: 2.0.2, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang
We hit a case where a region shows state OPEN in the meta table on a server
that is already dead (it is hard to trace how this happened), meaning the
region is actually not online. The balancer then scheduled a
MoveRegionProcedure (MRP) for this region, which created a mess:
The balancer 'thought' this region was on the live server that has the same
address (but a different startcode). So it scheduled an MRP from this online
server to another one, but the UnassignProcedure dispatched the unassign call
to the dead server according to the region state, then found the server dead
and scheduled a ServerCrashProcedure (SCP) for it. But since the
UnassignProcedure's hostingServer is not accurate, the SCP can't interrupt it.
So, in the end, the SCP can't finish because the UnassignProcedure holds the
region's lock, and the UnassignProcedure can't finish because no one wakes it,
so both are stuck.
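The mismatch can be illustrated with a minimal sketch in plain Java, with no
HBase dependency. ServerId, PendingUnassign, StaleServerDemo and interruptFor
are hypothetical names used only for illustration, not HBase classes, and the
host name is a placeholder; only the two startcodes are taken from the log in
this report. It assumes the crash handling wakes procedures by comparing their
hostingServer with the crashed server, as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical stand-in for a server identity: host, port and startcode. */
final class ServerId {
  final String host;
  final int port;
  final long startcode;

  ServerId(String host, int port, long startcode) {
    this.host = host;
    this.port = port;
    this.startcode = startcode;
  }

  /** True when host and port match, ignoring the startcode. */
  boolean sameAddress(ServerId other) {
    return host.equals(other.host) && port == other.port;
  }

  /** Full identity: host, port and startcode must all match. */
  @Override
  public boolean equals(Object o) {
    if (!(o instanceof ServerId)) {
      return false;
    }
    ServerId s = (ServerId) o;
    return sameAddress(s) && startcode == s.startcode;
  }

  @Override
  public int hashCode() {
    return Objects.hash(host, port, startcode);
  }
}

/** Hypothetical stand-in for an in-flight unassign. */
final class PendingUnassign {
  final ServerId hostingServer;  // server the procedure believes hosts the region
  final ServerId dispatchTarget; // server the close call was actually sent to
  boolean interrupted = false;

  PendingUnassign(ServerId hostingServer, ServerId dispatchTarget) {
    this.hostingServer = hostingServer;
    this.dispatchTarget = dispatchTarget;
  }
}

final class StaleServerDemo {
  /** Wakes only the procedures whose hostingServer equals the crashed server. */
  static int interruptFor(ServerId crashed, List<PendingUnassign> pending) {
    int woken = 0;
    for (PendingUnassign p : pending) {
      if (p.hostingServer.equals(crashed)) {
        p.interrupted = true;
        woken++;
      }
    }
    return woken;
  }

  public static void main(String[] args) {
    // Same host and port, different startcode: two distinct server instances.
    ServerId dead = new ServerId("rs-003.example", 16020, 1539076734964L);
    ServerId live = new ServerId("rs-003.example", 16020, 1539153278584L);

    List<PendingUnassign> pending = new ArrayList<>();
    pending.add(new PendingUnassign(live, dead));

    System.out.println(live.sameAddress(dead));              // true
    System.out.println(live.equals(dead));                   // false
    System.out.println(interruptFor(dead, pending));         // 0
  }
}
```

In this sketch the crash handling for the dead instance wakes nothing, because
the procedure is recorded against the live instance even though the call went
to the dead one.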
Here is the log. Notice that the server of the UnassignProcedure is
'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584' but it was dispatched
to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964'
{code}
2018-10-10 14:34:50,011 INFO [PEWorker-4]
assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12,
state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure
table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f,
server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING,
location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
2018-10-10 14:34:50,011 WARN [PEWorker-4]
assignment.RegionTransitionProcedure(230): Remote call failed
hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12,
state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure
table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f,
server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING,
location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964;
exception=NoServerDispatchException
org.apache.hadoop.hbase.procedure2.NoServerDispatchException:
hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12,
state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure
table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f,
server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584
//Then a SCP was scheduled
2018-10-10 14:34:50,012 WARN [PEWorker-4] master.ServerManager(635):
Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but
server not online
2018-10-10 14:34:50,012 INFO [PEWorker-4] master.ServerManager(615):
Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds.
,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds. ,16000,1539088156164
2018-10-10 14:34:50,017 DEBUG [PEWorker-4] procedure2.ProcedureExecutor(1089):
Stored pid=14, state=RUNNABLE:SERVER_CRASH_START, hasLock=false;
ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds.
,16020,1539076734964, splitWal=true, meta=false
//The SCP did not interrupt the UnassignProcedure but scheduled a new
//AssignProcedure for this region
2018-10-10 14:34:50,043 DEBUG [PEWorker-6] procedure.ServerCrashProcedure(250):
Done splitting WALs pid=14, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS,
hasLock=true; ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds.
,16020,1539076734964, splitWal=true, meta=false
2018-10-10 14:34:50,054 INFO [PEWorker-8] procedure2.ProcedureExecutor(1691):
Initialized subprocedures=[{pid=15, ppid=14,
state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure
table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f}, {pid=16, ppid=14,
state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure
table=hbase:req_intercept_rule, region=460481706415d776b3742f428a6f579b},
{pid=17, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false;
AssignProcedure table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]
{code}
I also added a safety fence in the balancer: if such regions are found,
balancing is skipped to be safe. It should do no harm.
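A rough sketch of what such a fence could look like, not the actual patch.
BalancerFence and safeToBalance are hypothetical names, the host names are
placeholders, and servers are represented as "host,port,startcode" strings. It
assumes the fence skips balancing whenever any region's recorded location is
not an exact match (startcode included) for an online server.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class BalancerFence {
  /**
   * Returns true only if every region's recorded location is an online
   * server. Matching is on the full "host,port,startcode" string, so a
   * restarted server with the same address does not count.
   */
  static boolean safeToBalance(Map<String, String> regionLocations,
                               Set<String> onlineServers) {
    for (Map.Entry<String, String> e : regionLocations.entrySet()) {
      if (!onlineServers.contains(e.getValue())) {
        return false; // region recorded on a server that is not online
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Set<String> online = new HashSet<>();
    online.add("rs-003.example,16020,1539153278584");

    Map<String, String> locations = new HashMap<>();
    // Region still recorded on the old, dead instance of the same address.
    locations.put("267335c85766c62479fb4a5f18a1e95f",
        "rs-003.example,16020,1539076734964");
    System.out.println(safeToBalance(locations, online)); // false

    // Once the location points at the live instance, balancing may proceed.
    locations.put("267335c85766c62479fb4a5f18a1e95f",
        "rs-003.example,16020,1539153278584");
    System.out.println(safeToBalance(locations, online)); // true
  }
}
```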
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)