[ 
https://issues.apache.org/jira/browse/HBASE-21288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21288:
-------------------------------
    Attachment: HBASE-21288.branch-2.0.001.patch

> HostingServer in UnassignProcedure is not accurate
> --------------------------------------------------
>
>                 Key: HBASE-21288
>                 URL: https://issues.apache.org/jira/browse/HBASE-21288
>             Project: HBase
>          Issue Type: Sub-task
>          Components: amv2, Balancer
>    Affects Versions: 2.1.0, 2.0.2
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>         Attachments: HBASE-21288.branch-2.0.001.patch
>
>
> We hit a case where a region shows state OPEN on an already dead server in 
> the meta table (it is hard to trace how this happened), meaning the region is 
> actually not online. The balancer then scheduled a MoveRegionProcedure (MRP) 
> for this region, which created a mess:
> The balancer 'thought' the region was on the live server that has the same 
> address (but a different startcode). So it scheduled an MRP from this online 
> server to another, but the UnassignProcedure dispatched the unassign call to 
> the dead server according to the region state, found that server dead, and 
> scheduled a ServerCrashProcedure (SCP) for it. But since the 
> UnassignProcedure's hostingServer is not accurate, the SCP can't interrupt it.
> So, in the end, the SCP can't finish because the UnassignProcedure holds the 
> region's lock, and the UnassignProcedure can't finish because no one wakes 
> it, so both are stuck.
> Here is the log. Notice that the server of the UnassignProcedure is 
> 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584' but the call was 
> dispatched to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964'
> {code}
> 2018-10-10 14:34:50,011 INFO  [PEWorker-4] 
> assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
> 2018-10-10 14:34:50,011 WARN  [PEWorker-4] 
> assignment.RegionTransitionProcedure(230): Remote call failed 
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; 
> exception=NoServerDispatchException
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584
> //Then an SCP was scheduled
> 2018-10-10 14:34:50,012 WARN  [PEWorker-4] master.ServerManager(635): 
> Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but 
> server not online
> 2018-10-10 14:34:50,012 INFO  [PEWorker-4] master.ServerManager(615): 
> Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. 
> ,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds. 
> ,16000,1539088156164
> 2018-10-10 14:34:50,017 DEBUG [PEWorker-4] 
> procedure2.ProcedureExecutor(1089): Stored pid=14, 
> state=RUNNABLE:SERVER_CRASH_START, hasLock=false; ServerCrashProcedure 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, 
> splitWal=true, meta=false
> //The SCP did not interrupt the UnassignProcedure but scheduled a new 
> AssignProcedure for this region
> 2018-10-10 14:34:50,043 DEBUG [PEWorker-6] 
> procedure.ServerCrashProcedure(250): Done splitting WALs pid=14, 
> state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, 
> splitWal=true, meta=false
> 2018-10-10 14:34:50,054 INFO  [PEWorker-8] 
> procedure2.ProcedureExecutor(1691): Initialized subprocedures=[{pid=15, 
> ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; 
> AssignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f}, 
> {pid=16, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; 
> AssignProcedure table=hbase:req_intercept_rule, 
> region=460481706415d776b3742f428a6f579b}, {pid=17, ppid=14, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure 
> table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]
> {code}
> I also added a safety fence in the balancer: if such regions are found, 
> balancing is skipped to be safe. It should do no harm.
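
To illustrate the startcode mismatch and the balancer fence described above, here is a minimal, self-contained sketch. It is not the code from the attached patch and does not use HBase classes: ServerId, sameAddress, and safeToBalance are hypothetical names invented for this example, and the host name is a placeholder. Only the two startcodes and the region name are taken from the log.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

final class BalancerFenceSketch {

    /** Hypothetical stand-in for a server identity (not HBase's own class). */
    static final class ServerId {
        final String host;
        final int port;
        final long startCode;

        ServerId(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }

        /** True when host and port match, regardless of startcode. */
        boolean sameAddress(ServerId other) {
            return host.equals(other.host) && port == other.port;
        }

        /** Two identities are the same instance only if the startcode matches too. */
        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof ServerId)) {
                return false;
            }
            ServerId s = (ServerId) o;
            return port == s.port && startCode == s.startCode && host.equals(s.host);
        }

        @Override
        public int hashCode() {
            return Objects.hash(host, port, startCode);
        }
    }

    /**
     * Assumed fence: balancing is considered safe only if every region's
     * recorded location is an exact (host, port, startcode) match for an
     * online server.
     */
    static boolean safeToBalance(Map<String, ServerId> regionLocations,
                                 Set<ServerId> onlineServers) {
        for (ServerId location : regionLocations.values()) {
            if (!onlineServers.contains(location)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Same address, different startcodes (startcodes taken from the log).
        ServerId live = new ServerId("rs-003.example", 16020, 1539153278584L);
        ServerId dead = new ServerId("rs-003.example", 16020, 1539076734964L);

        // The region is still recorded as OPEN on the dead instance.
        Map<String, ServerId> regionLocations = new LinkedHashMap<>();
        regionLocations.put("267335c85766c62479fb4a5f18a1e95f", dead);

        System.out.println("same address: " + live.sameAddress(dead));
        System.out.println("same instance: " + live.equals(dead));
        System.out.println("safe to balance: "
            + safeToBalance(regionLocations, Set.of(live)));
    }
}
```

In this sketch the two identities share an address but are not equal, so the fence reports that balancing is not safe. Comparing by address alone is the kind of check that would let the stale location pass as the live server.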



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
