[ 
https://issues.apache.org/jira/browse/HBASE-21307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648588#comment-16648588
 ] 

stack commented on HBASE-21307:
-------------------------------

Eventually the rollback fails as follows, still complaining that the region is 
owned by another procedure:

{code}
ProcedureAbortedException: f5f9ff1e4b0f2d9555dabfcca71df568 owned by pid=411982, 
CANNOT run 'this' (pid=412210).; ServerCrashProcedure 
server=va1002.halxg.cloudera.com,22101,1539237389315, splitWal=true, meta=false
java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_HANDLE_RIT2
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:262)
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:59)
        at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
        at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:970)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1618)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1580)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1451)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2022)
2018-10-12 16:33:09,150 DEBUG 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: pid=412210, ppid=411983, 
state=ROLLEDBACK, 
exception=org.apache.hadoop.hbase.procedure2.ProcedureAbortedException via 
AssignProcedure:org.apache.hadoop.hbase.procedure2.ProcedureAbortedException: 
f5f9ff1e4b0f2d9555dabfcca71df568 owned by pid=411982, CANNOT run 'this' 
(pid=412210).; AssignProcedure 
table=IntegrationTestBigLinkedList_20180709093726, 
region=f5f9ff1e4b0f2d9555dabfcca71df568 is already finished, skipping execution
2018-10-12 16:35:10,099 DEBUG 
org.apache.hadoop.hbase.regionserver.ChunkCreator: data stats (chunk 
size=2097152): current pool size=0, created chunk count=0, reused chunk 
count=0, reuseRatio=0
{code}

This holds up the general assign until it's unblocked.
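
To make the cycle from the issue description concrete, here is a minimal sketch of it as a wait-for graph. This is not HBase code: the class WaitForCycleSketch and its methods are hypothetical, and the three edges are only my reading of the description (the unassign waits on the SCP, the SCP needs the region lock held by the MRP, and the MRP waits on its unassign subprocedure).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model, not HBase code: each procedure is blocked on at most one other.
final class WaitForCycleSketch {
    // waiter -> the procedure it is blocked on
    private final Map<String, String> waitsOn = new LinkedHashMap<>();

    void addWait(String waiter, String blocker) {
        waitsOn.put(waiter, blocker);
    }

    // Follows wait-for edges from start; returns the cycle in order, or an empty list.
    List<String> findCycle(String start) {
        List<String> path = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        String current = start;
        while (current != null && seen.add(current)) {
            path.add(current);
            current = waitsOn.get(current);
        }
        if (current == null) {
            return new ArrayList<>(); // chain ended: no deadlock reachable from start
        }
        // current is the first repeated node, so the cycle begins there
        return new ArrayList<>(path.subList(path.indexOf(current), path.size()));
    }

    // Edges assumed from the issue description.
    static WaitForCycleSketch reportedScenario() {
        WaitForCycleSketch g = new WaitForCycleSketch();
        // Unassign failed its dispatch, queued an SCP, and waits on it to break the RPC.
        g.addWait("UnassignProcedure", "ServerCrashProcedure");
        // The SCP cannot run because the MRP holds the lock on the region.
        g.addWait("ServerCrashProcedure", "MoveRegionProcedure");
        // The MRP is a parent waiting for its unassign subprocedure to finish.
        g.addWait("MoveRegionProcedure", "UnassignProcedure");
        return g;
    }

    public static void main(String[] args) {
        System.out.println(reportedScenario().findCycle("UnassignProcedure"));
    }
}
```

Each procedure in the returned list is blocked on the next one, and the last is blocked on the first, so none of the three can make progress on its own.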

> [amv2] Deadlock when we move a Region from a not-online RegionServer
> --------------------------------------------------------------------
>
>                 Key: HBASE-21307
>                 URL: https://issues.apache.org/jira/browse/HBASE-21307
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 2.1.1
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.1.1
>
>
> Perhaps this doesn't happen in branch-2, but it's a problem in branch-2.1.
> At a high level: we go to move a region, and its unassign subprocedure fails 
> its dispatch because the server is not online, so it queues an SCP 
> (ServerCrashProcedure) and waits on it to break the RPC. The SCP can't run, 
> though, because the MRP (MoveRegionProcedure) holds the lock on the region.
> I can bypass the MRP, but then the SCP fails because the region is 'owned' by 
> the MRP. See below:
> {code}
> 2018-10-12 16:29:53,423 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Begin bypass 
> pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, 
> locked=true; UnassignProcedure 
> table=IntegrationTestBigLinkedList_20180709093726, 
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, 
> server=va1002.halxg.cloudera.com,22101,1539368318649 with lockWait=0, 
> override=true, recursive=true
> 2018-10-12 16:29:53,424 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411982, 
> ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; 
> UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726, 
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, 
> server=va1002.halxg.cloudera.com,22101,1539368318649
> 2018-10-12 16:29:53,712 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411981, 
> state=WAITING:MOVE_REGION_ASSIGN, locked=true; MoveRegionProcedure 
> hri=f5f9ff1e4b0f2d9555dabfcca71df568, 
> source=va1002.halxg.cloudera.com,22101,1539368318649, 
> destination=vd1021.halxg.cloudera.com,22101,1539368317897
> 2018-10-12 16:29:53,838 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411982, 
> ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true, 
> bypass=LOG-REDACTED UnassignProcedure 
> table=IntegrationTestBigLinkedList_20180709093726, 
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, 
> server=va1002.halxg.cloudera.com,22101,1539368318649 and its ancestors 
> successfully, adding to queue
> 2018-10-12 16:29:53,839 INFO org.apache.hadoop.hbase.procedure2.Procedure: 
> pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, 
> locked=true, bypass=LOG-REDACTED UnassignProcedure 
> table=IntegrationTestBigLinkedList_20180709093726, 
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, 
> server=va1002.halxg.cloudera.com,22101,1539368318649 bypassed, returning null 
> to finish it
> 2018-10-12 16:29:53,954 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished subprocedure 
> pid=411982, resume processing parent pid=411981, 
> state=RUNNABLE:MOVE_REGION_ASSIGN, locked=true, bypass=LOG-REDACTED 
> MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568, 
> source=va1002.halxg.cloudera.com,22101,1539368318649, 
> destination=vd1021.halxg.cloudera.com,22101,1539368317897
> 2018-10-12 16:29:53,954 INFO org.apache.hadoop.hbase.procedure2.Procedure: 
> pid=411981, state=RUNNABLE:MOVE_REGION_ASSIGN, locked=true, 
> bypass=LOG-REDACTED MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568, 
> source=va1002.halxg.cloudera.com,22101,1539368318649, 
> destination=vd1021.halxg.cloudera.com,22101,1539368317897 bypassed, returning 
> null to finish it
> 2018-10-12 16:29:53,956 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=411982, 
> ppid=411981, state=SUCCESS, bypass=LOG-REDACTED UnassignProcedure 
> table=IntegrationTestBigLinkedList_20180709093726, 
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, 
> server=va1002.halxg.cloudera.com,22101,1539368318649 in 3hrs, 49mins, 
> 12.419sec, unfinishedSiblingCount=0
> 2018-10-12 16:29:54,058 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=411981, 
> state=SUCCESS, bypass=LOG-REDACTED MoveRegionProcedure 
> hri=f5f9ff1e4b0f2d9555dabfcca71df568, 
> source=va1002.halxg.cloudera.com,22101,1539368318649, 
> destination=vd1021.halxg.cloudera.com,22101,1539368317897 in 3hrs, 49mins, 
> 12.878sec
> 2018-10-12 16:29:54,059 INFO 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: xlock for 
> pid=412210, ppid=411983, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=IntegrationTestBigLinkedList_20180709093726, 
> region=f5f9ff1e4b0f2d9555dabfcca71df568
> 2018-10-12 16:29:54,105 WARN 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: 
> f5f9ff1e4b0f2d9555dabfcca71df568 owned by pid=411982, CANNOT run 'this' 
> (pid=412210).
> ....
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
