[ https://issues.apache.org/jira/browse/HBASE-21307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648588#comment-16648588 ]
stack commented on HBASE-21307:
-------------------------------

Eventually the rollback fails as follows, still complaining that the region is owned by another procedure:

{code}
rocedureAbortedException: f5f9ff1e4b0f2d9555dabfcca71df568 owned by pid=411982, CANNOT run 'this' (pid=412210).; ServerCrashProcedure server=va1002.halxg.cloudera.com,22101,1539237389315, splitWal=true, meta=false
java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_HANDLE_RIT2
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:262)
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:59)
        at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
        at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:970)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1618)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1580)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1451)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2022)
2018-10-12 16:33:09,150 DEBUG org.apache.hadoop.hbase.procedure2.ProcedureExecutor: pid=412210, ppid=411983, state=ROLLEDBACK, exception=org.apache.hadoop.hbase.procedure2.ProcedureAbortedException via AssignProcedure:org.apache.hadoop.hbase.procedure2.ProcedureAbortedException: f5f9ff1e4b0f2d9555dabfcca71df568 owned by pid=411982, CANNOT run 'this' (pid=412210).; AssignProcedure table=IntegrationTestBigLinkedList_20180709093726, region=f5f9ff1e4b0f2d9555dabfcca71df568 is already finished, skipping execution
2018-10-12 16:35:10,099 DEBUG org.apache.hadoop.hbase.regionserver.ChunkCreator: data stats (chunk size=2097152): current pool size=0, created chunk count=0, reused chunk count=0, reuseRatio=0
{code}

This holds up the general assign until it's unblocked.

> [amv2] Deadlock when we move a Region from a not-online RegionServer
> --------------------------------------------------------------------
>
>                 Key: HBASE-21307
>                 URL: https://issues.apache.org/jira/browse/HBASE-21307
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 2.1.1
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.1.1
>
>
> Perhaps this doesn't happen in branch-2, but it's a problem in branch-2.1.
> At a high level: we go to move a region, and its unassign subprocedure fails
> its dispatch because the server is not online, so it queues a
> ServerCrashProcedure (SCP) and waits on it to break the RPC. The SCP can't
> run, though, because the MoveRegionProcedure (MRP) holds the lock on the
> region.
> I can bypass the MRP, but then the SCP fails because the Region is 'owned'
> by the MRP. See below:
> {code}
> 2018-10-12 16:29:53,423 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Begin bypass pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726, region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, server=va1002.halxg.cloudera.com,22101,1539368318649 with lockWait=0, override=true, recursive=true
> 2018-10-12 16:29:53,424 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726, region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, server=va1002.halxg.cloudera.com,22101,1539368318649
> 2018-10-12 16:29:53,712 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411981, state=WAITING:MOVE_REGION_ASSIGN, locked=true; MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568, source=va1002.halxg.cloudera.com,22101,1539368318649, destination=vd1021.halxg.cloudera.com,22101,1539368317897
> 2018-10-12 16:29:53,838 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true, bypass=LOG-REDACTED UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726, region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, server=va1002.halxg.cloudera.com,22101,1539368318649 and its ancestors successfully, adding to queue
> 2018-10-12 16:29:53,839 INFO org.apache.hadoop.hbase.procedure2.Procedure: pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true, bypass=LOG-REDACTED UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726, region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, server=va1002.halxg.cloudera.com,22101,1539368318649 bypassed, returning null to finish it
> 2018-10-12 16:29:53,954 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished subprocedure pid=411982, resume processing parent pid=411981, state=RUNNABLE:MOVE_REGION_ASSIGN, locked=true, bypass=LOG-REDACTED MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568, source=va1002.halxg.cloudera.com,22101,1539368318649, destination=vd1021.halxg.cloudera.com,22101,1539368317897
> 2018-10-12 16:29:53,954 INFO org.apache.hadoop.hbase.procedure2.Procedure: pid=411981, state=RUNNABLE:MOVE_REGION_ASSIGN, locked=true, bypass=LOG-REDACTED MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568, source=va1002.halxg.cloudera.com,22101,1539368318649, destination=vd1021.halxg.cloudera.com,22101,1539368317897 bypassed, returning null to finish it
> 2018-10-12 16:29:53,956 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=411982, ppid=411981, state=SUCCESS, bypass=LOG-REDACTED UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726, region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, server=va1002.halxg.cloudera.com,22101,1539368318649 in 3hrs, 49mins, 12.419sec, unfinishedSiblingCount=0
> 2018-10-12 16:29:54,058 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=411981, state=SUCCESS, bypass=LOG-REDACTED MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568, source=va1002.halxg.cloudera.com,22101,1539368318649, destination=vd1021.halxg.cloudera.com,22101,1539368317897 in 3hrs, 49mins, 12.878sec
> 2018-10-12 16:29:54,059 INFO org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: xlock for pid=412210, ppid=411983, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=IntegrationTestBigLinkedList_20180709093726, region=f5f9ff1e4b0f2d9555dabfcca71df568
> 2018-10-12 16:29:54,105 WARN org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: f5f9ff1e4b0f2d9555dabfcca71df568 owned by pid=411982, CANNOT run 'this' (pid=412210).
> ....
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
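
For readers following the issue description, the blocking chain it reports (the MoveRegionProcedure waits on its unassign subprocedure, the unassign waits on a ServerCrashProcedure, and the ServerCrashProcedure needs the region lock that the MoveRegionProcedure holds) can be modelled as a wait-for graph with a cycle. The sketch below is a toy model only and is not HBase code: `WaitForSketch`, `blockedOn`, and `deadlocked` are hypothetical names invented for illustration, and the procedure names are used as plain string labels.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy wait-for graph. Hypothetical illustration; none of this is HBase API. */
public class WaitForSketch {
    // Maps a procedure label to the label of the procedure it is blocked on.
    private final Map<String, String> waitsOn = new HashMap<>();

    /** Records that 'waiter' cannot make progress until 'holder' does. */
    public void blockedOn(String waiter, String holder) {
        waitsOn.put(waiter, holder);
    }

    /** Follows the wait-for chain from 'start'; a revisited node means a cycle. */
    public boolean deadlocked(String start) {
        Set<String> seen = new HashSet<>();
        String current = start;
        while (current != null) {
            if (!seen.add(current)) {
                return true; // already visited: the chain loops back on itself
            }
            current = waitsOn.get(current);
        }
        return false; // chain ended at a procedure that is not blocked
    }

    public static void main(String[] args) {
        WaitForSketch graph = new WaitForSketch();
        // Parent move waits for its unassign subprocedure to finish.
        graph.blockedOn("MoveRegionProcedure", "UnassignProcedure");
        // Unassign queued a server crash procedure and waits on it.
        graph.blockedOn("UnassignProcedure", "ServerCrashProcedure");
        // The crash procedure needs the region lock that the move still holds.
        graph.blockedOn("ServerCrashProcedure", "MoveRegionProcedure");
        System.out.println(graph.deadlocked("MoveRegionProcedure")); // prints "true"
    }
}
```

Removing any one edge (for example, if the move released the region lock while waiting) breaks the cycle in this model; whether that is the right fix in HBase itself is not something this sketch can say.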