stack created HBASE-21307:
-----------------------------

             Summary: [amv2] Deadlock when we move a Region from a not-online RegionServer
                 Key: HBASE-21307
                 URL: https://issues.apache.org/jira/browse/HBASE-21307
             Project: HBase
          Issue Type: Bug
          Components: amv2
    Affects Versions: 2.1.1
            Reporter: stack
            Assignee: stack
             Fix For: 2.1.1


Perhaps this doesn't happen in branch-2, but it is a problem in branch-2.1.

At a high level: we go to move a region, and its unassign subprocedure fails 
to dispatch because the server is not online. It then queues a 
ServerCrashProcedure (SCP) and waits on it to break the RPC. The SCP can't run, 
though, because the MoveRegionProcedure (MRP) holds the lock on the region.
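The circular wait described above can be sketched as a tiny wait-for graph. 
This is only an illustration, not HBase code: the class WaitForGraph and its 
methods are hypothetical names, and the edges are my reading of the 
description (the unassign waits on the SCP, the SCP waits on the region lock 
held by the MRP, and the MRP waits on its unassign child).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy wait-for graph (hypothetical, not part of HBase). */
final class WaitForGraph {
    // Simplified model: each node waits on at most one other node.
    private final Map<String, String> waitsOn = new HashMap<>();

    /** Records that 'waiter' cannot make progress until 'blocker' does. */
    void addWait(String waiter, String blocker) {
        waitsOn.put(waiter, blocker);
    }

    /**
     * Follows wait edges from 'start'. Returns the nodes forming a cycle if
     * the walk comes back to a node it already visited, else an empty list.
     */
    List<String> findCycle(String start) {
        List<String> path = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        String current = start;
        while (current != null && seen.add(current)) {
            path.add(current);
            current = waitsOn.get(current);
        }
        if (current == null) {
            return new ArrayList<>(); // the chain ended: no deadlock
        }
        // 'current' is the first repeated node; the cycle starts there.
        return new ArrayList<>(path.subList(path.indexOf(current), path.size()));
    }

    public static void main(String[] args) {
        WaitForGraph g = new WaitForGraph();
        // Edges as assumed from the issue description.
        g.addWait("UnassignProcedure", "ServerCrashProcedure");   // waits for SCP to break the RPC
        g.addWait("ServerCrashProcedure", "MoveRegionProcedure"); // needs the region lock MRP holds
        g.addWait("MoveRegionProcedure", "UnassignProcedure");    // parent waits on its child
        System.out.println(g.findCycle("UnassignProcedure"));
    }
}
```

Every procedure in the returned list is waiting on the next one, and the last 
is waiting on the first, so none of them can make progress without outside 
intervention.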

I can bypass the MRP, but then the SCP fails because the Region is still 
'owned' by the MRP. See below:

{code}
2018-10-12 16:29:53,423 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Begin bypass pid=411982, 
ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; 
UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726, 
region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, 
server=va1002.halxg.cloudera.com,22101,1539368318649 with lockWait=0, 
override=true, recursive=true
2018-10-12 16:29:53,424 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411982, 
ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; 
UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726, 
region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, 
server=va1002.halxg.cloudera.com,22101,1539368318649
2018-10-12 16:29:53,712 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411981, 
state=WAITING:MOVE_REGION_ASSIGN, locked=true; MoveRegionProcedure 
hri=f5f9ff1e4b0f2d9555dabfcca71df568, 
source=va1002.halxg.cloudera.com,22101,1539368318649, 
destination=vd1021.halxg.cloudera.com,22101,1539368317897
2018-10-12 16:29:53,838 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411982, 
ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true, 
bypass=LOG-REDACTED UnassignProcedure 
table=IntegrationTestBigLinkedList_20180709093726, 
region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, 
server=va1002.halxg.cloudera.com,22101,1539368318649 and its ancestors 
successfully, adding to queue
2018-10-12 16:29:53,839 INFO org.apache.hadoop.hbase.procedure2.Procedure: 
pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, 
locked=true, bypass=LOG-REDACTED UnassignProcedure 
table=IntegrationTestBigLinkedList_20180709093726, 
region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, 
server=va1002.halxg.cloudera.com,22101,1539368318649 bypassed, returning null 
to finish it
2018-10-12 16:29:53,954 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished subprocedure 
pid=411982, resume processing parent pid=411981, 
state=RUNNABLE:MOVE_REGION_ASSIGN, locked=true, bypass=LOG-REDACTED 
MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568, 
source=va1002.halxg.cloudera.com,22101,1539368318649, 
destination=vd1021.halxg.cloudera.com,22101,1539368317897
2018-10-12 16:29:53,954 INFO org.apache.hadoop.hbase.procedure2.Procedure: 
pid=411981, state=RUNNABLE:MOVE_REGION_ASSIGN, locked=true, bypass=LOG-REDACTED 
MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568, 
source=va1002.halxg.cloudera.com,22101,1539368318649, 
destination=vd1021.halxg.cloudera.com,22101,1539368317897 bypassed, returning 
null to finish it
2018-10-12 16:29:53,956 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=411982, 
ppid=411981, state=SUCCESS, bypass=LOG-REDACTED UnassignProcedure 
table=IntegrationTestBigLinkedList_20180709093726, 
region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, 
server=va1002.halxg.cloudera.com,22101,1539368318649 in 3hrs, 49mins, 
12.419sec, unfinishedSiblingCount=0
2018-10-12 16:29:54,058 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=411981, 
state=SUCCESS, bypass=LOG-REDACTED MoveRegionProcedure 
hri=f5f9ff1e4b0f2d9555dabfcca71df568, 
source=va1002.halxg.cloudera.com,22101,1539368318649, 
destination=vd1021.halxg.cloudera.com,22101,1539368317897 in 3hrs, 49mins, 
12.878sec
2018-10-12 16:29:54,059 INFO 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: xlock for 
pid=412210, ppid=411983, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
AssignProcedure table=IntegrationTestBigLinkedList_20180709093726, 
region=f5f9ff1e4b0f2d9555dabfcca71df568
2018-10-12 16:29:54,105 WARN 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: 
f5f9ff1e4b0f2d9555dabfcca71df568 owned by pid=411982, CANNOT run 'this' 
(pid=412210).
....
{code}
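The last WARN line in the log can be pictured with a toy ownership table. 
This is a guess at the shape of the problem, not the actual HBase 
implementation: RegionOwnership, tryRun, finish and bypass are hypothetical 
names, and the idea that a bypass leaves the ownership record in place is an 
assumption drawn from the log output only.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-region procedure ownership (hypothetical, not HBase's API). */
final class RegionOwnership {
    // Encoded region name -> pid of the procedure that currently owns it.
    private final Map<String, Long> ownerByRegion = new HashMap<>();

    /** Claims the region for 'pid' if unowned; returns true if 'pid' may run. */
    boolean tryRun(String region, long pid) {
        Long owner = ownerByRegion.putIfAbsent(region, pid);
        return owner == null || owner == pid;
    }

    /** Normal completion: the owner releases the region. */
    void finish(String region, long pid) {
        ownerByRegion.remove(region, pid);
    }

    /**
     * Bypass as modeled here (ASSUMPTION): the procedure is treated as done,
     * but its ownership record is left behind.
     */
    void bypass(String region, long pid) {
        // Intentionally does not touch ownerByRegion.
    }

    public static void main(String[] args) {
        RegionOwnership regions = new RegionOwnership();
        String region = "f5f9ff1e4b0f2d9555dabfcca71df568";
        regions.tryRun(region, 411982L);  // the unassign claims the region
        regions.bypass(region, 411982L);  // operator bypasses it
        // The follow-on assign is refused because the stale owner remains.
        System.out.println(regions.tryRun(region, 412210L));
    }
}
```

In this model the second procedure only gets to run once something explicitly 
clears the stale owner, which is what finish does on the normal path.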



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
