[ https://issues.apache.org/jira/browse/HBASE-21307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack resolved HBASE-21307.
---------------------------
Resolution: Duplicate
Resolving as another example of HBASE-21288. Will keep an eye out to see if the
solution to HBASE-21213 does more harm than good.
> [amv2] Deadlock when we move a Region from a not-online RegionServer
> --------------------------------------------------------------------
>
> Key: HBASE-21307
> URL: https://issues.apache.org/jira/browse/HBASE-21307
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Affects Versions: 2.1.1
> Reporter: stack
> Assignee: stack
> Priority: Critical
> Fix For: 2.1.1
>
>
> Perhaps this doesn't happen in branch-2, but it's a problem in branch-2.1.
> At a high level: we go to move a region, and its unassign subprocedure fails
> its dispatch because the server is not online, so it queues an SCP
> (ServerCrashProcedure) and waits on it to break the RPC. The SCP can't run,
> though, because the MRP (MoveRegionProcedure) holds the lock on the region.
> I can bypass the MRP, but then the SCP fails because the region is 'owned' by
> the MRP. See below:
> {code}
> 2018-10-12 16:29:53,423 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Begin bypass
> pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH,
> locked=true; UnassignProcedure
> table=IntegrationTestBigLinkedList_20180709093726,
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true,
> server=va1002.halxg.cloudera.com,22101,1539368318649 with lockWait=0,
> override=true, recursive=true
> 2018-10-12 16:29:53,424 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411982,
> ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true;
> UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726,
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true,
> server=va1002.halxg.cloudera.com,22101,1539368318649
> 2018-10-12 16:29:53,712 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411981,
> state=WAITING:MOVE_REGION_ASSIGN, locked=true; MoveRegionProcedure
> hri=f5f9ff1e4b0f2d9555dabfcca71df568,
> source=va1002.halxg.cloudera.com,22101,1539368318649,
> destination=vd1021.halxg.cloudera.com,22101,1539368317897
> 2018-10-12 16:29:53,838 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411982,
> ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true,
> bypass=LOG-REDACTED UnassignProcedure
> table=IntegrationTestBigLinkedList_20180709093726,
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true,
> server=va1002.halxg.cloudera.com,22101,1539368318649 and its ancestors
> successfully, adding to queue
> 2018-10-12 16:29:53,839 INFO org.apache.hadoop.hbase.procedure2.Procedure:
> pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH,
> locked=true, bypass=LOG-REDACTED UnassignProcedure
> table=IntegrationTestBigLinkedList_20180709093726,
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true,
> server=va1002.halxg.cloudera.com,22101,1539368318649 bypassed, returning null
> to finish it
> 2018-10-12 16:29:53,954 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished subprocedure
> pid=411982, resume processing parent pid=411981,
> state=RUNNABLE:MOVE_REGION_ASSIGN, locked=true, bypass=LOG-REDACTED
> MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568,
> source=va1002.halxg.cloudera.com,22101,1539368318649,
> destination=vd1021.halxg.cloudera.com,22101,1539368317897
> 2018-10-12 16:29:53,954 INFO org.apache.hadoop.hbase.procedure2.Procedure:
> pid=411981, state=RUNNABLE:MOVE_REGION_ASSIGN, locked=true,
> bypass=LOG-REDACTED MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568,
> source=va1002.halxg.cloudera.com,22101,1539368318649,
> destination=vd1021.halxg.cloudera.com,22101,1539368317897 bypassed, returning
> null to finish it
> 2018-10-12 16:29:53,956 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=411982,
> ppid=411981, state=SUCCESS, bypass=LOG-REDACTED UnassignProcedure
> table=IntegrationTestBigLinkedList_20180709093726,
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true,
> server=va1002.halxg.cloudera.com,22101,1539368318649 in 3hrs, 49mins,
> 12.419sec, unfinishedSiblingCount=0
> 2018-10-12 16:29:54,058 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=411981,
> state=SUCCESS, bypass=LOG-REDACTED MoveRegionProcedure
> hri=f5f9ff1e4b0f2d9555dabfcca71df568,
> source=va1002.halxg.cloudera.com,22101,1539368318649,
> destination=vd1021.halxg.cloudera.com,22101,1539368317897 in 3hrs, 49mins,
> 12.878sec
> 2018-10-12 16:29:54,059 INFO
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: xlock for
> pid=412210, ppid=411983, state=RUNNABLE:REGION_TRANSITION_QUEUE;
> AssignProcedure table=IntegrationTestBigLinkedList_20180709093726,
> region=f5f9ff1e4b0f2d9555dabfcca71df568
> 2018-10-12 16:29:54,105 WARN
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure:
> f5f9ff1e4b0f2d9555dabfcca71df568 owned by pid=411982, CANNOT run 'this'
> (pid=412210).
> ....
> {code}
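The deadlock in the description is a three-way wait cycle. The MRP waits on its unassign child, the child waits on the SCP it queued, and the SCP waits on the region lock the MRP holds. A minimal Java sketch can model that cycle as a wait-for graph. It is a hypothetical simplification, not HBase code: the class `AmvDeadlockSketch`, the `hasCycle` helper and the string labels are illustrative assumptions, and real procedures are identified by pid rather than by name.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the amv2 lock cycle; not part of HBase.
public class AmvDeadlockSketch {

    // Each key waits on exactly one value. Walk the chain from `start`;
    // revisiting a node means a cycle is reachable, i.e. nothing can progress.
    static boolean hasCycle(Map<String, String> waitsFor, String start) {
        Set<String> seen = new HashSet<>();
        String cur = start;
        while (cur != null) {
            if (!seen.add(cur)) {
                return true;
            }
            cur = waitsFor.get(cur);
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> waitsFor = new HashMap<>();
        // Parent MoveRegionProcedure waits for its unassign subprocedure.
        waitsFor.put("MRP", "Unassign");
        // Dispatch failed (server not online): unassign waits on the SCP it queued.
        waitsFor.put("Unassign", "SCP");
        // The SCP needs the region lock, which the MRP still holds.
        waitsFor.put("SCP", "MRP");
        System.out.println(hasCycle(waitsFor, "MRP"));
    }
}
```

Running `main` prints `true`. Dropping the SCP -> MRP edge (the SCP no longer blocked on the region lock) breaks the cycle and `hasCycle` returns `false`.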