[
https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391634#comment-16391634
]
stack commented on HBASE-20152:
-------------------------------
Thanks for taking a look.
bq. should unassign abandon....
Trying to think through abandon... It would work in the small, but the hard
part is how to fail compound procedures like Move and Disable Table... Split,
Merge...
bq. or possibly declare itself successful....
That was the original patch in TestRSGroups. My subsequent worry was data loss:
* Server crashes. It was carrying a heavy in-memory load of edits for region X
when it went down; i.e. it needs WAL replay.
* Notice of expiration schedules an SCP and removes the server from the
online list.
* A Move starts (a move is an unassign followed by an assign).
* Move's Unassign fails its RPC and then fails to expire the server (expiration
is already ongoing... the SCP is busy splitting logs).
* Move does its Assign... Region is onlined before WAL splitting completes, so
the edits that were only in memory are not replayed.
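To make the ordering problem above concrete, here is a minimal sketch of a gate that would hold off the Assign until log splitting is done. It is a toy model, not HBase code: `WalSplitGate`, `splitStarted`, `splitFinished` and `safeToAssign` are all hypothetical names made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model, not the AMv2 implementation: remembers which crashed
// servers still have WALs being split by an SCP.
class WalSplitGate {
    private final Set<String> serversBeingSplit = new HashSet<>();

    // Called when an SCP starts splitting logs for a crashed server.
    void splitStarted(String server) {
        serversBeingSplit.add(server);
    }

    // Called when the SCP has finished splitting that server's logs.
    void splitFinished(String server) {
        serversBeingSplit.remove(server);
    }

    // An assign is only safe once the region's last host has no pending split;
    // onlining earlier would skip replay of the unflushed edits.
    boolean safeToAssign(String lastHost) {
        return !serversBeingSplit.contains(lastHost);
    }
}
```

In the scenario above, the Move's Assign would see `safeToAssign` return false for the crashed server and would have to wait or fail rather than online region X.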
bq. If there's a move_region scheduled while a server is down or is going
down... can we fail the move?
Yeah, there are a few simple checks we need to add... server down, table
enabled/disabled. Let me add these.
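Roughly what I have in mind for the checks, as a sketch only. `MovePreconditions`, `rejectReason` and the `TableState` enum here are hypothetical stand-ins, not the actual AMv2 classes or the eventual patch.

```java
import java.util.Optional;

// Hypothetical sketch of up-front checks for a region move.
class MovePreconditions {
    // Simplified stand-in for the table state; not HBase's own type.
    enum TableState { ENABLED, DISABLING, DISABLED }

    // Returns a reason to reject the move, or empty if it may proceed.
    static Optional<String> rejectReason(boolean sourceServerOnline, TableState tableState) {
        if (!sourceServerOnline) {
            // Server is down or going down; leave the region to the SCP.
            return Optional.of("source server is not online");
        }
        if (tableState != TableState.ENABLED) {
            // Do not assign regions of a table that is disabled or disabling.
            return Optional.of("table is not enabled: " + tableState);
        }
        return Optional.empty();
    }
}
```

Failing the move at submission time keeps it from ever queuing an Unassign against a server that an SCP is already handling.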
Was trying to think through the corner cases.
Thanks for the help.
> [AMv2] DisableTableProcedure versus ServerCrashProcedure
> --------------------------------------------------------
>
> Key: HBASE-20152
> URL: https://issues.apache.org/jira/browse/HBASE-20152
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Reporter: stack
> Assignee: stack
> Priority: Major
>
> Seeing a small spate of issues where disabled tables/regions are being
> assigned. Usually they happen when a DisableTableProcedure is running
> concurrently with a ServerCrashProcedure. See below. See associated
> HBASE-20131. This is the umbrella issue for fixing them.
> h3. Deadlock
> From HBASE-20137, 'TestRSGroups is Flakey',
> https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325
> {code}
> * SCP is running because a server was aborted in test.
> * SCP starts AssignProcedure of region X from crashed server.
> * DisableTable Procedure runs because test has finished and we're doing
> table delete. Queues UnassignProcedure for region X.
> * Disable Unassign gets Lock on region X first.
> * SCP AssignProcedure tries to get lock, waits on lock.
> * DisableTable Procedure UnassignProcedure RPC fails because server is down
> (That's why the SCP).
> * Tries to expire the server it failed the RPC against. Fails (currently
> being SCP'd).
> * DisableTable Procedure Unassign is suspended. It is a suspend with lock on
> region X held
> * SCP can't run because lock on X is held
> * Test times out.
> {code}
> h3. Delete of online Regions
> Saw this in nightly failure #452 for branch-2 in
> TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
> {code}
> * DisableTableProcedure is queued before SCP.
> * DisableTableProcedure Unassign fails because can't RPC to crashed server
> and can't expire.
> * Unassign is Stuck in suspend.
> * SCP runs and cleans up suspended Disable Unassign.
> * SCP completes which includes assign of Disable Unassign region.
> * Disable Unassign completes
> * Disable completes.
> * A scheduled Drop Table Procedure runs (it's end of test).
> * Succeeds deleting regions that are actually assigned (see above where SCP
> assigned region).
> {code}