[jira] [Commented] (HBASE-20173) [AMv2] DisableTableProcedure concurrent to ServerCrashProcedure can deadlock

stack (JIRA) Wed, 30 May 2018 21:20:28 -0700


    [ https://issues.apache.org/jira/browse/HBASE-20173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496071#comment-16496071 ]


stack commented on HBASE-20173:
-------------------------------

So, coming back here: this patch did NOT fix the problem described after all.
The attached patch helped with a similar scenario -- a region unassign being
scheduled and failing its RPC against a server that is being SCP'd -- but it
did not fix the scenario described. I failed to repro the described problem in
cluster testing and thought it too hard to manufacture the circumstance in a unit
test, but [~Apache9] recreated the scenario described above in a unit test over
in HBASE-20634. In HBASE-20634, we make a proper fix for this problem (and undo
the change made here, since the HBASE-20634 fix is more comprehensive).

> [AMv2] DisableTableProcedure concurrent to ServerCrashProcedure can deadlock
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-20173
>                 URL: https://issues.apache.org/jira/browse/HBASE-20173
>             Project: HBase
>          Issue Type: Sub-task
>          Components: amv2
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: HBASE-20173.branch-2.001.patch, 
> HBASE-20173.branch-2.002.patch
>
>
> See the 'Deadlock' scenario in the parent issue. Doing this as a focused subtask
> since the parent has a few things going on in it.
> Let me reproduce it below:
> From HBASE-20137, 'TestRSGroups is Flakey', 
> https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325
>  * SCP is running because a server was aborted in the test.
>  * SCP starts an AssignProcedure for region X, which was on the crashed server.
>  * DisableTableProcedure runs because the test has finished and we're doing
> table delete. It queues an UnassignProcedure for region X.
>  * The DisableTable UnassignProcedure gets the lock on region X first.
>  * The SCP AssignProcedure tries to get the lock and waits on it.
>  * The DisableTable UnassignProcedure RPC fails because the server is down
> (that's why the SCP).
>  * It tries to expire the server it failed the RPC against. This fails (the
> server is currently being SCP'd).
>  * The DisableTable UnassignProcedure is suspended, and it is suspended with
> the lock on region X held.
>  * SCP can't run because the lock on X is held.
>  * Test times out.
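The steps above form a wait-for cycle. Below is a minimal sketch in a toy model, assuming a simplified picture of the procedures: `WaitForGraph`, `simulate_scenario`, and the `hold_lock_while_suspended` flag are hypothetical names invented for illustration and are not HBase APIs, and labels like "Assign(X)" are just strings. An edge a -> b means "a cannot make progress until b does". It also assumes the suspended Unassign is only woken by the crash handling for the server, i.e. it waits on the SCP. The release-on-suspend variant only shows that removing one edge breaks the cycle; it is not a description of the HBASE-20634 fix.

```python
"""Toy wait-for-graph model of the HBASE-20173 deadlock scenario.

Hypothetical simplification: nothing here is an HBase API. An edge
"a -> b" means "procedure a cannot make progress until b does".
"""
from collections import defaultdict


class WaitForGraph:
    def __init__(self):
        self.edges = defaultdict(set)

    def add_wait(self, waiter, holder):
        # Record that `waiter` is blocked on `holder`.
        self.edges[waiter].add(holder)

    def find_cycle(self):
        # Colouring DFS: GREY nodes are on the current path, so reaching a
        # GREY node again closes a cycle. Returns the cycle or None.
        WHITE, GREY, BLACK = 0, 1, 2
        colour = defaultdict(int)
        path = []

        def dfs(node):
            colour[node] = GREY
            path.append(node)
            for nxt in sorted(self.edges[node]):
                if colour[nxt] == GREY:
                    return path[path.index(nxt):] + [nxt]
                if colour[nxt] == WHITE:
                    found = dfs(nxt)
                    if found:
                        return found
            path.pop()
            colour[node] = BLACK
            return None

        for node in sorted(self.edges):
            if colour[node] == WHITE:
                cycle = dfs(node)
                if cycle:
                    return cycle
        return None


def simulate_scenario(hold_lock_while_suspended=True):
    graph = WaitForGraph()
    lock_owner = {}  # region -> procedure currently holding its lock

    def try_lock(proc, region):
        owner = lock_owner.get(region)
        if owner is None:
            lock_owner[region] = proc
            return True
        graph.add_wait(proc, owner)  # proc queues behind the current holder
        return False

    # SCP for crashed server S schedules an Assign of region X as a child;
    # the parent cannot finish until the child does.
    graph.add_wait("SCP(S)", "Assign(X)")
    # DisableTable's Unassign of X gets the region lock first...
    try_lock("Unassign(X)", "X")
    # ...so the SCP's Assign of X waits on that lock.
    try_lock("Assign(X)", "X")
    # The Unassign RPC to S fails and expiring S is refused (S is already
    # being SCP'd), so the Unassign suspends. Assumption: it is only woken
    # by the crash handling for S, i.e. it waits on the SCP.
    graph.add_wait("Unassign(X)", "SCP(S)")

    if not hold_lock_while_suspended:
        # Hypothetical variant: release X on suspend. The waiting Assign
        # takes the lock and stops waiting on the Unassign.
        del lock_owner["X"]
        graph.edges["Assign(X)"].discard("Unassign(X)")
        try_lock("Assign(X)", "X")

    return graph.find_cycle()


print(simulate_scenario())  # prints ['Assign(X)', 'Unassign(X)', 'SCP(S)', 'Assign(X)']
print(simulate_scenario(hold_lock_while_suspended=False))  # prints None
```

Reading the cycle left to right: the Assign waits on the Unassign's region lock, the suspended Unassign waits on the SCP, and the SCP waits on its child Assign. Nothing in the cycle can move, which matches the test timing out.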



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
