[ https://issues.apache.org/jira/browse/HBASE-20657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562239#comment-16562239 ]

Josh Elser commented on HBASE-20657:
------------------------------------

{quote}To reproduce the problem, make MTP hold the lock (as the patch does) 
and run the provided test. A number of regions will end up stuck in RiT 
state forever.
{quote}
Nice!
{code:java}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureScheduler.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureScheduler.java
index 69a6e8f..52217f1 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureScheduler.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureScheduler.java
@@ -197,7 +197,8 @@ public class MasterProcedureScheduler extends AbstractProcedureScheduler {
       // check if the next procedure is still a child.
       // if not, remove the rq from the fairq and go back to the xlock state
       Procedure<?> nextProc = rq.peek();
-      if (nextProc != null && !Procedure.haveSameParent(nextProc, pollResult)) {
+      if (nextProc != null && !Procedure.haveSameParent(nextProc, pollResult)
+          && nextProc.getRootProcId() != pollResult.getRootProcId()) {
         removeFromRunQueue(fairq, rq);
       }
     }{code}
I don't know of a reason why this shouldn't also be applied to 2.x, but maybe 
[~zghaobac] or [~Apache9] know of one (after looking at HBASE-20569) that 
would keep us from wanting this change in 2.x?
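
To make the effect of the extra condition concrete, here is a minimal, self-contained sketch. It is not HBase code: the Proc class, its fields, and the two removesQueue* helpers are hypothetical stand-ins, and it assumes haveSameParent returns true only when both procedures have a parent and the parent ids match. Under that assumption, the changed check keeps the run queue when the next procedure belongs to the same procedure tree, even if it is not a direct sibling.

```java
// Hypothetical, simplified model of the check touched by the diff.
// None of these types are the real HBase classes.
final class RunQueueCheckSketch {
  static final long NO_PROC_ID = -1;

  static final class Proc {
    final long procId;
    final long parentProcId;
    final long rootProcId;

    Proc(long procId, long parentProcId, long rootProcId) {
      this.procId = procId;
      this.parentProcId = parentProcId;
      this.rootProcId = rootProcId;
    }

    boolean hasParent() {
      return parentProcId != NO_PROC_ID;
    }
  }

  // Assumed semantics: true only if both have a parent and it is the same one.
  static boolean haveSameParent(Proc a, Proc b) {
    return a.hasParent() && b.hasParent() && a.parentProcId == b.parentProcId;
  }

  // Condition before the patch: remove the queue unless the next proc is a sibling.
  static boolean removesQueueBefore(Proc next, Proc polled) {
    return next != null && !haveSameParent(next, polled);
  }

  // Condition after the patch: additionally require a different root procedure.
  static boolean removesQueueAfter(Proc next, Proc polled) {
    return next != null && !haveSameParent(next, polled)
        && next.rootProcId != polled.rootProcId;
  }

  public static void main(String[] args) {
    Proc polled = new Proc(2, 1, 1);      // child of procedure 1
    Proc grandchild = new Proc(3, 2, 1);  // child of procedure 2, same root
    Proc unrelated = new Proc(11, 10, 10); // belongs to another tree

    System.out.println(removesQueueBefore(grandchild, polled));
    System.out.println(removesQueueAfter(grandchild, polled));
    System.out.println(removesQueueAfter(unrelated, polled));
  }
}
```

In this model the two conditions differ only for procedures that share a root but not a parent; siblings and unrelated procedures are treated the same before and after.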

> Retrying RPC call for ModifyTableProcedure may get stuck
> --------------------------------------------------------
>
>                 Key: HBASE-20657
>                 URL: https://issues.apache.org/jira/browse/HBASE-20657
>             Project: HBase
>          Issue Type: Bug
>          Components: Client, proc-v2
>    Affects Versions: 2.0.0
>            Reporter: Sergey Soldatov
>            Assignee: stack
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.2
>
>         Attachments: HBASE-20657-1-branch-2.patch, 
> HBASE-20657-2-branch-2.patch, HBASE-20657-3-branch-2.patch, 
> HBASE-20657-4-master.patch, HBASE-20657-testcase-branch2.patch
>
>
> Env: 2 masters, 1 RS. 
> Steps to reproduce: Kill the active master while a ModifyTableProcedure is 
> executing. 
> If the table has enough regions, some of them may already be closed when the 
> secondary master becomes active. Once the client retries the call against the 
> new active master, a new ModifyTableProcedure is created and gets stuck 
> during MODIFY_TABLE_REOPEN_ALL_REGIONS state handling. That happens because:
> 1. When retrying from the client side, we call modifyTableAsync, which 
> creates a procedure with a new nonce key:
> {noformat}
>          ModifyTableRequest request = RequestConverter.buildModifyTableRequest(
>             td.getTableName(), td, ng.getNonceGroup(), ng.newNonce());
> {noformat}
>  So on the server side, it is treated as a new procedure and starts 
> executing immediately.
> 2. When processing MODIFY_TABLE_REOPEN_ALL_REGIONS, we create a 
> MoveRegionProcedure for each region, but it checks whether the region is 
> online (and it's not), so it fails immediately, forcing the procedure to 
> restart.
> [[email protected]] saw a similar case where two concurrent ModifyTable 
> procedures were running and got stuck in a similar way. 
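
Step 1 of the description above can be illustrated with a small sketch of nonce-based deduplication. This is a toy model, not the HBase nonce manager: NonceRetrySketch, submit, and the string key are hypothetical, and it models a single server that remembers every nonce it has seen. It only shows why a retry that carries a fresh nonce is treated as a brand-new procedure, while a retry that reuses the nonce would map back to the existing one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of nonce-based request deduplication.
final class NonceRetrySketch {
  private final Map<String, Long> procIdByNonceKey = new HashMap<>();
  private long nextProcId = 1;

  // Returns the id of the procedure registered for this nonce key,
  // creating a new procedure only if the key has not been seen before.
  long submit(long nonceGroup, long nonce) {
    String key = nonceGroup + ":" + nonce;
    Long existing = procIdByNonceKey.get(key);
    if (existing != null) {
      return existing; // duplicate request: reuse the running procedure
    }
    long procId = nextProcId++;
    procIdByNonceKey.put(key, procId);
    return procId;
  }

  public static void main(String[] args) {
    NonceRetrySketch server = new NonceRetrySketch();
    long first = server.submit(7L, 100L);
    long retrySameNonce = server.submit(7L, 100L); // same key as the first call
    long retryNewNonce = server.submit(7L, 101L);  // what a newNonce() retry looks like

    System.out.println(first == retrySameNonce);
    System.out.println(first == retryNewNonce);
  }
}
```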



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)