[
https://issues.apache.org/jira/browse/HBASE-20657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562239#comment-16562239
]
Josh Elser commented on HBASE-20657:
------------------------------------
{quote}To reproduce the problem - make MTP holding the lock (as the patch does)
and run the provided test. Everything will end up with a number of regions
stuck in RiT state forever
{quote}
Nice!
{code:java}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureScheduler.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureScheduler.java
index 69a6e8f..52217f1 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureScheduler.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureScheduler.java
@@ -197,7 +197,8 @@ public class MasterProcedureScheduler extends AbstractProcedureScheduler {
         // check if the next procedure is still a child.
         // if not, remove the rq from the fairq and go back to the xlock state
         Procedure<?> nextProc = rq.peek();
-        if (nextProc != null && !Procedure.haveSameParent(nextProc, pollResult)) {
+        if (nextProc != null && !Procedure.haveSameParent(nextProc, pollResult)
+            && nextProc.getRootProcId() != pollResult.getRootProcId()) {
           removeFromRunQueue(fairq, rq);
         }
       }{code}
I don't see why this shouldn't also be applied to 2.x, but maybe
[~zghaobac] or [~Apache9] know of a reason (after looking at HBASE-20569)
not to make this change in 2.x?
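To make the effect of the extra condition concrete, here is a minimal sketch. It assumes a hypothetical Proc stand-in that carries only a procedure id, parent id and root id, and a simplified haveSameParent. None of this is the real MasterProcedureScheduler code. It only shows that a queued procedure with a different parent but the same root is removed from the run queue under the old check and kept under the patched one.

```java
public class RunQueueCheckSketch {
  // Hypothetical marker for "no parent"; not taken from HBase.
  public static final long NO_PARENT = -1L;

  // Hypothetical stand-in for a procedure: only the ids the check needs.
  public static final class Proc {
    final long procId;
    final long parentId;
    final long rootId;

    public Proc(long procId, long parentId, long rootId) {
      this.procId = procId;
      this.parentId = parentId;
      this.rootId = rootId;
    }
  }

  // Simplified assumption: two procedures share a parent only if both
  // have one and the parent ids match.
  public static boolean haveSameParent(Proc a, Proc b) {
    return a.parentId != NO_PARENT && b.parentId != NO_PARENT
        && a.parentId == b.parentId;
  }

  // Condition as it reads on the '-' line of the diff.
  public static boolean removeBeforePatch(Proc next, Proc polled) {
    return next != null && !haveSameParent(next, polled);
  }

  // Condition as it reads on the '+' lines of the diff.
  public static boolean removeAfterPatch(Proc next, Proc polled) {
    return next != null && !haveSameParent(next, polled)
        && next.rootId != polled.rootId;
  }

  public static void main(String[] args) {
    Proc polled = new Proc(2, 1, 1);      // child of root procedure 1
    Proc grandchild = new Proc(3, 2, 1);  // different parent, same root
    Proc unrelated = new Proc(10, 9, 9);  // belongs to another root

    System.out.println(removeBeforePatch(grandchild, polled)); // true
    System.out.println(removeAfterPatch(grandchild, polled));  // false
    System.out.println(removeAfterPatch(unrelated, polled));   // true
  }
}
```

Under these assumptions, the patched check only drops the queue when the next procedure belongs to a different root procedure.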
> Retrying RPC call for ModifyTableProcedure may get stuck
> --------------------------------------------------------
>
> Key: HBASE-20657
> URL: https://issues.apache.org/jira/browse/HBASE-20657
> Project: HBase
> Issue Type: Bug
> Components: Client, proc-v2
> Affects Versions: 2.0.0
> Reporter: Sergey Soldatov
> Assignee: stack
> Priority: Critical
> Fix For: 3.0.0, 2.0.2
>
> Attachments: HBASE-20657-1-branch-2.patch,
> HBASE-20657-2-branch-2.patch, HBASE-20657-3-branch-2.patch,
> HBASE-20657-4-master.patch, HBASE-20657-testcase-branch2.patch
>
>
> Env: 2 masters, 1 RS.
> Steps to reproduce: The active master is killed while a ModifyTableProcedure
> is executing.
> If the table has enough regions, some of them may already be closed by the
> time the secondary master becomes active. When the client then retries the
> call against the new active master, a new ModifyTableProcedure is created and
> gets stuck while handling the MODIFY_TABLE_REOPEN_ALL_REGIONS state. That
> happens because:
> 1. When retrying from the client side, we call modifyTableAsync, which
> creates a procedure with a new nonce key:
> {noformat}
> ModifyTableRequest request = RequestConverter.buildModifyTableRequest(
>     td.getTableName(), td, ng.getNonceGroup(), ng.newNonce());
> {noformat}
> So on the server side, it is treated as a new procedure and starts
> executing immediately.
> 2. When processing MODIFY_TABLE_REOPEN_ALL_REGIONS, we create a
> MoveRegionProcedure for each region, but it checks whether the region is
> online (and it is not), so it fails immediately, forcing the procedure to
> restart.
> [[email protected]] saw a similar case where two concurrent ModifyTable
> procedures were running and got stuck in a similar way.
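The nonce behaviour in step 1 of the description can be illustrated with a small sketch. It assumes a hypothetical in-memory registry keyed by (nonce group, nonce). The class and method names are made up for illustration and are not the HBase nonce manager API. A retry that reuses the nonce maps back to the procedure already submitted, while a retry with a fresh nonce is registered as a brand-new procedure.

```java
import java.util.HashMap;
import java.util.Map;

public class NonceDedupSketch {
  // Hypothetical registry: "group:nonce" -> id of the procedure created for it.
  private final Map<String, Long> procIdByNonce = new HashMap<>();
  private long nextProcId = 1;

  // Returns the existing procedure id when the nonce key was seen before,
  // otherwise registers and returns a new procedure id.
  public long submit(long nonceGroup, long nonce) {
    String key = nonceGroup + ":" + nonce;
    Long existing = procIdByNonce.get(key);
    if (existing != null) {
      return existing;
    }
    long procId = nextProcId++;
    procIdByNonce.put(key, procId);
    return procId;
  }

  public static void main(String[] args) {
    NonceDedupSketch master = new NonceDedupSketch();
    long first = master.submit(7L, 100L);
    long sameNonceRetry = master.submit(7L, 100L); // deduplicated
    long newNonceRetry = master.submit(7L, 101L);  // treated as new work

    System.out.println(first == sameNonceRetry); // true
    System.out.println(first == newNonceRetry);  // false
  }
}
```

In this simplified model, the retry described in the report corresponds to the second case: because the nonce differs, nothing ties the new request to the procedure that was already in flight.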