[jira] [Commented] (HBASE-20634) Reopen region while server crash can cause the procedure to be stuck

stack (JIRA) Mon, 04 Jun 2018 12:39:53 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-20634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500758#comment-16500758
 ]


stack commented on HBASE-20634:
-------------------------------

Sorry about that. Here is the addendum that I pushed:

{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/replication/RefreshPeerProcedure.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/replication/RefreshPeerProcedure.java
index ba9bcdc02d..10e16e9a56 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/replication/RefreshPeerProcedure.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/replication/RefreshPeerProcedure.java
@@ -22,6 +22,7 @@ import org.apache.hadoop.hbase.ServerName;
 import org.apache.hadoop.hbase.master.procedure.MasterProcedureEnv;
 import org.apache.hadoop.hbase.master.procedure.PeerProcedureInterface;
 import org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.ServerOperation;
+import org.apache.hadoop.hbase.procedure2.FailedRemoteDispatchException;
 import org.apache.hadoop.hbase.procedure2.Procedure;
 import org.apache.hadoop.hbase.procedure2.ProcedureEvent;
 import org.apache.hadoop.hbase.procedure2.ProcedureStateSerializer;
@@ -166,10 +167,12 @@ public class RefreshPeerProcedure extends Procedure<MasterProcedureEnv>
       // retry
       dispatched = false;
     }
-    if (!env.getRemoteDispatcher().addOperationToNode(targetServer, this)) {
+    try {
+      env.getRemoteDispatcher().addOperationToNode(targetServer, this);
+    } catch (FailedRemoteDispatchException frde) {
       LOG.info("Can not add remote operation for refreshing peer {} for {} to {}, " +
-        "this usually because the server is already dead, " +
-        "give up and mark the procedure as complete", peerId, type, targetServer);
+        "this is usually because the server is already dead, " +
+        "give up and mark the procedure as complete", peerId, type, targetServer, frde);
       return null;
     }
     dispatched = true;
{code}

If this is not what you want, [~Apache9], shout, and I'll open a new issue to fix it.


> Reopen region while server crash can cause the procedure to be stuck
> --------------------------------------------------------------------
>
>                 Key: HBASE-20634
>                 URL: https://issues.apache.org/jira/browse/HBASE-20634
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Assignee: stack
>            Priority: Critical
>             Fix For: 3.0.0, 2.1.0, 2.0.1
>
>         Attachments: HBASE-20634-UT.patch, HBASE-20634.branch-2.0.001.patch, 
> HBASE-20634.branch-2.0.002.patch, HBASE-20634.branch-2.0.003.patch, 
> HBASE-20634.branch-2.0.004.patch, HBASE-20634.branch-2.0.005.patch, 
> HBASE-20634.branch-2.0.006.patch, HBASE-20634.branch-2.0.006.patch, 
> HBASE-20634.branch-2.0.007.patch, HBASE-20634.branch-2.0.008.patch, 
> HBASE-20634.branch-2.0.009.patch
>
>
> Found this when implementing HBASE-20424, where we transition the peer sync 
> replication state while there is a server crash.
> The problem is that in ServerCrashAssign we do not hold the region lock, so 
> after we call handleRIT to clear the existing assign/unassign procedures 
> related to this rs, and before we schedule the assign procedures, an 
> unassign procedure may be scheduled for a region on the crashed rs. This 
> procedure will not receive the ServerCrashException. Instead, in 
> addToRemoteDispatcher, it will find that it cannot dispatch the remote call, 
> and a FailedRemoteDispatchException will be raised. But we do not treat this 
> exception the same as ServerCrashException; instead, we try to expire the rs. 
> The rs has already been marked as expired, so this is almost a no-op, and the 
> procedure is then stuck forever.
> A possible fix is to treat FailedRemoteDispatchException the same as 
> ServerCrashException, since it is created only in addToRemoteDispatcher, and 
> the only reason we cannot dispatch a remote call is that the rs is already 
> dead. The nodeMap is a ConcurrentMap, so I think we could use it as a guard.
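
The guard idea in the description above can be sketched in isolation. This is a minimal toy model, not HBase code: ToyDispatcher, ToyProcedure and DispatchFailedException are hypothetical stand-ins (for the remote dispatcher, a procedure, and FailedRemoteDispatchException respectively), and server names are plain strings. It assumes only that a dead server has been removed from the node map, so a failed lookup can be treated like a server crash rather than retried.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for FailedRemoteDispatchException; not the HBase class.
class DispatchFailedException extends Exception {
  DispatchFailedException(String message) {
    super(message);
  }
}

// Toy dispatcher: one queue of pending operations per live server.
class ToyDispatcher {
  private final ConcurrentMap<String, Queue<String>> nodeMap = new ConcurrentHashMap<>();

  void addNode(String server) {
    nodeMap.putIfAbsent(server, new ConcurrentLinkedQueue<>());
  }

  // Called when a server is declared dead.
  void removeNode(String server) {
    nodeMap.remove(server);
  }

  // The map is the guard: the operation is queued only if the node is still
  // registered. computeIfPresent is atomic per key on ConcurrentHashMap, so a
  // concurrent removeNode cannot slip in between the check and the add.
  void addOperationToNode(String server, String op) throws DispatchFailedException {
    Queue<String> queue = nodeMap.computeIfPresent(server, (name, pending) -> {
      pending.add(op);
      return pending;
    });
    if (queue == null) {
      throw new DispatchFailedException("No dispatcher node for " + server);
    }
  }
}

// Toy caller: a failed dispatch is handled like a server crash, so the
// procedure finishes instead of waiting for a reply that will never come.
class ToyProcedure {
  enum Outcome { DISPATCHED, SERVER_DEAD }

  static Outcome dispatch(ToyDispatcher dispatcher, String server, String op) {
    try {
      dispatcher.addOperationToNode(server, op);
      return Outcome.DISPATCHED;
    } catch (DispatchFailedException e) {
      return Outcome.SERVER_DEAD;
    }
  }
}
```

With a registered node, ToyProcedure.dispatch returns DISPATCHED; after removeNode for the same server it returns SERVER_DEAD, which is the point where a real procedure would give up or reschedule elsewhere.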



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
