[
https://issues.apache.org/jira/browse/HBASE-20634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500758#comment-16500758
]
stack commented on HBASE-20634:
-------------------------------
Sorry about that. Here is the addendum that I pushed:
{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/replication/RefreshPeerProcedure.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/replication/RefreshPeerProcedure.java
index ba9bcdc02d..10e16e9a56 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/replication/RefreshPeerProcedure.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/replication/RefreshPeerProcedure.java
@@ -22,6 +22,7 @@ import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.master.procedure.MasterProcedureEnv;
import org.apache.hadoop.hbase.master.procedure.PeerProcedureInterface;
import org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.ServerOperation;
+import org.apache.hadoop.hbase.procedure2.FailedRemoteDispatchException;
import org.apache.hadoop.hbase.procedure2.Procedure;
import org.apache.hadoop.hbase.procedure2.ProcedureEvent;
import org.apache.hadoop.hbase.procedure2.ProcedureStateSerializer;
@@ -166,10 +167,12 @@ public class RefreshPeerProcedure extends Procedure<MasterProcedureEnv>
// retry
dispatched = false;
}
- if (!env.getRemoteDispatcher().addOperationToNode(targetServer, this)) {
+ try {
+ env.getRemoteDispatcher().addOperationToNode(targetServer, this);
+ } catch (FailedRemoteDispatchException frde) {
LOG.info("Can not add remote operation for refreshing peer {} for {} to {}, " +
- "this usually because the server is already dead, " +
- "give up and mark the procedure as complete", peerId, type, targetServer);
+ "this is usually because the server is already dead, " +
+ "give up and mark the procedure as complete", peerId, type, targetServer, frde);
return null;
}
dispatched = true;
{code}
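For anyone skimming, a minimal, self-contained sketch of the pattern the addendum moves to: the dispatch call signals a missing node by throwing rather than by returning false, and the caller catches the exception and gives up. Every name below (DispatchSketch, Dispatcher, FailedDispatchException, dispatchOrGiveUp) is a hypothetical stand-in, and the map-based guard is an assumption for illustration; this is not the HBase implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-ins for illustration only; these are NOT the HBase classes.
final class DispatchSketch {

  /** Thrown when the target server has no entry in the dispatcher's node map. */
  static final class FailedDispatchException extends Exception {
    FailedDispatchException(String message) {
      super(message);
    }
  }

  /** Toy dispatcher: a server is dispatchable only while it has an entry in nodeMap. */
  static final class Dispatcher {
    private final ConcurrentMap<String, List<String>> nodeMap = new ConcurrentHashMap<>();

    void addNode(String server) {
      nodeMap.putIfAbsent(server, new ArrayList<>());
    }

    void removeNode(String server) {
      nodeMap.remove(server);
    }

    /** Queues the operation, or throws if the server is no longer registered. */
    void addOperationToNode(String server, String operation) throws FailedDispatchException {
      // computeIfPresent runs atomically for the key, so the map itself acts as the guard.
      List<String> queue = nodeMap.computeIfPresent(server, (key, ops) -> {
        ops.add(operation);
        return ops;
      });
      if (queue == null) {
        throw new FailedDispatchException("no node registered for " + server);
      }
    }
  }

  /** Returns true if the operation was dispatched, false if the caller gave up. */
  static boolean dispatchOrGiveUp(Dispatcher dispatcher, String server, String operation) {
    try {
      dispatcher.addOperationToNode(server, operation);
      return true;
    } catch (FailedDispatchException e) {
      // Treat this like a crashed server: stop retrying and let the caller finish.
      System.out.println("Giving up on " + operation + ": " + e.getMessage());
      return false;
    }
  }

  public static void main(String[] args) {
    Dispatcher dispatcher = new Dispatcher();
    dispatcher.addNode("rs1");
    System.out.println(dispatchOrGiveUp(dispatcher, "rs1", "refresh-peer")); // server registered
    dispatcher.removeNode("rs1");
    System.out.println(dispatchOrGiveUp(dispatcher, "rs1", "refresh-peer")); // server removed
  }
}
```

The point of throwing instead of returning a boolean is that the failure carries a cause that can be logged, as the addendum does by passing frde to LOG.info.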
If this addendum is not what you want, [~Apache9], shout, and I'll open a new issue to fix it.
> Reopen region while server crash can cause the procedure to be stuck
> --------------------------------------------------------------------
>
> Key: HBASE-20634
> URL: https://issues.apache.org/jira/browse/HBASE-20634
> Project: HBase
> Issue Type: Bug
> Reporter: Duo Zhang
> Assignee: stack
> Priority: Critical
> Fix For: 3.0.0, 2.1.0, 2.0.1
>
> Attachments: HBASE-20634-UT.patch, HBASE-20634.branch-2.0.001.patch,
> HBASE-20634.branch-2.0.002.patch, HBASE-20634.branch-2.0.003.patch,
> HBASE-20634.branch-2.0.004.patch, HBASE-20634.branch-2.0.005.patch,
> HBASE-20634.branch-2.0.006.patch, HBASE-20634.branch-2.0.006.patch,
> HBASE-20634.branch-2.0.007.patch, HBASE-20634.branch-2.0.008.patch,
> HBASE-20634.branch-2.0.009.patch
>
>
> Found this when implementing HBASE-20424, where we will transit the peer sync
> replication state while there is a server crash.
> The problem is that, in ServerCrashAssign, we do not hold the region lock, so
> after we call handleRIT to clear the existing assign/unassign procedures
> related to this rs, and before we schedule the assign procedures, it is
> possible that we schedule an unassign procedure for a region on the crashed
> rs. This procedure will not receive the ServerCrashException; instead, in
> addToRemoteDispatcher, it will find that it cannot dispatch the remote call
> and a FailedRemoteDispatchException will be raised. But we do not treat this
> exception the same as ServerCrashException; instead, we try to expire the rs.
> The rs has already been marked as expired, so this is almost a no-op, and the
> procedure is then stuck forever.
> A possible fix is to treat FailedRemoteDispatchException the same as
> ServerCrashException, as it is created only in addToRemoteDispatcher, and the
> only reason we cannot dispatch a remote call is that the rs is already dead.
> The nodeMap is a ConcurrentMap, so I think we could use it as a guard.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)