[
https://issues.apache.org/jira/browse/HBASE-20634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496740#comment-16496740
]
stack edited comment on HBASE-20634 at 5/31/18 3:55 PM:
--------------------------------------------------------
bq. Adding a SERVER_CRASH_HANDLE_RIT2 state cannot solve the problem, even if we
do not hold the region lock in MoveRegionProcedure. It is possible that the
MoveRegionProcedure is created but never scheduled, and after we execute
SERVER_CRASH_HANDLE_RIT2 the UnassignProcedure is scheduled and then gets stuck...
Preflight checks should catch the case where we try to schedule a Procedure
against a downed server; i.e. when the MoveProcedure runs in your scenario, it
would fail because the SCP had finished up and the server would have been
marked as safely offline/processed (though there might be a small window in
which it could still get scheduled... yes... the approach here is better for
that reason). Looking at MoveProcedure and UnassignProcedure, they could also
do with more checking before launch; let me include that in this patch.
SERVER_CRASH_HANDLE_RIT2 has been removed in this patch in favor of a more
targeted check.
bq. The problem is whether we should fail the procedure.
Failing a procedure during preparation is fine. Once the Procedure has started,
failing it becomes complicated: rollback is involved and not fully fleshed out
for many Procedure steps. Given this, the philosophy has been to try to
'complete' the Procedure, even though in an Unassign case, say, that can mean
the Unassign was already done by another concurrent Procedure (e.g. SCP).
> Reopen region while server crash can cause the procedure to be stuck
> --------------------------------------------------------------------
>
> Key: HBASE-20634
> URL: https://issues.apache.org/jira/browse/HBASE-20634
> Project: HBase
> Issue Type: Bug
> Reporter: Duo Zhang
> Assignee: stack
> Priority: Critical
> Fix For: 3.0.0, 2.1.0, 2.0.1
>
> Attachments: HBASE-20634-UT.patch, HBASE-20634.branch-2.0.001.patch
>
>
> Found this when implementing HBASE-20424, where we will transit the peer sync
> replication state while there is a server crash.
> The problem is that, in ServerCrashAssign, we do not hold the region lock, so
> it is possible that, after we call handleRIT to clear the existing
> assign/unassign procedures related to this rs and before we schedule the
> assign procedures, we schedule an unassign procedure for a region on the
> crashed rs. This procedure will not receive the ServerCrashException;
> instead, in addToRemoteDispatcher, it will find that it cannot dispatch the
> remote call and a FailedRemoteDispatchException will be raised. But we do not
> treat this exception the same as ServerCrashException; instead, we try to
> expire the rs. Obviously the rs has already been marked as expired, so this
> is almost a no-op, and the procedure will be stuck there forever.
> A possible way to fix it is to treat FailedRemoteDispatchException the same
> as ServerCrashException, as it is created only in addToRemoteDispatcher, and
> the only reason we cannot dispatch a remote call is that the rs is already
> dead. The nodeMap is a ConcurrentMap, so I think we could use it as a guard.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)