[
https://issues.apache.org/jira/browse/HBASE-20634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack resolved HBASE-20634.
---------------------------
Resolution: Fixed
Pushed addendum. Re-resolving.
> Reopen region while server crash can cause the procedure to be stuck
> --------------------------------------------------------------------
>
> Key: HBASE-20634
> URL: https://issues.apache.org/jira/browse/HBASE-20634
> Project: HBase
> Issue Type: Bug
> Reporter: Duo Zhang
> Assignee: stack
> Priority: Critical
> Fix For: 3.0.0, 2.1.0, 2.0.1
>
> Attachments: HBASE-20634-UT.patch, HBASE-20634.branch-2.0.001.patch,
> HBASE-20634.branch-2.0.002.patch, HBASE-20634.branch-2.0.003.patch,
> HBASE-20634.branch-2.0.004.patch, HBASE-20634.branch-2.0.005.patch,
> HBASE-20634.branch-2.0.006.patch, HBASE-20634.branch-2.0.006.patch,
> HBASE-20634.branch-2.0.007.patch, HBASE-20634.branch-2.0.008.patch,
> HBASE-20634.branch-2.0.009.patch
>
>
> Found this when implementing HBASE-20424, where we transition the peer sync
> replication state while there is a server crash.
> The problem is that, in ServerCrashAssign, we do not hold the region lock, so
> after we call handleRIT to clear the existing assign/unassign procedures
> related to this rs, and before we schedule the assign procedures, it is
> possible that an unassign procedure gets scheduled for a region on the
> crashed rs. This procedure will not receive the ServerCrashException.
> Instead, in addToRemoteDispatcher, it will find that it cannot dispatch the
> remote call, and a FailedRemoteDispatchException will be raised. But we do
> not treat this exception the same as ServerCrashException; we try to expire
> the rs instead. The rs has already been marked as expired, so this is almost
> a no-op, and the procedure will be stuck there forever.
> A possible fix is to treat FailedRemoteDispatchException the same as
> ServerCrashException, since it is created only in addToRemoteDispatcher, and
> the only reason we cannot dispatch a remote call is that the rs is already
> dead. The nodeMap is a ConcurrentMap, so I think we could use it as a guard.
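
To make the failure mode and the proposed fix concrete, here is a minimal,
self-contained Java sketch. It is not the HBase code: the class
DispatchFailureSketch, the Outcome enum, the unassign() method and the
treatDispatchFailureAsCrash flag are all hypothetical, and the two exception
classes are local stand-ins that only borrow the names used in the
description. The sketch assumes that nodeMap holds the servers that can still
receive remote calls and that expiring a server removes its entry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy model of the race described above. All types here are illustrative
// stand-ins and do not mirror the real HBase classes.
class DispatchFailureSketch {

  /** Stand-in: raised when a remote call cannot be dispatched. */
  static class FailedRemoteDispatchException extends Exception {
    FailedRemoteDispatchException(String msg) { super(msg); }
  }

  /** Hypothetical outcomes of an unassign attempt in this model. */
  enum Outcome { DISPATCHED, HANDLED_AS_CRASH, STUCK }

  // Assumption: one entry per server that can still receive remote calls.
  private final ConcurrentMap<String, Boolean> nodeMap = new ConcurrentHashMap<>();

  void serverOnline(String server) { nodeMap.put(server, Boolean.TRUE); }

  // Expiring a server that is already expired changes nothing.
  void expireServer(String server) { nodeMap.remove(server); }

  // Uses nodeMap as the guard: dispatch fails if the server is gone.
  void addToRemoteDispatcher(String server) throws FailedRemoteDispatchException {
    if (!nodeMap.containsKey(server)) {
      throw new FailedRemoteDispatchException(server + " is not in nodeMap");
    }
  }

  // An unassign scheduled after handleRIT ran but before the assigns exist.
  Outcome unassign(String server, boolean treatDispatchFailureAsCrash) {
    try {
      addToRemoteDispatcher(server);
      return Outcome.DISPATCHED;
    } catch (FailedRemoteDispatchException e) {
      if (treatDispatchFailureAsCrash) {
        // Proposed behaviour: follow the same path as a server crash so the
        // procedure can make progress.
        return Outcome.HANDLED_AS_CRASH;
      }
      // Reported behaviour: try to expire the rs again, which is a no-op,
      // and nothing ever wakes the procedure up.
      expireServer(server);
      return Outcome.STUCK;
    }
  }

  public static void main(String[] args) {
    DispatchFailureSketch sketch = new DispatchFailureSketch();
    sketch.serverOnline("rs1");
    sketch.expireServer("rs1"); // the server crashes and is expired
    System.out.println("without fix: " + sketch.unassign("rs1", false));
    System.out.println("with fix:    " + sketch.unassign("rs1", true));
  }
}
```

The flag exists only so both behaviours can be compared side by side in one
run; an actual fix would simply route the dispatch failure through the
existing server-crash handling.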