[
https://issues.apache.org/jira/browse/HBASE-20634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-20634:
--------------------------
Status: Patch Available (was: Open)
.001 See below.
{code}
HBASE-20634 Reopen region while server crash can cause the procedure to be
stuck
A reattempt at fixing HBASE-20173 "[AMv2] DisableTableProcedure
concurrent to ServerCrashProcedure can deadlock"
The scenario is a SCP that, after processing WALs, goes to assign the
regions that were on the crashed server, but a concurrent Procedure
gets in there first and tries to unassign a region that was on the
crashed server (it could be part of a move procedure, a disable
table, etc.). The unassign happens to run AFTER the SCP has released
all RPCs that were going against the crashed server. The unassign
fails because the server is crashed. The unassign used to suspend
itself, only it would never be woken up because the server it was
going against had already been processed. Worse, the SCP could make
no progress because the suspended unassign held the lock on a region
the SCP wanted to assign.
Here we teach the unassign to recognize the state where it is running
post-SCP cleanup of RPCs. If that state is present, the unassign
moves to finish instead of suspending itself (see the sketches after
this code block).
Includes a nice unit test by Duo Zhang that reproduces the hung
scenario.
M
hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/FailedRemoteDispatchException.java
Moved this class back to hbase-procedure where it belongs.
M
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoNodeDispatchException.java
M
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoServerDispatchException.java
M
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NullTargetServerDispatchException.java
Specializations of FailedRemoteDispatchException (FRDE) so we can be
more particular when we say there was a problem.
M
hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/RemoteProcedureDispatcher.java
Change addOperationToNode so we throw exceptions that give more
detail on the issue rather than returning a mysterious true/false
M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto
Undo SERVER_CRASH_HANDLE_RIT2. Bad idea (from HBASE-20173)
M
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
Have expireServer return true if it actually queued an expiration. Used
later in this patch.
M
hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
Hide methods that shouldn't be public. Add a particular check used
in unassign procedure failure processing.
M
hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/MoveRegionProcedure.java
Check that the server we're to move from is actually online (might
catch a few silly move requests early).
M
hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
Add doc on ServerState. It wasn't really being used. Now we actually
stamp a server OFFLINE after its WALs have been split, which means it
is safe to assign since all WALs have been processed. Add methods to
update SPLITTING and to set OFFLINE after splitting is done.
M
hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
Change logging to the new style and make it less repetitive.
Cater to the new way .addOperationToNode reports failure (exceptions
rather than true/false).
M
hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
Add handling for the case where the unassign failed AND we should not
suspend, because we would never be woken up: the SCP is already
beyond doing this for all stuck RPCs.
Some cleanup of the failure processing, grouping the cases where we
can proceed. TODOs have been handled in this refactor, including the
TODO that wonders if it is possible that there are concurrent fails
coming in (Yes).
M
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
Doc, plus removal of the old HBASE-20173 'fix'.
Also update the ServerStateNode post WAL splitting so it gets marked
OFFLINE.
A
hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestServerCrashProcedureStuck.java
Nice test by Duo Zhang.
{code}
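A note on the shape of the dispatcher change: the three new exception
specializations let callers tell apart why a dispatch failed. Below is a
minimal, standalone sketch of the reworked addOperationToNode contract. The
exception names are the ones the patch adds; the bodies, the BufferNode
stand-in, and the use of String for the server key (the real dispatcher is
generic over TRemote) are illustrative assumptions only.
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch only; not the committed HBase code.
public class DispatchSketch {
  static class FailedRemoteDispatchException extends Exception {
    FailedRemoteDispatchException(String msg) { super(msg); }
  }
  static class NullTargetServerDispatchException extends FailedRemoteDispatchException {
    NullTargetServerDispatchException(String msg) { super(msg); }
  }
  static class NoServerDispatchException extends FailedRemoteDispatchException {
    NoServerDispatchException(String msg) { super(msg); }
  }
  static class NoNodeDispatchException extends FailedRemoteDispatchException {
    NoNodeDispatchException(String msg) { super(msg); }
  }

  // Stand-in for the dispatcher's per-server buffer of pending operations.
  static class BufferNode {
    final List<String> operations = new ArrayList<>();
  }

  private final ConcurrentMap<String, BufferNode> nodeMap = new ConcurrentHashMap<>();

  // Before the patch this returned a bare true/false; now the caller learns
  // exactly why an operation could not be queued for dispatch.
  void addOperationToNode(String serverName, String operation)
      throws NullTargetServerDispatchException, NoServerDispatchException,
      NoNodeDispatchException {
    if (serverName == null) {
      // The target server was never set; stale state or a caller bug.
      throw new NullTargetServerDispatchException(operation);
    }
    BufferNode node = nodeMap.get(serverName);
    if (node == null) {
      // The server has been expired and its node removed; it will not return.
      throw new NoServerDispatchException(serverName + "; " + operation);
    }
    node.operations.add(operation);
    if (!nodeMap.containsKey(serverName)) {
      // The server was removed concurrently; the ConcurrentMap doubles as a
      // guard against racing with server expiration.
      throw new NoNodeDispatchException(serverName + "; " + operation);
    }
  }
}
{code}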
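The ServerState update is the hinge of the fix: OFFLINE now means "WALs
split, safe to assign". A sketch of that lifecycle under assumed names
(markSplitting, markOffline, and isDeadAndProcessed are hypothetical; the
patch adds equivalents in RegionStates and AssignmentManager):
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch only; method names are stand-ins.
public class ServerStateSketch {
  enum ServerState {
    ONLINE,    // Server is live and can host regions.
    SPLITTING, // Server crashed; the SCP is still splitting its WALs.
    OFFLINE    // WALs are split; every region it hosted is safe to assign.
  }

  private final ConcurrentMap<String, ServerState> states = new ConcurrentHashMap<>();

  void markSplitting(String serverName) {
    states.put(serverName, ServerState.SPLITTING);
  }

  // Called by the SCP once all WALs for the crashed server are split.
  void markOffline(String serverName) {
    states.put(serverName, ServerState.OFFLINE);
  }

  // The check the unassign failure processing leans on: if the dead server is
  // already OFFLINE, no RPC against it will ever complete or be woken.
  boolean isDeadAndProcessed(String serverName) {
    return states.get(serverName) == ServerState.OFFLINE;
  }
}
{code}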
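Putting the pieces together, here is a hedged sketch of the decision
UnassignProcedure now makes when its close RPC fails against a dead server.
All types and method names are stand-ins; only the shape of the logic
follows the description above, including the boolean-returning expireServer.
{code:java}
// Illustrative sketch only; not the committed UnassignProcedure code.
public class UnassignFailureSketch {
  interface Master {
    // True once the SCP has split the dead server's WALs (ServerState.OFFLINE).
    boolean isServerOffline(String serverName);
    // Patch change: returns whether an expiration was actually queued.
    boolean expireServer(String serverName);
  }

  enum Outcome { FINISH, SUSPEND }

  Outcome onRemoteCloseFailed(Master master, String deadServer) {
    if (master.isServerOffline(deadServer)) {
      // The SCP is already past releasing stuck RPCs for this server; a
      // suspended unassign would never be woken. Finish instead: the region
      // is closed by virtue of its server being dead and fully processed.
      return Outcome.FINISH;
    }
    // Otherwise make sure an expiration is queued; if expireServer returns
    // false, one is already in flight and the SCP will wake us when it runs.
    master.expireServer(deadServer);
    return Outcome.SUSPEND;
  }
}
{code}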
> Reopen region while server crash can cause the procedure to be stuck
> --------------------------------------------------------------------
>
> Key: HBASE-20634
> URL: https://issues.apache.org/jira/browse/HBASE-20634
> Project: HBase
> Issue Type: Bug
> Reporter: Duo Zhang
> Assignee: stack
> Priority: Critical
> Fix For: 3.0.0, 2.1.0, 2.0.1
>
> Attachments: HBASE-20634-UT.patch, HBASE-20634.branch-2.0.001.patch
>
>
> Found this when implementing HBASE-20424, where we will transit the peer sync
> replication state while there is a server crash.
> The problem is that, in ServerCrashAssign, we do not have the region lock, so
> it is possible that after we call handleRIT to clear the existing
> assign/unassign procedures related to this rs, and before we schedule the
> assign procedures, we schedule an unassign procedure for a region on the
> crashed rs. This procedure will not receive the ServerCrashException;
> instead, in addToRemoteDispatcher, it will find that it cannot dispatch the
> remote call, and a FailedRemoteDispatchException will be raised. But we do
> not treat this exception the same as ServerCrashException; instead, we will
> try to expire the rs. Obviously the rs has already been marked as expired, so
> this is almost a no-op, and the procedure will be stuck there forever.
> A possible way to fix it is to treat FailedRemoteDispatchException the same
> as ServerCrashException, as it is created in addToRemoteDispatcher only, and
> the only reason we cannot dispatch a remote call is that the rs is already
> dead. The nodeMap is a ConcurrentMap so I think we could use it as a guard.
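A minimal sketch of the handling proposed above, treating a failed dispatch
the same as a server crash instead of re-expiring an already-dead rs. The
exception names match those discussed here; the surrounding method and
helpers are stand-ins.
{code:java}
// Illustrative sketch only; not the committed failure-handling code.
public class RemoteCallFailureSketch {
  static class ServerCrashException extends java.io.IOException {}
  static class FailedRemoteDispatchException extends java.io.IOException {}

  void remoteCallFailed(java.io.IOException e) {
    if (e instanceof ServerCrashException || e instanceof FailedRemoteDispatchException) {
      // Both mean the target rs is dead: FailedRemoteDispatchException is only
      // raised in addToRemoteDispatcher, and the only reason a dispatch fails
      // is that the rs has already been removed from the nodeMap. Defer to the
      // ServerCrashProcedure rather than expiring an already-expired server.
      deferToServerCrashProcedure();
    } else {
      retryOrFail(e);
    }
  }

  void deferToServerCrashProcedure() { /* hand the region back to the SCP */ }
  void retryOrFail(java.io.IOException e) { /* all other failures */ }
}
{code}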