[
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589799#comment-16589799
]
Duo Zhang commented on HBASE-21095:
-----------------------------------
OK, this is a retry problem. It is rare but could happen. This is the code
snippet where we interrupt the TRSP:
{code:title=TransitRegionStateProcedure.java}
  public void serverCrashed(MasterProcedureEnv env, RegionStateNode regionNode,
      ServerName serverName) throws IOException {
    // Notice that, in this method, we do not change the procedure state, instead, we update the
    // region state in hbase:meta. This is because that, the procedure state change will not be
    // persisted until the region is woken up and finish one step, if we crash before that then the
    // information will be lost. So here we will update the region state in hbase:meta, and when the
    // procedure is woken up, it will process the error and jump to the correct procedure state.
    RegionStateTransitionState currentState = getCurrentState();
    LOG.info("=============" + currentState + " : " + serverName + " : " + regionNode);
    switch (currentState) {
      case REGION_STATE_TRANSITION_CLOSE:
      case REGION_STATE_TRANSITION_CONFIRM_CLOSED:
      case REGION_STATE_TRANSITION_CONFIRM_OPENED:
        // for these 3 states, the region may still be online on the crashed server
        if (serverName.equals(regionNode.getRegionLocation())) {
          env.getAssignmentManager().regionClosed(regionNode, false);
          if (currentState != RegionStateTransitionState.REGION_STATE_TRANSITION_CLOSE) {
            regionNode.getProcedureEvent().wake(env.getProcedureScheduler());
          }
        }
        break;
      default:
        // If the procedure is in other 2 states, then actually we should not arrive here, as we
        // know that the region is not online on any server, so we need to do nothing... But anyway
        // let's add a log here
        LOG.warn("{} received unexpected server crash call for region {} from {}", this, regionNode,
          serverName);
    }
  }
{code}
This is the first time we call this method, and unfortunately we are shutting
down the HMaster at the same time, so the hbase:meta update fails.
{noformat}
2018-08-23 14:50:00,173 DEBUG [M:0;zhangduo-ubuntu:37875] master.ActiveMasterManager(280): master:37875-0x165658bc1370000, quorum=localhost:58702, baseZNode=/hbase Failed delete of our master address node; KeeperErrorCode = NoNode for /hbase/master
2018-08-23 14:50:00,174 INFO [PEWorker-10] procedure.ServerCrashProcedure(369): pid=13, state=RUNNABLE:SERVER_CRASH_ASSIGN, hasLock=true; ServerCrashProcedure server=zhangduo-ubuntu,39413,1535006984122, splitWal=true, meta=false found RIT pid=14, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true; TransitRegionStateProcedure table=Backoff, region=61ec91e69157b1abb3244ae089774b15, REOPEN/MOVE; rit=CLOSING, location=zhangduo-ubuntu,39413,1535006984122, table=Backoff, region=61ec91e69157b1abb3244ae089774b15
2018-08-23 14:50:00,174 INFO [PEWorker-10] assignment.TransitRegionStateProcedure(439): =============REGION_STATE_TRANSITION_CLOSE : zhangduo-ubuntu,39413,1535006984122 : rit=CLOSING, location=zhangduo-ubuntu,39413,1535006984122, table=Backoff, region=61ec91e69157b1abb3244ae089774b15
2018-08-23 14:50:00,174 INFO [PEWorker-10] assignment.RegionStateStore(199): pid=14 updating hbase:meta row=61ec91e69157b1abb3244ae089774b15, regionState=ABNORMALLY_CLOSED
2018-08-23 14:50:00,174 ERROR [PEWorker-10] assignment.RegionStateStore(212): FAILED persisting region=61ec91e69157b1abb3244ae089774b15 state=ABNORMALLY_CLOSED
org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x1c924321 closed
    at org.apache.hadoop.hbase.client.ConnectionImplementation.checkClosed(ConnectionImplementation.java:559)
    at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:744)
    at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:1)
    at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:738)
    at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:1)
    at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:709)
    at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:1)
    at org.apache.hadoop.hbase.client.ConnectionImplementation.getRegionLocation(ConnectionImplementation.java:587)
    at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.getRegionLocation(ConnectionUtils.java:1)
    at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:72)
    at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:223)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
    at org.apache.hadoop.hbase.client.HTable.put(HTable.java:557)
    at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:206)
    at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateUserRegionLocation(RegionStateStore.java:200)
    at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:138)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionClosed(AssignmentManager.java:1489)
    at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:446)
    at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:370)
    at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:172)
    at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:1)
    at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:873)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1577)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1365)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$7(ProcedureExecutor.java:1303)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1877)
{noformat}
And this is the second time we call this method:
{noformat}
2018-08-23 14:50:00,186 INFO [PEWorker-10] assignment.TransitRegionStateProcedure(439): =============REGION_STATE_TRANSITION_CLOSE : zhangduo-ubuntu,39413,1535006984122 : rit=ABNORMALLY_CLOSED, location=null, table=Backoff, region=61ec91e69157b1abb3244ae089774b15
{noformat}
You can see that the region location is null! So we will skip updating meta,
which causes the TRSP to retry forever...
I think the problem here is that we need to update the regionNode before
updating meta, but if we fail to update meta, we need to revert the changes
to the regionNode. Will prepare a patch soon.
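The revert idea can be sketched as follows. This is only a toy model of the pattern (none of these types are real HBase classes — {{RegionNodeModel}}, {{RegionStateStoreModel}}, and {{AssignmentManagerModel}} are invented names, and an unchecked exception stands in for the {{DoNotRetryIOException}} above), not the actual patch:
{code:title=RevertSketch.java}
// Hypothetical sketch: mutate the in-memory region state, try to persist it
// to meta, and roll the in-memory change back if persisting fails.
class RegionNodeModel {
    String location;          // null means "not open on any server"
    String state = "CLOSING";
}

class RegionStateStoreModel {
    boolean failPersist;      // simulate the HMaster shutting down mid-update

    void persist(RegionNodeModel node) {
        if (failPersist) {
            throw new IllegalStateException("hconnection closed");
        }
    }
}

class AssignmentManagerModel {
    final RegionStateStoreModel store;

    AssignmentManagerModel(RegionStateStoreModel store) {
        this.store = store;
    }

    // Apply the in-memory transition first, then persist to meta; on failure,
    // restore the old location and state so a retry of serverCrashed() still
    // sees the crashed server as the region location and does not skip the
    // meta update.
    void regionClosed(RegionNodeModel node) {
        String oldLocation = node.location;
        String oldState = node.state;
        node.location = null;
        node.state = "ABNORMALLY_CLOSED";
        try {
            store.persist(node);
        } catch (RuntimeException e) {
            node.location = oldLocation;
            node.state = oldState;
            throw e;
        }
    }
}
{code}
With the revert in place, a failed persist leaves the regionNode pointing at the crashed server, so the second serverCrashed() call can redo the update instead of seeing location=null and doing nothing.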
> The timeout retry logic for several procedures are broken after master
> restarts
> -------------------------------------------------------------------------------
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
> Issue Type: Sub-task
> Components: amv2, proc-v2
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Priority: Critical
> Fix For: 3.0.0, 2.2.0, 2.1.1, 2.0.2
>
> Attachments: HBASE-21095-branch-2.0.patch, HBASE-21095.patch
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add
> the procedure to the ProcedureEvent's suspending queue, so we will hang there
> forever as no one will wake us up.
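The restart hang described above boils down to a simple invariant: waking a ProcedureEvent only reschedules procedures that are actually in its suspend queue. A minimal toy model (invented names, not the real procedure-v2 API) of that invariant:
{code:title=SuspendQueueSketch.java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model of a ProcedureEvent: a wake-up can only reschedule procedures
// that were re-added to the suspend queue. After a master restart, a
// procedure restored in WAITING_TIMEOUT that is never re-suspended on the
// event will never be woken.
class ProcedureEventModel {
    private final Queue<Integer> suspended = new ArrayDeque<>();

    void suspend(int procId) {
        suspended.add(procId);
    }

    // Returns the ids of the procedures this wake-up reschedules.
    List<Integer> wake() {
        List<Integer> woken = new ArrayList<>(suspended);
        suspended.clear();
        return woken;
    }
}
{code}
Before the restart, pid=14 is in the queue and a wake() reschedules it; after the restart the queue is empty, so wake() reschedules nothing and the procedure sits in WAITING_TIMEOUT forever.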
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)