[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589799#comment-16589799 ]

Duo Zhang commented on HBASE-21095:
-----------------------------------

OK, this is a retry problem. It is rare but can happen. This is the code snippet where we interrupt the TRSP:

{code:title=TransitRegionStateProcedure.java}
  public void serverCrashed(MasterProcedureEnv env, RegionStateNode regionNode,
      ServerName serverName) throws IOException {
    // Notice that, in this method, we do not change the procedure state, instead, we update the
    // region state in hbase:meta. This is because that, the procedure state change will not be
    // persisted until the region is woken up and finish one step, if we crash before that then the
    // information will be lost. So here we will update the region state in hbase:meta, and when the
    // procedure is woken up, it will process the error and jump to the correct procedure state.
    RegionStateTransitionState currentState = getCurrentState();
    LOG.info("=============" + currentState + " : " + serverName + " : " + regionNode);
    switch (currentState) {
      case REGION_STATE_TRANSITION_CLOSE:
      case REGION_STATE_TRANSITION_CONFIRM_CLOSED:
      case REGION_STATE_TRANSITION_CONFIRM_OPENED:
        // for these 3 states, the region may still be online on the crashed server
        if (serverName.equals(regionNode.getRegionLocation())) {
          env.getAssignmentManager().regionClosed(regionNode, false);
          if (currentState != RegionStateTransitionState.REGION_STATE_TRANSITION_CLOSE) {
            regionNode.getProcedureEvent().wake(env.getProcedureScheduler());
          }
        }
        break;
      default:
        // If the procedure is in other 2 states, then actually we should not arrive here, as we
        // know that the region is not online on any server, so we need to do nothing... But anyway
        // let's add a log here
        LOG.warn("{} received unexpected server crash call for region {} from {}", this, regionNode,
          serverName);

    }
  }
{code}

This is the first time we call this method. Unfortunately we are shutting down the HMaster at the same time, so we fail to update meta:
{noformat}
2018-08-23 14:50:00,173 DEBUG [M:0;zhangduo-ubuntu:37875] master.ActiveMasterManager(280): master:37875-0x165658bc1370000, quorum=localhost:58702, baseZNode=/hbase Failed delete of our master address node; KeeperErrorCode = NoNode for /hbase/master
2018-08-23 14:50:00,174 INFO  [PEWorker-10] procedure.ServerCrashProcedure(369): pid=13, state=RUNNABLE:SERVER_CRASH_ASSIGN, hasLock=true; ServerCrashProcedure server=zhangduo-ubuntu,39413,1535006984122, splitWal=true, meta=false found RIT pid=14, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true; TransitRegionStateProcedure table=Backoff, region=61ec91e69157b1abb3244ae089774b15, REOPEN/MOVE; rit=CLOSING, location=zhangduo-ubuntu,39413,1535006984122, table=Backoff, region=61ec91e69157b1abb3244ae089774b15
2018-08-23 14:50:00,174 INFO  [PEWorker-10] assignment.TransitRegionStateProcedure(439): =============REGION_STATE_TRANSITION_CLOSE : zhangduo-ubuntu,39413,1535006984122 : rit=CLOSING, location=zhangduo-ubuntu,39413,1535006984122, table=Backoff, region=61ec91e69157b1abb3244ae089774b15
2018-08-23 14:50:00,174 INFO  [PEWorker-10] assignment.RegionStateStore(199): pid=14 updating hbase:meta row=61ec91e69157b1abb3244ae089774b15, regionState=ABNORMALLY_CLOSED
2018-08-23 14:50:00,174 ERROR [PEWorker-10] assignment.RegionStateStore(212): FAILED persisting region=61ec91e69157b1abb3244ae089774b15 state=ABNORMALLY_CLOSED
org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x1c924321 closed
        at org.apache.hadoop.hbase.client.ConnectionImplementation.checkClosed(ConnectionImplementation.java:559)
        at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:744)
        at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:1)
        at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:738)
        at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:1)
        at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:709)
        at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:1)
        at org.apache.hadoop.hbase.client.ConnectionImplementation.getRegionLocation(ConnectionImplementation.java:587)
        at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.getRegionLocation(ConnectionUtils.java:1)
        at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:72)
        at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:223)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:557)
        at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:206)
        at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateUserRegionLocation(RegionStateStore.java:200)
        at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:138)
        at org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionClosed(AssignmentManager.java:1489)
        at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:446)
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:370)
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:172)
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:1)
        at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
        at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:873)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1577)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1365)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$7(ProcedureExecutor.java:1303)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1877)
{noformat}

And this is the second time we call this method:
{noformat}
2018-08-23 14:50:00,186 INFO  [PEWorker-10] assignment.TransitRegionStateProcedure(439): =============REGION_STATE_TRANSITION_CLOSE : zhangduo-ubuntu,39413,1535006984122 : rit=ABNORMALLY_CLOSED, location=null, table=Backoff, region=61ec91e69157b1abb3244ae089774b15
{noformat}

You can see that the region location is null! So the serverName.equals(regionNode.getRegionLocation()) check fails, we skip updating meta, and the TRSP will retry forever...

I think the problem here is that we need to update the regionNode before updating meta, but if we fail to update meta, we need to revert the changes to the regionNode. Will prepare a patch soon.


> The timeout retry logic for several procedures are broken after master restarts
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-21095
>                 URL: https://issues.apache.org/jira/browse/HBASE-21095
>             Project: HBase
>          Issue Type: Sub-task
>          Components: amv2, proc-v2
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Critical
>             Fix For: 3.0.0, 2.2.0, 2.1.1, 2.0.2
>
>         Attachments: HBASE-21095-branch-2.0.patch, HBASE-21095.patch
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or 
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and 
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after 
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add 
> the procedure to the ProcedureEvent's suspending queue, so we will hang there 
> forever as no one will wake us up.


