[ 
https://issues.apache.org/jira/browse/HBASE-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844875#comment-17844875
 ] 

Duo Zhang commented on HBASE-28522:
-----------------------------------

After checking the code, I think a possible race is the following: in SCP, we
have scheduled a TRSP to assign the region, but at the same time
DisableTableProcedure has unset that procedure and scheduled a new one. The new
TRSP cannot actually be executed because the target server is already dead, so
the unassign will fail and wait for SCP to interrupt it. Meanwhile, SCP is
waiting for the old TRSP (the one scheduled by SCP) to finish, so it has no
chance to interrupt the new TRSP (and even if it can execute, in the current
logic it will just finish itself without interrupting any other TRSPs).
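
The circular wait described above can be modelled as a small wait-for graph.
This is only a toy sketch: the strings are labels, not HBase classes, and
WaitCycleSketch, addWait, isStuck and describedRace are hypothetical names. Two
of the four edges are assumptions rather than statements from this comment: that
the old TRSP is parked behind the table xlock held by DisableTableProcedure, and
that DisableTableProcedure waits for its UNASSIGN child.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: the strings are labels for the procedures discussed above.
final class WaitCycleSketch {
  // waiter -> what it is currently waiting for (one target per waiter in this model)
  private final Map<String, String> waitsFor = new HashMap<>();

  void addWait(String waiter, String target) {
    waitsFor.put(waiter, target);
  }

  // Walks the wait chain from 'start' and reports whether the chain loops,
  // in which case nothing on the chain can ever unblock 'start'.
  boolean isStuck(String start) {
    Set<String> seen = new HashSet<>();
    String current = start;
    while (current != null) {
      if (!seen.add(current)) {
        return true; // reached a node that is already on the chain
      }
      current = waitsFor.get(current);
    }
    return false; // chain ended at something that is not waiting
  }

  // Builds the wait-for edges of the race described above.
  static WaitCycleSketch describedRace() {
    WaitCycleSketch g = new WaitCycleSketch();
    // From the comment: SCP waits for the TRSP it scheduled.
    g.addWait("SCP", "old ASSIGN TRSP");
    // From the comment: the server is dead, so only SCP could interrupt the new TRSP.
    g.addWait("new UNASSIGN TRSP", "SCP");
    // Assumption: the old TRSP cannot run while the table xlock is held.
    g.addWait("old ASSIGN TRSP", "DisableTableProcedure");
    // Assumption: DisableTableProcedure waits for its UNASSIGN child.
    g.addWait("DisableTableProcedure", "new UNASSIGN TRSP");
    return g;
  }

  public static void main(String[] args) {
    WaitCycleSketch g = describedRace();
    System.out.println("SCP stuck: " + g.isStuck("SCP"));
  }
}
```

Under those assumptions every participant sits on the same loop, which would
explain why none of them makes progress without a manual bypass.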

I think the root cause is here, in forceCreateUnssignProcedure:

{code}
    regionNode.lock();
    try {
      if (regionNode.isInState(State.OFFLINE, State.CLOSED, State.SPLIT)) {
        return null;
      }
      // in general, a split parent should be in CLOSED or SPLIT state, but anyway, let's check it
      // here for safety
      if (regionNode.getRegionInfo().isSplit()) {
        LOG.warn("{} is a split parent but not in CLOSED or SPLIT state", regionNode);
        return null;
      }
      // As in DisableTableProcedure or ModifyTableProcedure, we will hold the xlock for table, so
      // we can make sure that this procedure has not been executed yet, as TRSP will hold the
      // shared lock for table all the time. So here we will unset it and when it is actually
      // executed, it will find that the attach procedure is not itself and quit immediately.
      if (regionNode.getProcedure() != null) {
        regionNode.unsetProcedure(regionNode.getProcedure());
      }
      return regionNode.setProcedure(TransitRegionStateProcedure.unassign(getProcedureEnvironment(),
        regionNode.getRegionInfo()));
    } finally {
      regionNode.unlock();
    }
{code}

We should reuse the same TRSP, instead of unsetting it and scheduling a new one.

Let me think if this is possible.
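
To make the difference between the two options concrete, here is a minimal
sketch using toy stand-ins. ToyRegionNode, ToyProcedure, Intent and both method
names are hypothetical and are not HBase APIs. Whether a real TRSP can be
retargeted from ASSIGN to UNASSIGN like this is exactly the open question, so
this only illustrates the intent, not an implementation.

```java
import java.util.concurrent.locks.ReentrantLock;

// Toy stand-ins; none of these are HBase classes.
final class ReuseProcedureSketch {
  enum Intent { ASSIGN, UNASSIGN }

  static final class ToyProcedure {
    private Intent intent;

    ToyProcedure(Intent intent) {
      this.intent = intent;
    }

    Intent getIntent() {
      return intent;
    }

    void setIntent(Intent intent) {
      this.intent = intent;
    }
  }

  static final class ToyRegionNode {
    private final ReentrantLock lock = new ReentrantLock();
    private ToyProcedure attached;

    void attach(ToyProcedure procedure) {
      lock.lock();
      try {
        attached = procedure;
      } finally {
        lock.unlock();
      }
    }

    // Mirrors the quoted logic: drop whatever is attached and attach a new UNASSIGN.
    ToyProcedure replaceWithUnassign() {
      lock.lock();
      try {
        attached = new ToyProcedure(Intent.UNASSIGN);
        return attached;
      } finally {
        lock.unlock();
      }
    }

    // Proposed direction: keep the attached procedure and only change its goal,
    // so whoever waits on that procedure still waits on the one that will run.
    ToyProcedure reuseForUnassign() {
      lock.lock();
      try {
        if (attached == null) {
          attached = new ToyProcedure(Intent.UNASSIGN);
        } else {
          attached.setIntent(Intent.UNASSIGN);
        }
        return attached;
      } finally {
        lock.unlock();
      }
    }
  }

  public static void main(String[] args) {
    ToyRegionNode node = new ToyRegionNode();
    ToyProcedure scheduledByScp = new ToyProcedure(Intent.ASSIGN);
    node.attach(scheduledByScp);
    ToyProcedure result = node.reuseForUnassign();
    System.out.println("same procedure reused: " + (result == scheduledByScp));
    System.out.println("intent now: " + result.getIntent());
  }
}
```

In the toy model, the procedure SCP is waiting on is the same object that ends
up attached to the region node, so there is no orphaned procedure for SCP to
wait on.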

> UNASSIGN proc indefinitely stuck on dead rs
> -------------------------------------------
>
>                 Key: HBASE-28522
>                 URL: https://issues.apache.org/jira/browse/HBASE-28522
>             Project: HBase
>          Issue Type: Improvement
>          Components: proc-v2
>            Reporter: Prathyusha
>            Assignee: Prathyusha
>            Priority: Minor
>
> One scenario we noticed in production:
> DisableTableProc and SCP were triggered at almost the same time.
> 2024-03-16 17:59:23,014 INFO [PEWorker-11] procedure.DisableTableProcedure - 
> Set <TABLE_NAME> to state=DISABLING
> 2024-03-16 17:59:15,243 INFO [PEWorker-26] procedure.ServerCrashProcedure - 
> Start pid=21592440, state=RUNNABLE:SERVER_CRASH_START, locked=true; 
> ServerCrashProcedure 
> <regionserver>, splitWal=true, meta=false
> DisableTableProc creates unassign procs, and at this time the ASSIGNs of SCP
> are not completed:
> {{2024-03-16 17:59:23,003 DEBUG [PEWorker-40] procedure2.ProcedureExecutor - 
> LOCK_EVENT_WAIT pid=21594220, ppid=21592440, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>, ASSIGN}}
> The UNASSIGN created by DisableTableProc is stuck on the dead regionserver,
> and we had to manually bypass the unassign of DisableTableProc and then do
> the ASSIGN.
> If we can break the loop so that the UNASSIGN procedure does not retry when
> there is an SCP for that server, we would not need manual intervention. Or at
> least the DisableTableProc could go to a rollback state?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
