[
https://issues.apache.org/jira/browse/HBASE-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847004#comment-17847004
]
Duo Zhang commented on HBASE-28522:
-----------------------------------
The PR for HBASE-28582 is ready. I also created a test for it to show that it
will wait for RIT to finish before continuing. PTAL.
In general, we can use the same solution for DisableTableProcedure, but we need
to change holdLock from true to false, which makes me a bit uncomfortable.
I will try to see if we can do something like draining before starting to schedule
TRSP in DisableTableProcedure, so we can still keep holdLock as true in later
processing to simplify the logic. If not, let's use the same solution as
in HBASE-28582, i.e., introduce a special CloseTableRegionsProcedure to close
all regions for a table.
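A minimal sketch of the draining idea, written as a toy model in Java. Every name here (DrainBeforeCloseSketch, executeStep, transitionFinished) is hypothetical and is not an HBase class or API; the real DisableTableProcedure logic may look quite different. It only illustrates the ordering: do not schedule any close while a region of the table is still in transition.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: none of these types are HBase classes. */
public class DrainBeforeCloseSketch {
    enum Step { WAIT_FOR_RIT, CLOSE_SCHEDULED }

    // Regions of the table that still have an in-flight transition (RIT).
    private final Set<String> regionsInTransition = new HashSet<>();
    // All regions of the table that must eventually be closed.
    private final List<String> openRegions = new ArrayList<>();
    // Regions for which a close has been scheduled.
    private final List<String> scheduledCloses = new ArrayList<>();

    DrainBeforeCloseSketch(List<String> open, Set<String> rit) {
        openRegions.addAll(open);
        regionsInTransition.addAll(rit);
    }

    /** Called when an in-flight transition (for example an SCP-driven ASSIGN) finishes. */
    void transitionFinished(String region) {
        regionsInTransition.remove(region);
    }

    /** One step: keep waiting while anything is in transition, otherwise schedule the closes once. */
    Step executeStep() {
        if (!regionsInTransition.isEmpty()) {
            return Step.WAIT_FOR_RIT;
        }
        if (scheduledCloses.isEmpty()) {
            scheduledCloses.addAll(openRegions);
        }
        return Step.CLOSE_SCHEDULED;
    }

    List<String> scheduledCloses() {
        return scheduledCloses;
    }

    public static void main(String[] args) {
        DrainBeforeCloseSketch proc =
            new DrainBeforeCloseSketch(List.of("r1", "r2"), Set.of("r2"));
        System.out.println(proc.executeStep());   // r2 is still in transition
        proc.transitionFinished("r2");
        System.out.println(proc.executeStep());   // drained, closes get scheduled
        System.out.println(proc.scheduledCloses());
    }
}
```

In this model the procedure never schedules a close that could race with an unfinished ASSIGN, which is the property the draining step is meant to give.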
Thanks.
> UNASSIGN proc indefinitely stuck on dead rs
> -------------------------------------------
>
> Key: HBASE-28522
> URL: https://issues.apache.org/jira/browse/HBASE-28522
> Project: HBase
> Issue Type: Improvement
> Components: proc-v2, Region Assignment
> Reporter: Prathyusha
> Assignee: Prathyusha
> Priority: Critical
> Attachments: timeline.jpg
>
>
> One scenario we noticed in production:
> DisableTableProc and SCP were triggered at almost the same time
> 2024-03-16 17:59:23,014 INFO [PEWorker-11] procedure.DisableTableProcedure -
> Set <TABLE_NAME> to state=DISABLING
> 2024-03-16 17:59:15,243 INFO [PEWorker-26] procedure.ServerCrashProcedure -
> Start pid=21592440, state=RUNNABLE:SERVER_CRASH_START, locked=true;
> ServerCrashProcedure
> <regionserver>, splitWal=true, meta=false
> DisableTableProc creates unassign procs, and at this time the ASSIGNs of SCP are
> not completed
> {{2024-03-16 17:59:23,003 DEBUG [PEWorker-40] procedure2.ProcedureExecutor -
> LOCK_EVENT_WAIT pid=21594220, ppid=21592440,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>, ASSIGN}}
> The UNASSIGN created by DisableTableProc is stuck on the dead regionserver, and we
> had to manually bypass the unassign of DisableTableProc and then do ASSIGN.
> If we can break the loop so that the UNASSIGN procedure does not retry when there
> is an SCP for that server, would we no longer need manual intervention? At least
> the DisableTableProc could go to a rollback state?