[ https://issues.apache.org/jira/browse/HBASE-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845087#comment-17845087 ]
Prathyusha commented on HBASE-28522:
------------------------------------
The problem is that SCP currently interrupts only the procedures that are
already set in RegionStateNodes while its for loop is in progress.
It will need to do more than that, because this doesn’t fully capture the set
of all procedures that need to be interrupted.
[~apurtell] [~vjasani] [~zhangduo]
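
To make the gap concrete, here is a toy sketch of the race as I understand it.
None of these classes are HBase classes: RegionNode, attachedProc and
interruptAttached are hypothetical stand-ins, and the "late" procedure is
attached by hand to simulate an UNASSIGN that is set on a node after the loop
has already visited it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only; these are NOT HBase classes. */
class ScpInterruptRaceSketch {

    /** Hypothetical stand-in for a region's state node. */
    static final class RegionNode {
        final String region;
        String attachedProc; // null until some procedure is set on the node

        RegionNode(String region) {
            this.region = region;
        }
    }

    /**
     * Visits each node once and records the procedure attached at that moment.
     * lateNode/lateProc simulate a procedure that is set on a node right after
     * the loop has visited it, so this pass never sees it.
     */
    static List<String> interruptAttached(List<RegionNode> nodes,
                                          RegionNode lateNode, String lateProc) {
        List<String> interrupted = new ArrayList<>();
        for (RegionNode node : nodes) {
            if (node.attachedProc != null) {
                interrupted.add(node.attachedProc);
            }
            if (node == lateNode) {
                // Simulated concurrent attach, after this node was checked.
                node.attachedProc = lateProc;
            }
        }
        return interrupted;
    }

    public static void main(String[] args) {
        RegionNode r1 = new RegionNode("r1");
        r1.attachedProc = "ASSIGN-r1";
        RegionNode r2 = new RegionNode("r2");
        List<String> hit = interruptAttached(Arrays.asList(r1, r2), r2, "UNASSIGN-r2");
        System.out.println("interrupted=" + hit + ", missed=" + r2.attachedProc);
    }
}
```

In this model the procedure attached to r2 is never interrupted, which is the
kind of leftover procedure described above.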
And once the above happens, we had to bypass the stuck UNASSIGN proc on the
dead regionserver. At that point the table state was DISABLING, so we had to
mark it DISABLED manually. But then the table still had a couple of regions in
OPEN state! And then we had to drop the table manually as well.
So once this hits, a lot of manual steps are required to bring it back to the
original state. Do you think we can add something to make it better?
> UNASSIGN proc indefinitely stuck on dead rs
> -------------------------------------------
>
> Key: HBASE-28522
> URL: https://issues.apache.org/jira/browse/HBASE-28522
> Project: HBase
> Issue Type: Improvement
> Components: proc-v2, Region Assignment
> Reporter: Prathyusha
> Assignee: Prathyusha
> Priority: Critical
> Attachments: timeline.jpg
>
>
> One scenario we noticed in production:
> DisableTableProc and SCP were triggered at almost the same time
> 2024-03-16 17:59:23,014 INFO [PEWorker-11] procedure.DisableTableProcedure -
> Set <TABLE_NAME> to state=DISABLING
> 2024-03-16 17:59:15,243 INFO [PEWorker-26] procedure.ServerCrashProcedure -
> Start pid=21592440, state=RUNNABLE:SERVER_CRASH_START, locked=true;
> ServerCrashProcedure
> <regionserver>, splitWal=true, meta=false
> DisableTableProc creates UNASSIGN procs, and at this time the ASSIGNs of the
> SCP have not completed
> {{2024-03-16 17:59:23,003 DEBUG [PEWorker-40] procedure2.ProcedureExecutor -
> LOCK_EVENT_WAIT pid=21594220, ppid=21592440,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>, ASSIGN}}
> The UNASSIGN created by DisableTableProc is stuck on the dead regionserver,
> and we had to manually bypass the UNASSIGN of DisableTableProc and then do an
> ASSIGN.
> If we can break the loop so that the UNASSIGN procedure does not retry when
> there is an SCP for that server, would we still need manual intervention? At
> least the DisableTableProc could go to a rollback state?
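
The "break the loop" idea in the description could look roughly like the
sketch below. This is only an illustration under assumptions: the class,
Outcome, CloseAttempt, unassignWithGuard and the serversUnderScp set are all
hypothetical names, not existing HBase APIs, and the real change would have to
live in the procedure framework rather than in a plain loop.

```java
import java.util.Set;

/** Toy model only; these are NOT HBase classes. */
class UnassignRetryGuardSketch {

    enum Outcome { CLOSED, GAVE_UP_SERVER_CRASHED, RETRIES_EXHAUSTED }

    /** Hypothetical stand-in for one attempt to close a region on a server. */
    interface CloseAttempt {
        boolean tryClose(String server);
    }

    /**
     * Retries the close, but checks before every attempt whether the target
     * server is being handled by an SCP. If so, it stops retrying so the
     * caller can roll back or hand the region over, instead of looping on a
     * dead server.
     */
    static Outcome unassignWithGuard(String server, Set<String> serversUnderScp,
                                     CloseAttempt attempt, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (serversUnderScp.contains(server)) {
                return Outcome.GAVE_UP_SERVER_CRASHED;
            }
            if (attempt.tryClose(server)) {
                return Outcome.CLOSED;
            }
        }
        return Outcome.RETRIES_EXHAUSTED;
    }
}
```

The check is done before every attempt, not only once up front, so a server
that dies between retries is also noticed.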
--
This message was sent by Atlassian Jira
(v8.20.10#820010)