[
https://issues.apache.org/jira/browse/HBASE-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845085#comment-17845085
]
Prathyusha edited comment on HBASE-28522 at 5/9/24 5:59 PM:
------------------------------------------------------------
[~zhangduo] yes, this is exactly the condition I was trying to describe
in my comment above (sorry if I was a bit unclear). Below is the sequence
of events that happened; it ended with stuck procedures, and bypass was the
only way out.
fyi [~apurtell] [~vjasani]
!timeline.jpg!
> UNASSIGN proc indefinitely stuck on dead rs
> -------------------------------------------
>
> Key: HBASE-28522
> URL: https://issues.apache.org/jira/browse/HBASE-28522
> Project: HBase
> Issue Type: Improvement
> Components: proc-v2, Region Assignment
> Reporter: Prathyusha
> Assignee: Prathyusha
> Priority: Critical
> Attachments: timeline.jpg
>
>
> One scenario we noticed in production:
> we had DisableTableProc and SCP triggered at almost the same time
> 2024-03-16 17:59:23,014 INFO [PEWorker-11] procedure.DisableTableProcedure -
> Set <TABLE_NAME> to state=DISABLING
> 2024-03-16 17:59:15,243 INFO [PEWorker-26] procedure.ServerCrashProcedure -
> Start pid=21592440, state=RUNNABLE:SERVER_CRASH_START, locked=true;
> ServerCrashProcedure
> <regionserver>, splitWal=true, meta=false
> DisableTableProc creates UNASSIGN procs, and at this point the ASSIGNs of the SCP have
> not completed
> {{2024-03-16 17:59:23,003 DEBUG [PEWorker-40] procedure2.ProcedureExecutor -
> LOCK_EVENT_WAIT pid=21594220, ppid=21592440,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>, ASSIGN}}
> The UNASSIGN created by DisableTableProc is stuck on the dead regionserver, and we
> had to manually bypass the UNASSIGN of DisableTableProc and then do the ASSIGN.
> Could we break the loop so that the UNASSIGN procedure does not retry if there is an SCP
> for that server? Then we would not need manual intervention, or at least
> DisableTableProc could go to a rollback state.
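To make the proposal in the description concrete, here is a minimal, self-contained sketch of the idea of breaking the UNASSIGN retry loop when a crash procedure is already handling the target server. It is a toy model, not HBase code: `CrashTracker`, `Outcome`, `attemptUnassign` and `runUnassign` are hypothetical names invented for illustration and do not correspond to real HBase classes or methods, and the actual fix may look quite different.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class UnassignRetrySketch {

    /** Possible results of one unassign attempt in this toy model. */
    enum Outcome { CLOSED, RETRY, YIELD_TO_CRASH_HANDLING }

    /** Hypothetical stand-in for the master's view of servers with an in-flight crash procedure. */
    static final class CrashTracker {
        private final Set<String> crashing = ConcurrentHashMap.newKeySet();

        void crashStarted(String server) { crashing.add(server); }

        void crashFinished(String server) { crashing.remove(server); }

        boolean isCrashInProgress(String server) { return crashing.contains(server); }
    }

    /** One attempt: succeed, hand over to crash handling, or ask for a retry. */
    static Outcome attemptUnassign(String server, CrashTracker tracker, boolean closeSucceeded) {
        if (closeSucceeded) {
            return Outcome.CLOSED;
        }
        // Assumption: if a crash procedure owns this server, retrying the close
        // can never succeed, so stop retrying instead of looping forever.
        if (tracker.isCrashInProgress(server)) {
            return Outcome.YIELD_TO_CRASH_HANDLING;
        }
        return Outcome.RETRY;
    }

    /** Bounded retry loop against a server that never acknowledges the close. */
    static Outcome runUnassign(String server, CrashTracker tracker, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Outcome outcome = attemptUnassign(server, tracker, false);
            if (outcome != Outcome.RETRY) {
                return outcome;
            }
        }
        return Outcome.RETRY; // still stuck after maxAttempts
    }

    public static void main(String[] args) {
        CrashTracker tracker = new CrashTracker();
        tracker.crashStarted("rs1"); // rs1 has a crash procedure in flight

        System.out.println(runUnassign("rs1", tracker, 5)); // loop is broken
        System.out.println(runUnassign("rs2", tracker, 5)); // no crash procedure: keeps retrying
    }
}
```

In this model the retry loop exits as soon as it sees that the server is being handled by crash processing. What the caller should do next (wait for the SCP, fail the UNASSIGN, or roll back the disable) is left open here, as it is in the description.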
--
This message was sent by Atlassian Jira
(v8.20.10#820010)