[jira] [Comment Edited] (HBASE-28522) UNASSIGN proc indefinitely stuck on dead rs

Prathyusha (Jira) Mon, 27 May 2024 08:27:15 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849757#comment-17849757
 ]


Prathyusha edited comment on HBASE-28522 at 5/27/24 3:25 PM:
-------------------------------------------------------------

[~zhangduo] Even if we introduce a procedure like CloseTableRegionsProcedure 
from HBASE-28582 here, and even if we add logic to wait for the current RITs 
(instead of creating new child UNASSIGNs directly), every TRSP that has just 
started executing will be blocked trying to get the shared lock on the table 
(the DTP holds the exclusive lock), so they won't finish, right? 
Or do you mean that, if we go with this approach
>If not, let's change to use the same solution in HBASE-28582, i.e, introduce a 
>special 
>CloseTableRegionsProcedure, to close all regions for a table.
we set holdLock to false for the table?

An orthogonal thought: can we somehow also add the current RIT TRSPs as child 
procs of this procedure, so that they can get the shared lock on the table? 
CloseTableRegionsProcedure is waiting on them to finish anyway. 
Or, if not child procs, could we add another field, such as dependent 
procedures, whose members also get access to the shared lock on the resources 
it holds?
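
A minimal sketch of the locking question raised above, assuming (as this 
comment does) that a child of the procedure holding the table's exclusive lock 
may take the shared lock. `Proc`, `TableLock`, and the `dependents` field are 
hypothetical names for illustration only; this is a toy model, not HBase's 
scheduler.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Proc:
    """Hypothetical stand-in for a procedure (not an HBase class)."""
    pid: int
    parent: Optional["Proc"] = None

    def has_ancestor(self, pid: int) -> bool:
        # Walk up the parent chain looking for the given pid.
        p = self.parent
        while p is not None:
            if p.pid == pid:
                return True
            p = p.parent
        return False


@dataclass
class TableLock:
    """Toy table lock: one exclusive owner, many shared holders."""
    exclusive_owner: Optional[int] = None
    shared_holders: Set[int] = field(default_factory=set)
    # Hypothetical "dependent procedures" field floated in the comment.
    dependents: Set[int] = field(default_factory=set)

    def try_exclusive(self, proc: Proc) -> bool:
        if self.exclusive_owner is not None or self.shared_holders:
            return False
        self.exclusive_owner = proc.pid
        return True

    def try_shared(self, proc: Proc) -> bool:
        owner = self.exclusive_owner
        if owner is not None:
            # Assumption: only descendants of the exclusive owner, or
            # procedures registered as dependents, may share the lock.
            allowed = proc.has_ancestor(owner) or proc.pid in self.dependents
            if not allowed:
                return False
        self.shared_holders.add(proc.pid)
        return True


lock = TableLock()
dtp = Proc(pid=1)                          # plays the DTP
scp = Proc(pid=2)                          # plays the SCP
rit_trsp = Proc(pid=3, parent=scp)         # TRSP already in transition
child_unassign = Proc(pid=4, parent=dtp)   # child of the DTP

print(lock.try_exclusive(dtp))        # True
print(lock.try_shared(rit_trsp))      # False: blocked behind the DTP
print(lock.try_shared(child_unassign))  # True: child of the lock owner
lock.dependents.add(rit_trsp.pid)     # the "dependent procedures" idea
print(lock.try_shared(rit_trsp))      # True
```

In this model the existing RIT TRSP can only make progress if the exclusive 
lock is released (the holdLock question) or if it is given access through 
parentage or a dependents-style field.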


was (Author: prathyu6):
[~zhangduo] Even if we introduce a procedure like CloseTableRegionsProcedure 
from HBASE-28582 here, and even if we add logic to wait for the current RITs 
(instead of creating new child UNASSIGNs directly), every TRSP that has just 
started executing will be blocked trying to get the shared lock on the table 
(the DTP holds the exclusive lock), so they won't finish, right? 
Or do you mean that, if we go with this approach
>If not, let's change to use the same solution in HBASE-28582, i.e, introduce a 
>special 
>CloseTableRegionsProcedure, to close all regions for a table.
we set holdLock to false for the table?

An orthogonal thought: can we somehow also add the current RIT TRSPs as child 
procs of this procedure, so that they can get the shared lock on the table? 
CloseTableRegionsProcedure is waiting on them to finish anyway. 

> UNASSIGN proc indefinitely stuck on dead rs
> -------------------------------------------
>
>                 Key: HBASE-28522
>                 URL: https://issues.apache.org/jira/browse/HBASE-28522
>             Project: HBase
>          Issue Type: Improvement
>          Components: proc-v2, Region Assignment
>            Reporter: Prathyusha
>            Assignee: Prathyusha
>            Priority: Critical
>         Attachments: timeline.jpg
>
>
> One scenario we noticed in production:
> a DisableTableProc and an SCP were triggered at almost the same time
> 2024-03-16 17:59:23,014 INFO [PEWorker-11] procedure.DisableTableProcedure - 
> Set <TABLE_NAME> to state=DISABLING
> 2024-03-16 17:59:15,243 INFO [PEWorker-26] procedure.ServerCrashProcedure - 
> Start pid=21592440, state=RUNNABLE:SERVER_CRASH_START, locked=true; 
> ServerCrashProcedure 
> <regionserver>, splitWal=true, meta=false
> DisableTableProc creates UNASSIGN procs, and at this time the ASSIGNs of the 
> SCP have not completed
> {{2024-03-16 17:59:23,003 DEBUG [PEWorker-40] procedure2.ProcedureExecutor - 
> LOCK_EVENT_WAIT pid=21594220, ppid=21592440, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>, ASSIGN}}
> The UNASSIGN created by DisableTableProc is stuck on the dead regionserver, 
> and we had to manually bypass the UNASSIGN of DisableTableProc and then do 
> the ASSIGN.
> If we can break the retry loop so that the UNASSIGN procedure does not retry 
> when there is an SCP for that server, could we avoid manual intervention? At 
> the very least, could the DisableTableProc go to a rollback state?
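
The proposal above can be sketched as a retry loop that checks for an 
in-flight SCP before each attempt. This is only an illustration: 
`run_unassign`, `Outcome`, and both callbacks are hypothetical names, not 
HBase APIs, and `max_attempts` exists only so the sketch terminates (the 
behaviour reported above is an unbounded retry).

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    CLOSED = "closed"              # region closed on the target server
    YIELD_TO_SCP = "yield_to_scp"  # stop retrying, let the SCP handle it
    GAVE_UP = "gave_up"            # attempts exhausted (sketch-only bound)


def run_unassign(
    target_server: str,
    scp_in_progress: Callable[[str], bool],
    try_close: Callable[[str], bool],
    max_attempts: int = 5,
) -> Outcome:
    """Retry closing a region, but break out once an SCP owns the server."""
    for _ in range(max_attempts):
        # Proposed check: a dead server with an SCP will never ack the
        # close, so retrying against it cannot succeed.
        if scp_in_progress(target_server):
            return Outcome.YIELD_TO_SCP
        if try_close(target_server):
            return Outcome.CLOSED
    return Outcome.GAVE_UP


dead_servers = {"rs-dead"}  # hypothetical server name
result = run_unassign(
    "rs-dead",
    scp_in_progress=lambda s: s in dead_servers,
    try_close=lambda s: False,
)
print(result)  # Outcome.YIELD_TO_SCP
```

What the caller does on `YIELD_TO_SCP` (fail the UNASSIGN so that 
DisableTableProc can roll back, or wait for the SCP to finish) is exactly the 
open question in the description; the sketch does not settle it.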



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
