[ 
https://issues.apache.org/jira/browse/HBASE-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845000#comment-17845000
 ] 

Duo Zhang commented on HBASE-28522:
-----------------------------------

ModifyTableProcedure may have the same problem when reducing region replica 
count, and the implementation is totally wrong as ModifyTableProcedure does not 
hold exclusive lock all the time...

Let file an issue to fix it too.

The fix is DTP is more difficult, as for DTP, we want it to hold the exclusive 
lock all the time so it will not skew with other table procedures...

> UNASSIGN proc indefinitely stuck on dead rs
> -------------------------------------------
>
>                 Key: HBASE-28522
>                 URL: https://issues.apache.org/jira/browse/HBASE-28522
>             Project: HBase
>          Issue Type: Improvement
>          Components: proc-v2, Region Assignment
>            Reporter: Prathyusha
>            Assignee: Prathyusha
>            Priority: Critical
>
> One scenario we noticed in production -
> we had DisableTableProc and SCP almost triggered at similar time
> 2024-03-16 17:59:23,014 INFO [PEWorker-11] procedure.DisableTableProcedure - 
> Set <TABLE_NAME> to state=DISABLING
> 2024-03-16 17:59:15,243 INFO [PEWorker-26] procedure.ServerCrashProcedure - 
> Start pid=21592440, state=RUNNABLE:SERVER_CRASH_START, locked=true; 
> ServerCrashProcedure 
> <regionserver>, splitWal=true, meta=false
> DisabeTableProc creates unassign procs, and at this time ASSIGNs of SCP is 
> not completed
> {{2024-03-16 17:59:23,003 DEBUG [PEWorker-40] procedure2.ProcedureExecutor - 
> LOCK_EVENT_WAIT pid=21594220, ppid=21592440, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>, ASSIGN}}
> UNASSIGN created by DisableTableProc is stuck on the dead regionserver and we 
> had to manually bypass unassign of DisableTableProc and then do ASSIGN.
> If we can break the loop for UNASSIGN procedure to not retry if there is scp 
> for that server, we do not need manual intervention?, at least the 
> DisableTableProc can go to a rollback state?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to