[
https://issues.apache.org/jira/browse/HBASE-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837454#comment-17837454
]
Prathyusha edited comment on HBASE-28522 at 4/16/24 10:39 AM:
--------------------------------------------------------------
>The flow by design is SCP will interrupt the TRSP to assign the region first,
>and then unassign it.
True, from my understanding this code path should take care of it:
SCP#assignRegions
    if (regionNode.getProcedure() != null) {
      LOG.info("{} found RIT {}; {}", this, regionNode.getProcedure(), regionNode);
      regionNode.getProcedure().serverCrashed(env, regionNode, getServerName(),
        !retainAssignment);
      continue;
    }
    if (
      env.getMasterServices().getTableStateManager().isTableState(regionNode.getTable(),
        TableState.State.DISABLING)
    ) {
      // We need to change the state here otherwise the TRSP scheduled by DTP will try to
      // close the region from a dead server and will never succeed. Please see HBASE-23636
      // for more details.
      env.getAssignmentManager().regionClosedAbnormally(regionNode);
      LOG.info("{} found table disabling for region {}, set it state to ABNORMALLY_CLOSED.",
        this, regionNode);
      continue;
    }
    if (
      env.getMasterServices().getTableStateManager().isTableState(regionNode.getTable(),
        TableState.State.DISABLED)
    ) {
      // This should not happen, table disabled but has regions on server.
      LOG.warn("Found table disabled for region {}, procDetails: {}", regionNode, this);
      continue;
    }
    TransitRegionStateProcedure proc =
      TransitRegionStateProcedure.assign(env, region, !retainAssignment, null);
    regionNode.setProcedure(proc);
    addChildProcedure(proc);
---------
But we did not see any "found RIT" log lines, and the SCP was triggered a bit
before DisableTableProc set the table state to DISABLING.
So the SCP had already set the ASSIGN proc in regionNode before DisableTableProc
triggered forceCreateUnssignProcedure, which in turn overrides the current proc
of regionNode (which should be the child ASSIGN TRSP):
    if (regionNode.getProcedure() != null) {
      regionNode.unsetProcedure(regionNode.getProcedure());
    }
    return regionNode.setProcedure(TransitRegionStateProcedure.unassign(getProcedureEnvironment(),
      regionNode.getRegionInfo()));
---------------------------------------------------------------------------
The ASSIGN proc of the SCP was also waiting on the shared table lock, but
DisableTableProc must already have taken the exclusive table lock, blocking the
ASSIGN of the SCP:
2024-03-16 17:59:23,003 DEBUG [PEWorker-40]
procedure2.ProcedureExecutor - LOCK_EVENT_WAIT pid=21594220, ppid=21592440,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>,
ASSIGN
It looks like, had the SCP been triggered a bit later, it would have interrupted
the current child UNASSIGN of DisableTableProc.
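To make the ordering concrete, here is a minimal toy model in Java of the two code paths quoted above. It is only a sketch of my understanding, not HBase code: ToyRegionNode, scpAssign, forceUnassign and run are hypothetical names, and procedures are represented as plain strings. It shows that when the SCP branch runs first, no "found RIT" line is logged and the forced UNASSIGN replaces the ASSIGN, while the reverse order does hit the "found RIT" branch.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the ordering described above; not HBase code. */
public class RegionProcRace {
  /** Hypothetical stand-in for a region node: holds at most one attached procedure. */
  static final class ToyRegionNode {
    private String procedure; // null means nothing is attached

    String getProcedure() { return procedure; }
    void setProcedure(String p) { procedure = p; }
    void unsetProcedure() { procedure = null; }
  }

  /** Mimics the SCP branch: log an existing proc, otherwise attach an ASSIGN. */
  static void scpAssign(ToyRegionNode node, List<String> log) {
    if (node.getProcedure() != null) {
      log.add("found RIT " + node.getProcedure());
      return;
    }
    node.setProcedure("ASSIGN(child of SCP)");
  }

  /** Mimics the forced unassign: drop whatever is attached, then attach an UNASSIGN. */
  static void forceUnassign(ToyRegionNode node) {
    if (node.getProcedure() != null) {
      node.unsetProcedure();
    }
    node.setProcedure("UNASSIGN(child of DisableTableProc)");
  }

  /** Runs both steps in the given order; returns the log lines plus the final owner. */
  static List<String> run(boolean scpFirst) {
    ToyRegionNode node = new ToyRegionNode();
    List<String> log = new ArrayList<>();
    if (scpFirst) {
      scpAssign(node, log);
      forceUnassign(node);
    } else {
      forceUnassign(node);
      scpAssign(node, log);
    }
    log.add("owner=" + node.getProcedure());
    return log;
  }

  public static void main(String[] args) {
    System.out.println("SCP first: " + run(true));
    System.out.println("DTP first: " + run(false));
  }
}
```

In the SCP-first order the ASSIGN silently loses its slot on the node, which matches the missing "found RIT" lines in our logs.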
[~zhangduo] [~umesh9414]
> UNASSIGN proc indefinitely stuck on dead rs
> -------------------------------------------
>
> Key: HBASE-28522
> URL: https://issues.apache.org/jira/browse/HBASE-28522
> Project: HBase
> Issue Type: Improvement
> Components: proc-v2
> Reporter: Prathyusha
> Priority: Minor
>
> One scenario we noticed in production:
> DisableTableProc and SCP were triggered at almost the same time
> 2024-03-16 17:59:23,014 INFO [PEWorker-11] procedure.DisableTableProcedure -
> Set <TABLE_NAME> to state=DISABLING
> 2024-03-16 17:59:15,243 INFO [PEWorker-26] procedure.ServerCrashProcedure -
> Start pid=21592440, state=RUNNABLE:SERVER_CRASH_START, locked=true;
> ServerCrashProcedure
> <regionserver>, splitWal=true, meta=false
> DisableTableProc creates UNASSIGN procs, and at this time the ASSIGNs of the SCP
> have not completed
> {{2024-03-16 17:59:23,003 DEBUG [PEWorker-40] procedure2.ProcedureExecutor -
> LOCK_EVENT_WAIT pid=21594220, ppid=21592440,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>, ASSIGN}}
> The UNASSIGN created by DisableTableProc is stuck on the dead regionserver, and we
> had to manually bypass the UNASSIGN of DisableTableProc and then do an ASSIGN.
> If we can break the retry loop of the UNASSIGN procedure when there is an SCP
> for that server, would manual intervention no longer be needed? At least the
> DisableTableProc could then go to a rollback state?
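
The suggestion at the end of the description could be sketched roughly as follows. This is only an illustration of the idea under my own assumptions, not a proposed patch: UnassignRetrySketch, CloseAttempt, Outcome and the serversWithScp set are all hypothetical names and do not exist in HBase. The loop checks before every attempt whether a crash procedure exists for the target server and stops retrying if so.

```java
import java.util.Set;

/** Hypothetical sketch of the idea above; none of these names are HBase APIs. */
public class UnassignRetrySketch {
  enum Outcome { CLOSED, GAVE_UP_SERVER_DEAD, RETRIES_EXHAUSTED }

  /** Stand-in for sending a close request to a region server. */
  interface CloseAttempt {
    boolean tryClose(String server);
  }

  static Outcome unassign(String server, Set<String> serversWithScp,
      CloseAttempt attempt, int maxRetries) {
    for (int i = 0; i < maxRetries; i++) {
      // Check before every attempt: a crash procedure may be scheduled mid-loop.
      if (serversWithScp.contains(server)) {
        return Outcome.GAVE_UP_SERVER_DEAD;
      }
      if (attempt.tryClose(server)) {
        return Outcome.CLOSED;
      }
    }
    return Outcome.RETRIES_EXHAUSTED;
  }

  public static void main(String[] args) {
    CloseAttempt neverSucceeds = server -> false;
    System.out.println(unassign("rs1", Set.of("rs1"), neverSucceeds, 5));
    System.out.println(unassign("rs2", Set.of("rs1"), neverSucceeds, 5));
  }
}
```

What the caller does with GAVE_UP_SERVER_DEAD (roll back, or leave the region to the SCP) is the open question in this issue.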