[
https://issues.apache.org/jira/browse/HBASE-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079312#comment-18079312
]
Kiran Kumar Maturi commented on HBASE-30143:
--------------------------------------------
The region stayed stuck for a long time and was only assigned after the
HMaster was restarted:
{code:java}
---
2026-04-01 21:55:46,306 INFO
[master/<hmaster-1-host>:17000:becomeActiveMaster] master.HMaster: hmaster-1
becoming active master
2026-04-01 21:55:46,306 INFO
[master/<hmaster-1-host>:17000:becomeActiveMaster]
procedure2.ProcedureExecutor: Starting procedure executor; loading procedures
2026-04-01 21:55:46,361 INFO
[master/<hmaster-1-host>:17000:becomeActiveMaster]
procedure2.ProcedureExecutor: Loaded pid=14060510, state=FAILED, locked=true,
exception=java.io.IOException: Recovered.edits are found in Region: {ENCODED
=> fcc017f900f94981ad490e291dd70dfe}, abort split/merge to prevent data loss;
SplitTableRegionProcedure
table=tsdb, parent=fcc017f900f94981ad490e291dd70dfe,
daughterA=a7439e4c913b08c90c2ca6be66d46683,
daughterB=f67ce33a4fcf4cc4f9bc8c829857dbf1; stack ids=[0, 1, 2, 7]; held the
lock before
restarting
2026-04-01 21:55:46,363 DEBUG
[master/<hmaster-1-host>:17000:becomeActiveMaster]
procedure2.ProcedureExecutor: Re-acquired lock for pid=14060510, state=FAILED,
locked=true;
SplitTableRegionProcedure table=tsdb, parent=fcc017f900f94981ad490e291dd70dfe
2026-04-01 21:55:46,363 INFO
[master/<hmaster-1-host>:17000:becomeActiveMaster]
procedure2.ProcedureExecutor: Re-enqueueing failed procedure pid=14060510 via
failedList
2026-04-01 21:55:52,597 INFO
[master/<hmaster-1-host>:17000:becomeActiveMaster]
assignment.AssignmentManager: Loaded hbase:meta state for region
fcc017f900f94981ad490e291dd70dfe:
state=CLOSED, table=tsdb, regionLocation=null
---
2026-04-01 21:55:57,829 INFO [PEWorker-12] procedure2.ProcedureExecutor:
Initialized procedure pid=14061763, state=RUNNABLE; TransitRegionStateProcedure
table=tsdb,
region=fcc017f900f94981ad490e291dd70dfe, ASSIGN — waiting for lock held by
pid=14060510
2026-04-01 21:55:57,877 INFO [PEWorker-46] procedure2.ProcedureExecutor:
Rolled back pid=14060510, exec-time=5hrs, 37mins, 34.04sec,
exception=java.io.IOException: Recovered.edits are
found in Region: {ENCODED => fcc017f900f94981ad490e291dd70dfe}, abort
split/merge to prevent data loss; SplitTableRegionProcedure table=tsdb,
parent=fcc017f900f94981ad490e291dd70dfe,
daughterA=a7439e4c913b08c90c2ca6be66d46683,
daughterB=f67ce33a4fcf4cc4f9bc8c829857dbf1
2026-04-01 21:55:57,877 DEBUG [PEWorker-46] procedure2.ProcedureExecutor:
Released lock for pid=14060510
----
2026-04-01 21:55:57,884 INFO [PEWorker-12]
assignment.TransitRegionStateProcedure: Starting pid=14061763,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=tsdb,
region=fcc017f900f94981ad490e291dd70dfe, ASSIGN
2026-04-01 21:55:58,053 INFO [PEWorker-12] assignment.RegionStateStore:
pid=14061763 updating hbase:meta row=fcc017f900f94981ad490e291dd70dfe,
regionState=OPENING,
regionLocation=<rs-host>,16020,<startcode>
2026-04-01 21:55:58,919 INFO [PEWorker-12] assignment.RegionStateStore:
pid=14061763 updating hbase:meta row=fcc017f900f94981ad490e291dd70dfe,
regionState=OPEN, openSeqNum=2020545206,
regionLocation=<rs-host>,16020,<startcode>
2026-04-01 21:55:58,941 INFO [PEWorker-12] assignment.AssignmentManager:
Removed region fcc017f900f94981ad490e291dd70dfe from RIT list (state=OPEN)
2026-04-01 21:55:59,003 INFO [PEWorker-12] procedure2.ProcedureExecutor:
Finished pid=14061763, state=SUCCESS; TransitRegionStateProcedure table=tsdb,
region=fcc017f900f94981ad490e291dd70dfe, ASSIGN in 1.174 sec {code}
> ProcedureExecutor orphans FAILED procedures with holdLock=true when
> setRollback() races with child release()
> --------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-30143
> URL: https://issues.apache.org/jira/browse/HBASE-30143
> Project: HBase
> Issue Type: Bug
> Components: proc-v2, Region Assignment
> Affects Versions: 2.6.5, 2.5.14
> Environment: Any HBase deployment running splits/merges or other
> StateMachineProcedures with holdLock()==true under concurrent worker load
> Reporter: Kiran Kumar Maturi
> Assignee: Kiran Kumar Maturi
> Priority: Minor
>
> h3. Summary
> {{ProcedureExecutor.executeProcedure()}} can leave a
> {{StateMachineProcedure}} with {{holdLock()==true}} in an orphaned
> state: {{ProcedureState.FAILED}}, exclusive lock held, and not present on
> any scheduler queue. No event ever re-awakens it; the only recovery is master
> failover (via {{loadProcedures() ->
> failedList.forEach(scheduler::addBack)}}).
>
> In production we observed this as an HBase region stuck CLOSED for 5h 37m
> after a {{SplitTableRegionProcedure}} hit "Recovered.edits are found" during
> {{SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS}}. The region was completely
> unavailable to clients for the entire duration. Master failover released the
> lock and rollback finally ran.
>
> The cause is a race between two workers: a parent procedure calls
> {{setFailure()}} while a sibling/child procedure has not yet returned from
> {{procStack.release()}}.
>
> Relevant code paths (line numbers from branch-2.6):
> * {{ProcedureExecutor.executeProcedure()}}, lines 1414-1489: outer do-while
> loop.
> * {{RootProcedureState.setRollback()}}, line 85: guarded by {{running == 0
> && state == FAILED}}.
> * {{RootProcedureState.acquire()}}, line 138: increments {{running}};
> {{release()}} at line 150 decrements it.
> * {{ProcedureExecutor.releaseLock()}}, lines 1502-1509: skips the release
> when {{proc.holdLock(env)==true && !proc.isFinished()}}. {{isFinished()}} is
> true only for SUCCESS/ROLLEDBACK, not for FAILED.
>
> Timeline of the race:
>
> || T || Worker-A (child) || Worker-B (parent) || running || state ||
> | 0 | acquire(child) | - | 1 | RUNNING |
> | 1 | child execute returns SUCCESS | - | 1 | RUNNING |
> | 2 | countDownChildren → scheduler.addFront(parent) | - | 1 | RUNNING |
> | 3 | - | picks up parent | 1 | RUNNING |
> | 4 | - | acquire(parent) | 2 | RUNNING |
> | 5 | - | executeFromState throws, setFailure() | 2 | FAILED |
> | 6 | - | execProcedure returns | 2 | FAILED |
> | 7 | - | do-while re-enters, acquire() returns false | 2 | FAILED |
> | 8 | - | setRollback() returns false (running != 0) | 2 | FAILED |
> | 9 | - | else-branch, wasExecuted()==true, break; | 2 | FAILED |
> | 10 | release(child) | Worker-B returns | 1 | FAILED |
>
> From T+10 on, the procedure is FAILED, {{holdLock=true}} has prevented
> {{releaseLock()}} at T+6 from releasing the xlock, and nothing re-enqueues
> the root. The child's {{countDownChildren}} wake-up was consumed at T+3 and
> there is no further event generator.
>
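
The timeline in the quoted description can be replayed as a small, single-threaded sketch. This is not HBase code: the class {{OrphanedProcedureSketch}}, its fields, and the {{replay()}} method are hypothetical stand-ins that model only the {{running}} counter, the root state, the {{setRollback()}} guard, and the {{releaseLock()}} skip condition as the description states them. The counter values follow the table's "running" column rather than the actual branch-2.6 source.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical single-threaded replay of the T+0..T+10 timeline; none of this is HBase code.
final class OrphanedProcedureSketch {
    enum State { RUNNING, FAILED, SUCCESS, ROLLEDBACK }

    int running = 0;                 // the "running" column of the table
    State state = State.RUNNING;     // the "state" column of the table
    boolean lockHeld = true;         // parent's exclusive lock, holdLock()==true
    boolean rollbackStarted = false; // set only if the setRollback() guard passes
    final Deque<String> scheduler = new ArrayDeque<>();

    // acquire() is refused once the root is FAILED; otherwise it bumps the counter.
    boolean acquire() {
        if (state == State.FAILED) {
            return false;
        }
        running++;
        return true;
    }

    void release() {
        running--;
    }

    // Guard quoted in the issue: running == 0 && state == FAILED.
    boolean setRollback() {
        return running == 0 && state == State.FAILED;
    }

    // isFinished() covers SUCCESS/ROLLEDBACK only, so a FAILED procedure is not finished.
    boolean isFinished() {
        return state == State.SUCCESS || state == State.ROLLEDBACK;
    }

    // Release is skipped when holdLock is true and the procedure is not finished.
    void releaseLock(boolean holdLock) {
        if (holdLock && !isFinished()) {
            return;
        }
        lockHeld = false;
    }

    static OrphanedProcedureSketch replay() {
        OrphanedProcedureSketch s = new OrphanedProcedureSketch();
        s.acquire();                          // T+0  Worker-A: acquire(child)
        // T+1: child execute returns SUCCESS (counter unchanged)
        s.scheduler.addFirst("parent");       // T+2  countDownChildren wakes the parent
        s.scheduler.pollFirst();              // T+3  Worker-B picks up the parent
        s.acquire();                          // T+4  Worker-B: acquire(parent)
        s.state = State.FAILED;               // T+5  executeFromState throws, setFailure()
        s.releaseLock(true);                  // T+6  execProcedure returns, release skipped
        boolean reacquired = s.acquire();     // T+7  loop re-enters, acquire() is refused
        if (!reacquired && s.setRollback()) { // T+8  guard fails because running != 0
            s.rollbackStarted = true;
        }
        // T+9: wasExecuted()==true, so Worker-B breaks without re-queueing the parent
        s.release();                          // T+10 Worker-A: release(child)
        return s;
    }

    public static void main(String[] args) {
        OrphanedProcedureSketch s = replay();
        System.out.println("state=" + s.state + " running=" + s.running
            + " lockHeld=" + s.lockHeld + " queued=" + s.scheduler.size()
            + " rollbackStarted=" + s.rollbackStarted);
    }
}
```

Under these assumptions the replay ends with the root FAILED, the lock still held, the scheduler queue empty, and rollback never started, which is the orphaned state the description reports.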
--
This message was sent by Atlassian Jira
(v8.20.10#820010)