[
https://issues.apache.org/jira/browse/HBASE-29806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051124#comment-18051124
]
ConfX edited comment on HBASE-29806 at 1/11/26 11:14 PM:
---------------------------------------------------------
[~zhangduo]
Thank you for the clarification. You're right that TRSP and SCP cannot be
rolled back. After deeper analysis, the issue is different from what I
initially described:
The Root Cause: Orphaned Child, Not Rollback
The NPE occurs due to an orphaned child procedure, not parent rollback. Here's
the exact flow:
1. Procedure Hierarchy: SCP -> TRSP -> OpenRegionProcedure
(RegionRemoteProcedureBase)
2. Parent TRSP Completes (success or failure): Marked isFinished=true in
procedure store
3. Crash Before Child Cleanup: Master crashes/killed before child
OpenRegionProcedure is cleaned up. Child remains in store with
isFinished=false.
4. During Recovery (ProcedureExecutor.loadProcedures()):
{code:java}
if (finished) {
completed.put(proc.getProcId(), ...); // Parent TRSP goes here
} else {
procedures.put(proc.getProcId(), proc); // Child OpenRegionProcedure goes
here
} {code}
5. NPE in afterReplay():
{code:java}
// Child's afterReplay() calls:
getParent(env).attachRemoteProc(this);
// getParent() uses:
getProcedure(getParentProcId()); // Only checks `procedures` map, NOT
`completed` map! {code}
5. Since parent is in completed map (not procedures map), getProcedure()
returns null → NPE
Key Insight
ProcedureExecutor.getProcedure(procId) only searches the procedures map:
{code:java}
ProcedureExecutor.getProcedure(procId) only searches the procedures map:
public Procedure<TEnvironment> getProcedure(final long procId) {
return procedures.get(procId); // Does NOT check `completed` map!
} {code}
h2. Production Trigger
This can happen in production with:
- kill -9 on master during procedure execution
- OOM killer terminating master
- Hardware failure / power loss
- Network partition causing ZK session expiry
The window is: between parent completion persisted and child cleanup
persisted.
was (Author: JIRAUSER296392):
[~zhangduo]
Thank you for the feedback. You're right that TRSP and SCP don't support
rollback in normal operation. Let me clarify the scenario based on deeper code
analysis:
The NPE occurs in afterReplay() which is called during procedure loading:
{code:java}
// ProcedureExecutor.java line 528-529
runnableList.forEach(p -> {
p.afterReplay(getEnvironment()); // NPE happens here for child procedure
...
}); {code}
The getParent(env) method calls:
{code:java}
return (TransitRegionStateProcedure)
env.getMasterServices().getMasterProcedureExecutor()
.getProcedure(getParentProcId()); // Returns null {code}
> master procedure executor fail restart due to NPE in
> RegionRemoteProcedureBase.afterReplay()
> --------------------------------------------------------------------------------------------
>
> Key: HBASE-29806
> URL: https://issues.apache.org/jira/browse/HBASE-29806
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 2.6.3, 2.6.4
> Reporter: ConfX
> Priority: Critical
> Attachments: afterReplay-NPE.patch
>
>
> h2. Summary
> `RegionRemoteProcedureBase.afterReplay()` does not handle the case where the
> parent `TransitRegionStateProcedure` has already been completed
> (finished/rolled back). This results in a `NullPointerException` when
> attempting to attach the remote procedure to a non-existent parent during
> procedure executor restart.
> This causes master procedure executor to fail during restart.
>
> h2. Root Cause
> *File:*
> `hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionRemoteProcedureBase.java`
>
> *Lines 277-280 (getParent method):*
> {code:java}
> private TransitRegionStateProcedure getParent(MasterProcedureEnv env) {
> return (TransitRegionStateProcedure)
> env.getMasterServices().getMasterProcedureExecutor()
> .getProcedure(getParentProcId());
> } {code}
> *Lines 391-393 (afterReplay method):*
> {code:java}
> @Override
> protected void afterReplay(MasterProcedureEnv env) {
> getParent(env).attachRemoteProc(this); // NPE if parent is null
> } {code}
> h2. Why the Bug Occurs
> 1. *Procedure Loading Behavior:* When `ProcedureExecutor` loads procedures
> after restart:
> - Finished procedures go to the `completed` map
> - Unfinished procedures go to the `procedures` map
> 2. *getProcedure Only Checks Active Procedures:* The `getProcedure(procId)`
> method at `ProcedureExecutor.java:1191-1192` only checks the `procedures` map:
> {code:java}
> public Procedure<TEnvironment> getProcedure(final long procId) {
> return procedures.get(procId); // Does not check 'completed' map
> }{code}
> 3. *Race Condition During Rollback:* During rollback of a
> `ServerCrashProcedure`:
> - The parent `TransitRegionStateProcedure` may complete/finish (rollback
> successfully)
> - Its child `RegionRemoteProcedureBase` may still be pending in the store
> - On restart, the parent is in `completed` map, child is in `procedures`
> map
> - `afterReplay()` is called on the child, but `getParent()` returns `null`
> 4. *Missing Null Check:* The `afterReplay()` method directly calls
> `getParent(env).attachRemoteProc(this)` without checking if `getParent()`
> returns null.
>
> h2. StackTrace
> {code:java}
> java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.getParent(RegionRemoteProcedureBase.java:279)
> at
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.afterReplay(RegionRemoteProcedureBase.java:392)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.lambda$pushProceduresAfterLoad$5(ProcedureExecutor.java:529)
> at java.util.ArrayList.forEach(ArrayList.java:1259)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.pushProceduresAfterLoad(ProcedureExecutor.java:528)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.loadProcedures(ProcedureExecutor.java:612)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$400(ProcedureExecutor.java:77)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$2.load(ProcedureExecutor.java:351)
> at
> org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.load(RegionProcedureStore.java:284)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:342)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:665)
> {code}
> h2. Proposed Fix
> *Option 1: Null Check in afterReplay()*
> {code:java}
> @Override
> protected void afterReplay(MasterProcedureEnv env) {
> TransitRegionStateProcedure parent = getParent(env);
> if (parent != null) {
> parent.attachRemoteProc(this);
> } else {
> // Parent procedure has already completed - this child should be cleaned
> up
> LOG.warn("Parent procedure {} not found for {}. Parent may have been
> completed/rolled back.",
> getParentProcId(), this);
> // Consider setting this procedure's state to allow cleanup
> }
> } {code}
> *Option 2: Lookup in Both Maps*
> Modify `getParent()` to also check the completed procedures:
> {code:java}
> private TransitRegionStateProcedure getParent(MasterProcedureEnv env) {
> Procedure<?> parent = env.getMasterServices().getMasterProcedureExecutor()
> .getProcedure(getParentProcId());
> if (parent == null) {
> // Check if parent is in completed procedures
> parent = env.getMasterServices().getMasterProcedureExecutor()
> .getResult(getParentProcId());
> }
> return (TransitRegionStateProcedure) parent;
> } {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)