teamconfx commented on PR #7667: URL: https://github.com/apache/hbase/pull/7667#issuecomment-3793694428
Thank you for the feedback. You're correct that under normal operation, a
parent procedure cannot complete while a sub procedure is still pending - this
is the intended invariant.
However, this NPE occurs specifically during crash recovery
(afterReplay()), not during normal execution. During crash recovery, the
procedure store may contain inconsistent state that violates this invariant due
to:
1. Partial state persisted before crash - The multi-step rollback process
can be interrupted at various points
2. Non-atomic store operations - The store.delete() calls for parent and
child procedures are not atomic
The HBase codebase already handles this scenario in other places. For
example, in ProcedureExecutor.countDownChildren() (line 1985-1989):
```java
Procedure<TEnvironment> parent =
procedures.get(procedure.getParentProcId());
if (parent == null) {
assert procStack.isRollingback();
return;
}
```
This shows that null parent scenarios are already recognized and handled
during rollback.
The `setKillAndToggleBeforeStoreUpdateInRollback` test mechanism exists
specifically to test crash recovery scenarios. The fact that this test exposes
the NPE demonstrates that crash recovery can lead to states where a child
exists
without its parent.
The proposed fix is a defensive null check - consistent with the pattern
already used in countDownChildren(). This ensures the master can recover even
from unexpected procedure store states, rather than failing startup with an
NPE.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
