teamconfx commented on PR #7667:
URL: https://github.com/apache/hbase/pull/7667#issuecomment-3793694428

     Thank you for the feedback. You're correct that under normal operation, a 
parent procedure cannot complete while a sub procedure is still pending - this 
is the intended invariant.                                                      
 
                                                                                
                                                                                
                                                                               
     However, this NPE occurs specifically during crash recovery 
(afterReplay()), not during normal execution. During crash recovery, the 
procedure store may contain inconsistent state that violates this invariant due 
to:                  
                                                                                
                                                                                
                                                                               
     1. Partial state persisted before crash - The multi-step rollback process 
can be interrupted at various points                                            
                                                                                
     2. Non-atomic store operations - The store.delete() calls for parent and 
child procedures are not atomic                                                 
                                                                                
 
                                                                                
                                                                                
                                                                               
     The HBase codebase already handles this scenario in other places. For 
example, in ProcedureExecutor.countDownChildren() (line 1985-1989):             
                                                                                
    
   ```java                                                                      
                                                                                
                                                                                
      
     Procedure<TEnvironment> parent = 
procedures.get(procedure.getParentProcId());                                    
                                                                                
                                         
     if (parent == null) {                                                      
                                                                                
                                                                               
       assert procStack.isRollingback();                                        
                                                                                
                                                                               
       return;                                                                  
                                                                                
                                                                               
     }         
   ```                                                                          
                                                                                
                                                                      
                                                                                
                                                                                
                                                                               
     This shows that null parent scenarios are already recognized and handled 
during rollback.                                                                
                                                                                
 
                                                                                
                                                                                
                                                                               
     The `setKillAndToggleBeforeStoreUpdateInRollback` test mechanism exists 
specifically to test crash recovery scenarios. The fact that this test exposes 
the NPE demonstrates that crash recovery can lead to states where a child 
exists     
     without its parent.                                                        
                                                                                
                                                                               
                                                                                
                                                                                
                                                                               
     The proposed fix is a defensive null check - consistent with the pattern 
already used in countDownChildren(). This ensures the master can recover even 
from unexpected procedure store states, rather than failing startup with an 
NPE. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to