[ 
https://issues.apache.org/jira/browse/HBASE-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078821#comment-18078821
 ] 

Duo Zhang commented on HBASE-30143:
-----------------------------------

A region in the CLOSED state should not have recovered.edits. Could you please 
check the earlier logs from when the region was opened? Did it successfully 
remove the recovered edits after opening?

>  ProcedureExecutor orphans FAILED procedures with holdLock=true when 
> setRollback() races with child release() 
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-30143
>                 URL: https://issues.apache.org/jira/browse/HBASE-30143
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2, Region Assignment
>    Affects Versions: 2.6.5, 2.5.14
>         Environment: Any HBase deployment running splits/merges or other 
> StateMachineProcedures with holdLock()==true under concurrent worker load
>            Reporter: Kiran Kumar Maturi
>            Assignee: Kiran Kumar Maturi
>            Priority: Minor
>
> h3. Summary
> {{ProcedureExecutor.executeProcedure()}} can leave a 
> {{StateMachineProcedure}} with {{holdLock()==true}} in an orphaned 
> state: {{ProcedureState.FAILED}}, exclusive lock held, and not present on 
> any scheduler queue. No event ever re-awakens it; the only recovery is master 
> failover (via {{loadProcedures() -> 
> failedList.forEach(scheduler::addBack)}}).
>
> In production we observed this as an HBase region stuck CLOSED for 5h 37m 
> after a {{SplitTableRegionProcedure}} hit "Recovered.edits are found" during 
> {{SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS}}. The region was completely 
> unavailable to clients for the entire duration. Master failover released the 
> lock and rollback finally ran.
>
> The race is between two workers: a parent procedure calls {{setFailure()}} 
> while a sibling/child procedure has not yet returned from 
> {{procStack.release()}}.
> Relevant code paths (line numbers from branch-2.6):
> * {{ProcedureExecutor.executeProcedure()}}, lines 1414-1489: outer do-while loop.
> * {{RootProcedureState.setRollback()}}, line 85: guarded by {{running == 0 && state == FAILED}}.
> * {{RootProcedureState.acquire()}}, line 138: increments {{running}}; {{release()}} at line 150 decrements it.
> * {{ProcedureExecutor.releaseLock()}}, lines 1502-1509: skips release when {{proc.holdLock(env)==true && !proc.isFinished()}}. {{isFinished()}} is only true for SUCCESS/ROLLEDBACK, NOT for FAILED.
>
> Timeline of the race:
>
> || T || Worker-A (child) || Worker-B (parent) || running || state ||
> | 0 | acquire(child) | - | 1 | RUNNING |
> | 1 | child execute returns SUCCESS | - | 1 | RUNNING |
> | 2 | countDownChildren → scheduler.addFront(parent) | - | 1 | RUNNING |
> | 3 | - | picks up parent | 1 | RUNNING |
> | 4 | - | acquire(parent) | 2 | RUNNING |
> | 5 | - | executeFromState throws, setFailure() | 2 | FAILED |
> | 6 | - | execProcedure returns | 2 | FAILED |
> | 7 | - | do-while re-enters, acquire() returns false | 2 | FAILED |
> | 8 | - | setRollback() returns false (running != 0) | 2 | FAILED |
> | 9 | - | else-branch, wasExecuted()==true, break; | 2 | FAILED |
> | 10 | release(child) | Worker-B returns | 1 | FAILED |
>
> From T+10: the procedure is FAILED, {{holdLock=true}} prevented 
> {{releaseLock()}} at T+6 from releasing the xlock, and nothing re-enqueues 
> the root. The child's {{countDownChildren}} wake-up was consumed at T+3 and 
> there is no further event generator.
>  
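
The quoted timeline can be replayed single-threaded with a toy model, which may help when checking the reasoning against the logs. This is only a sketch under assumptions: the class and method names below (OrphanedRollbackReplay, RootState, keepsLock, Outcome) are hypothetical and not HBase source code; the setRollback() guard and the releaseLock() skip condition are copied from the description above; and acquire() refusing once the state is no longer RUNNING is inferred from step T+7 of the table.

```java
// Toy model of the reported race. NOT HBase code: all names here are
// hypothetical stand-ins that mirror the identifiers used in the report.
final class OrphanedRollbackReplay {

  enum State { RUNNING, FAILED, ROLLINGBACK }

  /** Simplified stand-in for the root procedure state described in the report. */
  static final class RootState {
    State state = State.RUNNING;
    int running = 0;

    // Assumption (from T+7 in the table): acquire() fails once state != RUNNING.
    synchronized boolean acquire() {
      if (state != State.RUNNING) {
        return false;
      }
      running++;
      return true;
    }

    synchronized void release() {
      running--;
    }

    synchronized void setFailure() {
      state = State.FAILED;
    }

    // Guard quoted in the report: running == 0 && state == FAILED.
    synchronized boolean setRollback() {
      if (running == 0 && state == State.FAILED) {
        state = State.ROLLINGBACK;
        return true;
      }
      return false;
    }
  }

  /** Skip condition quoted in the report: the lock is kept when holdLock && !finished. */
  static boolean keepsLock(boolean holdLock, boolean finished) {
    return holdLock && !finished;
  }

  /** What the replay observed at the interesting steps. */
  static final class Outcome {
    boolean lockKept;        // T+6
    boolean reacquired;      // T+7
    boolean rollbackStarted; // T+8
    int running;             // after T+10
    State state;             // after T+10
  }

  /** Replays T+0 .. T+10 from the table in order, on one thread. */
  static Outcome replay() {
    RootState root = new RootState();
    Outcome out = new Outcome();

    root.acquire();                            // T+0  Worker-A: acquire(child), running = 1
    // T+1 .. T+3: child succeeds, parent is re-queued and picked up by Worker-B.
    root.acquire();                            // T+4  Worker-B: acquire(parent), running = 2
    root.setFailure();                         // T+5  executeFromState throws
    out.lockKept = keepsLock(true, false);     // T+6  FAILED does not count as finished
    out.reacquired = root.acquire();           // T+7  do-while re-enters
    out.rollbackStarted = root.setRollback();  // T+8  child has not released yet
    // T+9: Worker-B breaks out of the loop; nothing re-queues the root.
    root.release();                            // T+10 Worker-A: release(child)

    out.running = root.running;
    out.state = root.state;
    return out;
  }

  public static void main(String[] args) {
    Outcome o = replay();
    System.out.println("lockKept=" + o.lockKept
        + " reacquired=" + o.reacquired
        + " rollbackStarted=" + o.rollbackStarted
        + " running=" + o.running
        + " state=" + o.state);
  }
}
```

In this model the only setRollback() attempt happens while the child is still counted in running, so the state stays FAILED with the lock kept, matching the last row of the table. It says nothing about how recovered.edits came to exist for the CLOSED region.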



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
