[ 
https://issues.apache.org/jira/browse/HBASE-29713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18039007#comment-18039007
 ] 

Hudson commented on HBASE-29713:
--------------------------------

Results for branch branch-2.5
        [build #750 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/750/]:
 (x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/750/General_20Nightly_20Build_20Report/]


(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/750/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/750/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/750/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 jdk17 hadoop3 checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/750/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk17 hadoop 3.2.4 backward compatibility checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/750/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 jdk17 hadoop 3.3.5 backward compatibility checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/750/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 jdk17 hadoop 3.3.6 backward compatibility checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/750/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]




(x) {color:red}-1 source release artifact{color}
-- Something went wrong with this stage, [check relevant console 
output|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/750//console].


(x) {color:red}-1 client integration test{color}
-- Something went wrong with this stage, [check relevant console 
output|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/750//console].


> HMaster losing track of SplitWalProcedure because of race condition while 
> suspending and waking up the procedure
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29713
>                 URL: https://issues.apache.org/jira/browse/HBASE-29713
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.5.0, 2.5.13
>            Reporter: Umesh Kumar Kumawat
>            Assignee: Umesh Kumar Kumawat
>            Priority: Major
>              Labels: pull-request-available
>
> When there are too many SplitWalProcedure instances at a time and no worker 
> is available, we suspend the procedure that tries to acquire a worker and 
> wake it again once some other procedure completes and releases its 
> worker. 
> ([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L155-L164])
>  
> {code:java}
> public ServerName acquireSplitWALWorker(Procedure<?> procedure)
>   throws ProcedureSuspendedException {
>   Optional<ServerName> worker = splitWorkerAssigner.acquire();
>   if (worker.isPresent()) {
>     LOG.debug("Acquired split WAL worker={}", worker.get());
>     return worker.get();
>   }
>   splitWorkerAssigner.suspend(procedure);
>   throw new ProcedureSuspendedException();
> }{code}
>  
> [Definition|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L228C1-L231C6]
>  of splitWorkerAssigner.suspend(procedure)
>  
> {code:java}
>     public void suspend(Procedure<?> proc) {
>       event.suspend();
>       event.suspendIfNotReady(proc);
>     }
> {code}
>  
> [Definition of both event.suspend() and 
> event.suspendIfNotReady(proc)|https://github.com/apache/hbase/blob/branch-2.5/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureEvent.java#L47-L60]
> {code:java}
> /** Mark the event as not ready. */
>   public synchronized void suspend() {
>     ready = false;
>     if (LOG.isTraceEnabled()) {
>       LOG.trace("Suspend " + toString());
>     }
>   }
>   /**
>    * Returns true if event is not ready and adds procedure to suspended
>    * queue, else returns false.
>    */
>   public synchronized boolean suspendIfNotReady(Procedure proc) {
>     if (!ready) {
>       suspendedProcedures.addLast(proc);
>     }
>     return !ready;
>   }
> {code}
>  
> Notice that in event.suspendIfNotReady(proc) we add the procedure to the 
> suspendedProcedures list only if ready is false.
>  
> *Flow of releasing the worker*
> Once a SplitWalProcedure completes and a worker is freed, we wake this 
> event and mark it ready. 
> ([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L233-L237])
> {code:java}
>     public void wake(MasterProcedureScheduler scheduler) {
>       if (!event.isReady()) {
>         event.wake(scheduler);
>       }
>     }{code}
>  
> The suspend() and wake() methods in SplitWorkerAssigner are not synchronized, 
> allowing this race condition.
>   Race Condition Timeline:
>   1. [Thread-1, suspending procedure] suspend() called
>   2. [Thread-1] event.suspend() called
>      → Sets ready = false
>   3. [Thread-1] THREAD PAUSES before next line
>   4. [Thread-2, releasing worker] wake() called
>   5. [Thread-2] if (!event.isReady()) → TRUE
>   6. [Thread-2] event.wake(scheduler) called
>      → Sets ready = true
>      → Wakes suspended queue (empty!)
>   7. [Thread-1] event.suspendIfNotReady(proc) at line 232
>      → Checks ready = true
>      → Does NOT add procedure to suspended queue
>   8. ProcedureExecutor considers this procedure SUSPENDED. There is now no
>      way to add this procedure back to the scheduler queue.
>      → NEVER WOKEN UP!
>  
> In the 2.6 branch, when we introduced WorkerAssigner.java as part of 
> SnapshotProcedure, we changed these methods to be synchronized, which solved 
> the problem.
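
The interleaving described above can be replayed with a small, self-contained model. This is only a sketch under stated assumptions, not HBase code: `LostWakeupSketch`, `Event`, and `SynchronizedAssigner` are hypothetical stand-ins for ProcedureEvent and the assigner, a `String` stands in for a Procedure, and a plain `List` stands in for the scheduler's run queue. The first method replays steps 2 to 7 of the timeline on one thread to show the procedure ending up in neither queue; the second runs suspend and wake on two threads against an assigner whose methods are synchronized, as the description says was done in 2.6.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical, simplified model of the race; none of these types are HBase classes. */
class LostWakeupSketch {

  /** Stand-in for ProcedureEvent: a ready flag plus a queue of suspended procedures. */
  static class Event {
    private boolean ready = false;
    private final Deque<String> suspended = new ArrayDeque<>();

    synchronized void suspend() { ready = false; }

    synchronized boolean suspendIfNotReady(String proc) {
      if (!ready) {
        suspended.addLast(proc);
      }
      return !ready;
    }

    synchronized boolean isReady() { return ready; }

    synchronized boolean isSuspended(String proc) { return suspended.contains(proc); }

    /** Marks the event ready and moves every suspended procedure to the run queue. */
    synchronized void wake(List<String> runQueue) {
      ready = true;
      runQueue.addAll(suspended);
      suspended.clear();
    }
  }

  /** Replays the racy interleaving step by step; returns true if the procedure is lost. */
  static boolean racyInterleavingLosesProcedure() {
    Event event = new Event();
    List<String> runQueue = new ArrayList<>();
    event.suspend();                                    // Thread-1: ready = false
    if (!event.isReady()) {                             // Thread-2: wake() runs in the gap
      event.wake(runQueue);                             //   ready = true, nothing to wake
    }
    boolean queued = event.suspendIfNotReady("pid=1");  // Thread-1: sees ready = true
    return !queued && runQueue.isEmpty();               // in neither queue: lost
  }

  /** Assigner whose suspend() and wake() are each atomic with respect to the other. */
  static class SynchronizedAssigner {
    private final Event event = new Event();

    synchronized void suspend(String proc) {
      event.suspend();
      event.suspendIfNotReady(proc);
    }

    synchronized void wake(List<String> runQueue) {
      if (!event.isReady()) {
        event.wake(runQueue);
      }
    }

    boolean isTracked(String proc) { return event.isSuspended(proc); }
  }

  /** Races suspend() against wake() repeatedly; returns true if no procedure was ever lost. */
  static boolean synchronizedAssignerNeverLoses(int rounds) {
    for (int i = 0; i < rounds; i++) {
      SynchronizedAssigner assigner = new SynchronizedAssigner();
      List<String> runQueue = new ArrayList<>();
      Thread suspender = new Thread(() -> assigner.suspend("pid=1"));
      Thread waker = new Thread(() -> assigner.wake(runQueue));
      suspender.start();
      waker.start();
      try {
        suspender.join();
        waker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
      // Whichever thread won, the procedure must be suspended or runnable.
      if (!assigner.isTracked("pid=1") && !runQueue.contains("pid=1")) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println("racy interleaving loses procedure: " + racyInterleavingLosesProcedure());
    System.out.println("synchronized assigner never loses: " + synchronizedAssignerNeverLoses(200));
  }
}
```

With both methods synchronized on the assigner, only two orderings remain. If suspend runs first, wake moves the procedure to the run queue. If wake runs first, the procedure stays in the suspended queue until the next wake. In both cases it remains tracked.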



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
