Umesh Kumar Kumawat created HBASE-29713:
-------------------------------------------

             Summary: HMaster losing track of SplitWALProcedure because of a race 
condition while suspending and waking up the procedure
                 Key: HBASE-29713
                 URL: https://issues.apache.org/jira/browse/HBASE-29713
             Project: HBase
          Issue Type: Bug
          Components: proc-v2
    Affects Versions: 2.5.13, 2.5.0
            Reporter: Umesh Kumar Kumawat


When too many SplitWALProcedures run at the same time and no worker is 
available, we suspend the procedure that tries to acquire a worker, then wake 
it again once another procedure completes and releases its worker.

([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L155-L164])

 
{code:java}
public ServerName acquireSplitWALWorker(Procedure<?> procedure)
  throws ProcedureSuspendedException {
  Optional<ServerName> worker = splitWorkerAssigner.acquire();
  if (worker.isPresent()) {
    LOG.debug("Acquired split WAL worker={}", worker.get());
    return worker.get();
  }
  splitWorkerAssigner.suspend(procedure);
  throw new ProcedureSuspendedException();
}{code}
 

[Definition|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L228C1-L231C6]
 of splitWorkerAssigner.suspend(procedure)

 
{code:java}
    public void suspend(Procedure<?> proc) {
      event.suspend();
      event.suspendIfNotReady(proc);
    }
{code}
 

[Definition of both event.suspend() and 
event.suspendIfNotReady(proc)|https://github.com/apache/hbase/blob/branch-2.5/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureEvent.java#L47-L60]
{code:java}
/** Mark the event as not ready. */
  public synchronized void suspend() {
    ready = false;
    if (LOG.isTraceEnabled()) {
      LOG.trace("Suspend " + toString());
    }
  }
  /**
   * Returns true if event is not ready and adds procedure to suspended queue,
   * else returns false.
   */
  public synchronized boolean suspendIfNotReady(Procedure proc) {
    if (!ready) {
      suspendedProcedures.addLast(proc);
    }
    return !ready;
  }
{code}
 

Note that event.suspendIfNotReady(proc) adds the procedure to the 
suspendedProcedures queue only if ready is false.

 

*Flow of releasing the worker*

Once a SplitWALProcedure completes and its worker is freed, we wake this 
event, which marks it ready (ready = true). 
([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L233-L237])
{code:java}
    public void wake(MasterProcedureScheduler scheduler) {
      if (!event.isReady()) {
        event.wake(scheduler);
      }
    }{code}
 

The suspend() and wake() methods in SplitWorkerAssigner are not synchronized, 
allowing this race condition.

  Race Condition Timeline:
  Thread-1 (Suspending Procedure):           Thread-2 (Releasing Worker):
  1. suspend() called 
  2. event.suspend() called
     → Sets ready = false
  3. [THREAD PAUSES before next line]        4. wake() called 
                                            5. if (!event.isReady()) → TRUE
                                            6. event.wake(scheduler) called
                                               → Sets ready = true
                                               → Wakes suspended queue (empty!)
  7. event.suspendIfNotReady(proc) at line 232
     → Checks ready = true
     → Does NOT add procedure to suspended queue
  8. ProcedureExecutor considers this procedure SUSPENDED, and there is now 
     no way to add it back to the scheduler queue.
     → NEVER WOKEN UP!
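
The interleaving in the timeline above can be replayed deterministically on a single thread. The following is a minimal sketch, not HBase code: MiniEvent is a hypothetical stand-in that assumes only the ready flag and suspended queue shown in the ProcedureEvent snippet, and procedures are represented as plain strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical, simplified stand-in for ProcedureEvent; NOT the real HBase class. */
class MiniEvent {
  private boolean ready = true;
  private final Deque<String> suspended = new ArrayDeque<>();

  synchronized void suspend() {
    ready = false;
  }

  synchronized boolean suspendIfNotReady(String proc) {
    if (!ready) {
      suspended.addLast(proc);
    }
    return !ready;
  }

  synchronized boolean isReady() {
    return ready;
  }

  /** Marks the event ready and returns the drained queue (stands in for rescheduling). */
  synchronized List<String> wake() {
    ready = true;
    List<String> woken = new ArrayList<>(suspended);
    suspended.clear();
    return woken;
  }
}

class LostWakeupDemo {
  /**
   * Replays the timeline steps in order. Returns true if the procedure was
   * either queued on the event or handed back by wake(), false if it was lost.
   */
  static boolean replayRace() {
    MiniEvent event = new MiniEvent();
    event.suspend();                                     // Thread-1, step 2
    List<String> woken = new ArrayList<>();
    if (!event.isReady()) {                              // Thread-2, step 5
      woken = event.wake();                              // Thread-2, step 6 (queue is empty)
    }
    boolean queued = event.suspendIfNotReady("proc-1");  // Thread-1, step 7
    return queued || woken.contains("proc-1");
  }

  public static void main(String[] args) {
    System.out.println("procedure tracked: " + replayRace());
  }
}
```

In this replay the procedure is neither queued nor woken, which matches step 8: the individual ProcedureEvent methods are synchronized, but the two-call sequence in suspend() is not atomic.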

 

In the 2.6 branch, WorkerAssigner.java was introduced as part of 
SnapshotProcedure, and these methods are synchronized there, which solves the 
problem.
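
As a rough illustration of why synchronizing both methods closes the window, here is a sketch under stated assumptions. SketchEvent and SynchronizedAssignerSketch are hypothetical names, the event is a simplified stand-in rather than the real ProcedureEvent, and the actual WorkerAssigner in 2.6 may differ in detail.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/** Hypothetical, simplified stand-in for ProcedureEvent; NOT the real HBase class. */
class SketchEvent {
  private boolean ready = true;
  private final Deque<String> suspended = new ArrayDeque<>();

  synchronized void suspend() {
    ready = false;
  }

  synchronized boolean suspendIfNotReady(String proc) {
    if (!ready) {
      suspended.addLast(proc);
    }
    return !ready;
  }

  synchronized boolean isReady() {
    return ready;
  }

  synchronized List<String> wake() {
    ready = true;
    List<String> woken = new ArrayList<>(suspended);
    suspended.clear();
    return woken;
  }
}

class SynchronizedAssignerSketch {
  private final SketchEvent event = new SketchEvent();

  /** Both event calls run under the assigner lock, so wake() cannot interleave. */
  synchronized boolean suspend(String proc) {
    event.suspend();
    return event.suspendIfNotReady(proc);
  }

  synchronized List<String> wake() {
    return event.isReady() ? Collections.<String>emptyList() : event.wake();
  }

  /** Races one suspend against one wake per round; true if no procedure was lost. */
  static boolean noLostWakeups(int rounds) {
    for (int i = 0; i < rounds; i++) {
      SynchronizedAssignerSketch assigner = new SynchronizedAssignerSketch();
      List<String> woken = Collections.synchronizedList(new ArrayList<>());
      Thread suspender = new Thread(() -> assigner.suspend("proc"));
      Thread waker = new Thread(() -> woken.addAll(assigner.wake()));
      suspender.start();
      waker.start();
      try {
        suspender.join();
        waker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
      woken.addAll(assigner.wake()); // simulates a later worker release
      if (!woken.contains("proc")) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println("no lost wakeups: " + noLostWakeups(1000));
  }
}
```

With the lock held across both calls, wake() runs either entirely before suspend() (the procedure is queued and picked up by the next release) or entirely after it (the procedure is woken immediately), so the procedure is always reachable.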



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
