Umesh Kumar Kumawat created HBASE-29713:
-------------------------------------------
Summary: HMaster losing track of SplitWalProcedure because of a race
condition while suspending and waking up the procedure
Key: HBASE-29713
URL: https://issues.apache.org/jira/browse/HBASE-29713
Project: HBase
Issue Type: Bug
Components: proc-v2
Affects Versions: 2.5.13, 2.5.0
Reporter: Umesh Kumar Kumawat
When too many SplitWalProcedures run at the same time and no worker is
available, we suspend the procedure that tries to acquire a worker and wake
it again once another procedure completes and releases its
worker.
([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L155-L164])
{code:java}
public ServerName acquireSplitWALWorker(Procedure<?> procedure)
  throws ProcedureSuspendedException {
  Optional<ServerName> worker = splitWorkerAssigner.acquire();
  if (worker.isPresent()) {
    LOG.debug("Acquired split WAL worker={}", worker.get());
    return worker.get();
  }
  splitWorkerAssigner.suspend(procedure);
  throw new ProcedureSuspendedException();
}{code}
[Definition|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L228C1-L231C6]
of splitWorkerAssigner.suspend(procedure)
{code:java}
public void suspend(Procedure<?> proc) {
  event.suspend();
  event.suspendIfNotReady(proc);
}
{code}
[Definition of both event.suspend() and
event.suspendIfNotReady(proc)|https://github.com/apache/hbase/blob/branch-2.5/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureEvent.java#L47-L60]
{code:java}
/** Mark the event as not ready. */
public synchronized void suspend() {
  ready = false;
  if (LOG.isTraceEnabled()) {
    LOG.trace("Suspend " + toString());
  }
}

/**
 * Returns true if event is not ready and adds procedure to suspended queue,
 * else returns false.
 */
public synchronized boolean suspendIfNotReady(Procedure proc) {
  if (!ready) {
    suspendedProcedures.addLast(proc);
  }
  return !ready;
}
{code}
Note that event.suspendIfNotReady(proc) adds the procedure to the
suspendedProcedures list only if ready is false.
*Flow of releasing the worker*
Once a SplitWalProcedure completes and its worker is freed, we wake this
event and mark it ready (ready = true).
([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L233-L237])
{code:java}
public void wake(MasterProcedureScheduler scheduler) {
  if (!event.isReady()) {
    event.wake(scheduler);
  }
}{code}
The suspend() and wake() methods in SplitWorkerAssigner are not synchronized,
allowing this race condition.
Race condition timeline (Thread-1 is suspending the procedure, Thread-2 is releasing a worker):
1. Thread-1: suspend() is called.
2. Thread-1: event.suspend() is called.
→ Sets ready = false
3. Thread-1: [THREAD PAUSES before the next line]
4. Thread-2: wake() is called.
5. Thread-2: if (!event.isReady()) → TRUE
6. Thread-2: event.wake(scheduler) is called.
→ Sets ready = true
→ Wakes the suspended queue (empty!)
7. Thread-1: event.suspendIfNotReady(proc) at line 232
→ Checks ready = true
→ Does NOT add the procedure to the suspended queue
8. ProcedureExecutor considers this procedure SUSPENDED, but it is in no
suspended queue, so there is no way to add it back to the scheduler queue.
→ NEVER WOKEN UP!
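The interleaving can be replayed deterministically on a single thread. The sketch below is only an illustration: ToyEvent is a hypothetical, simplified stand-in for ProcedureEvent (procedures are plain strings and wake() returns the woken list instead of calling a scheduler), not the HBase class itself.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class LostWakeupReplay {
  /** Hypothetical, simplified stand-in for ProcedureEvent; not the HBase class. */
  static final class ToyEvent {
    private boolean ready = true;
    private final Deque<String> suspended = new ArrayDeque<>();

    synchronized void suspend() {
      ready = false;
    }

    synchronized boolean suspendIfNotReady(String proc) {
      if (!ready) {
        suspended.addLast(proc);
      }
      return !ready;
    }

    synchronized boolean isReady() {
      return ready;
    }

    /** Marks the event ready and returns every procedure that was parked on it. */
    synchronized List<String> wake() {
      ready = true;
      List<String> woken = new ArrayList<>(suspended);
      suspended.clear();
      return woken;
    }
  }

  /** Replays the timeline steps in order; returns true if the procedure was lost. */
  static boolean replay() {
    ToyEvent event = new ToyEvent();
    List<String> woken = new ArrayList<>();

    event.suspend();                // Thread-1: ready = false
    if (!event.isReady()) {         // Thread-2: check passes
      woken.addAll(event.wake());   // Thread-2: ready = true, queue is empty
    }
    // Thread-1 resumes and now sees ready == true, so nothing is queued.
    boolean queued = event.suspendIfNotReady("split-wal-proc-1");

    // Lost means: neither woken nor left in the suspended queue.
    return !queued && woken.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println("procedure lost: " + replay()); // prints "procedure lost: true"
  }
}
```

Each ToyEvent method is individually synchronized, as in ProcedureEvent, yet the procedure is still lost, because the two-call sequence in suspend() is not atomic as a whole.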
In the 2.6 branch, WorkerAssigner.java was introduced as part of
SnapshotProcedure, and these methods were kept synchronized there, which
solves the problem.
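A minimal sketch of why synchronizing both methods on the same monitor closes the window, assuming a simplified model: ToyAssigner is a hypothetical class written for this illustration (string procedures, an in-memory woken list instead of a scheduler) and is not the actual WorkerAssigner code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class SynchronizedAssignerSketch {
  /** Hypothetical stand-in for the assigner; names mirror the issue, not the HBase API. */
  static final class ToyAssigner {
    private boolean ready = true;
    private final Deque<String> suspended = new ArrayDeque<>();
    private final List<String> woken = new ArrayList<>();

    /** Marking not-ready and enqueueing happen under one lock, so wake() cannot run in between. */
    synchronized void suspend(String proc) {
      ready = false;
      suspended.addLast(proc);
    }

    /** Same monitor as suspend(): sees either no suspension or a fully queued procedure. */
    synchronized void wake() {
      if (!ready) {
        ready = true;
        woken.addAll(suspended);
        suspended.clear();
      }
    }

    synchronized boolean isTracked(String proc) {
      return suspended.contains(proc) || woken.contains(proc);
    }
  }

  /** Races suspend() against wake() repeatedly; returns how often the procedure was lost. */
  static int countLost(int iterations) {
    int lost = 0;
    for (int i = 0; i < iterations; i++) {
      ToyAssigner assigner = new ToyAssigner();
      String proc = "proc-" + i;
      Thread suspender = new Thread(() -> assigner.suspend(proc));
      Thread waker = new Thread(assigner::wake);
      suspender.start();
      waker.start();
      try {
        suspender.join();
        waker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
      if (!assigner.isTracked(proc)) {
        lost++;
      }
    }
    return lost;
  }

  public static void main(String[] args) {
    System.out.println("lost procedures: " + countLost(1000)); // prints "lost procedures: 0"
  }
}
```

Whichever thread wins the lock, the procedure ends up either woken or still queued with ready = false, so a later wake() will find it.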