This is an automated email from the ASF dual-hosted git repository.

lhotari pushed a commit to branch lh-fix-auditor-race-condition
in repository https://gitbox.apache.org/repos/asf/bookkeeper.git

commit 4833e97ed31ed1744e51f778c56477a995992216
Author: Lari Hotari <[email protected]>
AuthorDate: Fri Mar 13 15:39:50 2026 +0200

    [FIX] submitAuditTask() must run full audit to detect bookies not in 
knownBookies
    
    submitAuditTask() was submitting runAuditTask(), which compares
    pendingWritableBookies against knownBookies to detect lost bookies.
    knownBookies is seeded from admin.getAllBookies() at auditor startup, so
    any bookie that registers after the auditor started and dies before being
    processed by runAuditTask() is never added to knownBookies and remains
    invisible to the subtraction-based loss detection.
    
    This caused testEmptyLedgerLosesQuorumEventually to fail when run after
    testNoSuchLedgerExists: bookies 33105 and 36375 registered after auditor
    42899 started, two watcher tasks were queued (one per registration), and
    by the time they ran (after checkAllLedgers completed 112ms later) bookie
    36375 had already died. pendingWritableBookies correctly reflected
    {42899,33105} but since 36375 was never in knownBookies the loss went
    undetected. The explicit submitAuditTask().get() call in the test suffered
    the same gap.
    
    Fix: submitAuditTask() now submits auditorBookieCheckTask.startAudit(false)
    instead of runAuditTask(). The full auditBookies() scan uses ledger metadata
    to find all bookies that own ledgers and checks their availability, 
detecting
    36375 via the ledger it appears in regardless of knownBookies state.
    
    The runAuditTask() / knownBookies mechanism is still correct and sufficient
    for its intended purpose: detecting losses for bookies that were already
    known. The register-and-die edge case is an inherent limitation of that
    lightweight approach and is handled by the periodic auditorBookieCheckTask
    and by explicit submitAuditTask() calls.
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
---
 .../src/main/java/org/apache/bookkeeper/replication/Auditor.java | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git 
a/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java
 
b/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java
index 5d76aaedf6..2be82159e7 100644
--- 
a/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java
+++ 
b/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java
@@ -283,8 +283,13 @@ public class Auditor implements AutoCloseable {
     }
 
     /**
-     * Submit an audit task unconditionally. Used by tests and by the
+     * Submit a full bookie-check audit task unconditionally. Used by tests 
and by the
      * LostBookieRecoveryDelay-changed event handler.
+     *
+     * <p>Runs the full {@code auditBookies()} scan (which uses ledger 
metadata) rather
+     * than the lightweight {@link #runAuditTask()}, so that bookies which 
registered
+     * and died since the auditor started are correctly detected even if they 
were never
+     * added to {@code knownBookies}.
      */
     @VisibleForTesting
     synchronized Future<?> submitAuditTask() {
@@ -293,7 +298,7 @@ public class Auditor implements AutoCloseable {
             f.setException(new BKAuditException("Auditor shutting down"));
             return f;
         }
-        return executor.submit(this::runAuditTask);
+        return executor.submit(() -> auditorBookieCheckTask.startAudit(false));
     }
 
     /**

Reply via email to