This is an automated email from the ASF dual-hosted git repository. lhotari pushed a commit to branch lh-fix-auditor-race-condition in repository https://gitbox.apache.org/repos/asf/bookkeeper.git
commit 4833e97ed31ed1744e51f778c56477a995992216 Author: Lari Hotari <[email protected]> AuthorDate: Fri Mar 13 15:39:50 2026 +0200 [FIX] submitAuditTask() must run full audit to detect bookies not in knownBookies submitAuditTask() was submitting runAuditTask(), which compares pendingWritableBookies against knownBookies to detect lost bookies. knownBookies is seeded from admin.getAllBookies() at auditor startup, so any bookie that registers after the auditor started and dies before being processed by runAuditTask() is never added to knownBookies and remains invisible to the subtraction-based loss detection. This caused testEmptyLedgerLosesQuorumEventually to fail when run after testNoSuchLedgerExists: bookies 33105 and 36375 registered after auditor 42899 started, two watcher tasks were queued (one per registration), and by the time they ran (after checkAllLedgers completed 112ms later) bookie 36375 had already died. pendingWritableBookies correctly reflected {42899,33105} but since 36375 was never in knownBookies the loss went undetected. The explicit submitAuditTask().get() call in the test suffered the same gap. Fix: submitAuditTask() now submits auditorBookieCheckTask.startAudit(false) instead of runAuditTask(). The full auditBookies() scan uses ledger metadata to find all bookies that own ledgers and checks their availability, detecting 36375 via the ledger it appears in regardless of knownBookies state. The runAuditTask() / knownBookies mechanism is still correct and sufficient for its intended purpose: detecting losses for bookies that were already known. The register-and-die edge case is an inherent limitation of that lightweight approach and is handled by the periodic auditorBookieCheckTask and by explicit submitAuditTask() calls. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> --- .../src/main/java/org/apache/bookkeeper/replication/Auditor.java | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java b/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java index 5d76aaedf6..2be82159e7 100644 --- a/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java +++ b/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java @@ -283,8 +283,13 @@ public class Auditor implements AutoCloseable { } /** - * Submit an audit task unconditionally. Used by tests and by the + * Submit a full bookie-check audit task unconditionally. Used by tests and by the * LostBookieRecoveryDelay-changed event handler. + * + * <p>Runs the full {@code auditBookies()} scan (which uses ledger metadata) rather + * than the lightweight {@link #runAuditTask()}, so that bookies which registered + * and died since the auditor started are correctly detected even if they were never + * added to {@code knownBookies}. */ @VisibleForTesting synchronized Future<?> submitAuditTask() { @@ -293,7 +298,7 @@ public class Auditor implements AutoCloseable { f.setException(new BKAuditException("Auditor shutting down")); return f; } - return executor.submit(this::runAuditTask); + return executor.submit(() -> auditorBookieCheckTask.startAudit(false)); } /**
