lhotari commented on PR #25793:
URL: https://github.com/apache/pulsar/pull/25793#issuecomment-4606298536

   The following are findings from a local **Claude Code** review of this PR 
(human-reviewed before posting). Sharing them here in case they're useful — 
please treat them as review input, not blockers to be applied verbatim.
   
   ## Summary
   
   The change does what it claims: automatic offload triggers (from 
`ledgerClosed` and factory init) are funneled through the 
`AUTOMATIC_OFFLOAD_TRIGGER` identity sentinel and gated by two `AtomicBoolean`s 
so at most one automatic offload runs with at most one pending rerun coalesced 
behind it. Explicit/manual callers keep their previous 
`CompletableFuture<Position>` semantics, and the `getOffloadThresholds()` 
extraction + `getOffloadPoliciesIfAppendable()` cleanup correctly remove the 
duplicate `getOffloadPolicies()` calls. Intent matches #25859 well.
   
   There is one concurrency race worth addressing, plus a few 
robustness/quality notes.
   
   ## 1. (Bug) Lost-trigger race between `finishAutomaticOffload` and 
`maybeOffloadInBackground`
   
   The two flags are read/written in a non-atomic check-then-act order:
   
   ```
   // finishAutomaticOffload (offload finishing)        // 
maybeOffloadInBackground (new trigger arrives)
   A: automaticOffloadInProgress.set(false);            C: if 
(!inProgress.compareAndSet(false,true)) {
   B: if (rerunRequested.getAndSet(false)) { rerun }    D:     
rerunRequested.set(true); return; }
   ```
   
   Consider the interleaving `C(fail) → A → B(reads false) → D`:
   
   1. Producer `C`: CAS fails because `inProgress` is still `true` (offload 
still running) → producer will take the `rerunRequested.set(true)` branch.
   2. Consumer `A`: `inProgress = false`.
   3. Consumer `B`: `getAndSet(false)` reads `false` (producer hasn't reached 
`D` yet) → **no rerun scheduled**.
   4. Producer `D`: `rerunRequested = true`.
   
   Final state: `inProgress = false`, `rerunRequested = true`, **no active 
runner**. The coalesced follow-up pass is dropped; it only self-heals on the 
*next* external automatic trigger (next `ledgerClosed`), whose 
`finishAutomaticOffload` then observes the stale `rerunRequested` and does a 
(now likely redundant) pass.
   
   Impact is bounded for a continuously-rolling topic, but for a topic that 
goes idle right after the race, threshold-eligible ledgers can remain 
un-offloaded until the next write/rollover. This breaks the stated invariant 
from #25859 ("After the current automatic offload completes, one follow-up pass 
should run if any trigger arrived meanwhile").
   
   The added tests are deterministic and exercise only the happy path, so none 
reproduces this interleaving. Suggested direction: a single state machine (e.g. 
an `AtomicInteger`/enum state `IDLE → RUNNING → RUNNING_PENDING`) or a 
re-check-and-re-acquire loop in `finishAutomaticOffload` after clearing 
`inProgress`, validated with a stress / `invocationCount` concurrency test.
   
   ## 2. (Quality) Automatic offload failures are now logged at WARN where they 
were previously silenced
   
   Previously every automatic trigger shared the already-completed 
`NULL_OFFLOAD_PROMISE`, so `completeExceptionally(...)` in `maybeOffload` (e.g. 
"offloadPolicies is NULL", thresholds `< 0`) was a no-op and produced no log. 
Now each automatic run gets a fresh completion that drives 
`log.warn(...).log("Failed to automatically offload ledgers")`. This is 
arguably an improvement in visibility, but it is a behavior change: races where 
the offloader is reconfigured/disabled between scheduling and execution will 
now emit WARN noise that didn't exist before. Worth a conscious decision 
(possibly DEBUG, or filtering the "disabled mid-flight" case).
   
   ## 3. (Quality) Any path that fails to complete the internal completion 
future strands `automaticOffloadInProgress = true`
   
   `automaticOffloadInProgress` is only cleared via `finishAutomaticOffload`, 
which only runs when `automaticOffloadCompletion` completes. If 
`executor.execute(...)` throws (e.g. `RejectedExecutionException` on a 
shutting-down executor), the completion is never finished and automatic offload 
is **permanently disabled for that managed ledger**. The old code had no such 
latch — a rejected execute simply dropped one trigger. Low severity (mainly 
shutdown), but a `try/finally` or completing the internal future on the 
rejection path would make it robust.
   
   ## 4. (Quality) Tests rely on timing and don't cover the concurrency race
   
   `automaticOffloadTriggersAreCoalescedWhileOffloadInProgress` uses 
`Thread.sleep(300)` plus an `offloadPolicyCalls < baseline + 5` heuristic — 
fragile under load and not a tight assertion. 
`automaticOffloadRunsAgainForCoalescedTrigger` is the valuable one (verifies 
the follow-up pass picks up the second ledger), but it is deterministic and 
cannot catch finding #1. Consider an explicit concurrency test (many threads 
hammering `maybeOffloadInBackground` while an offload is in flight) to guard 
the coalescing invariant.
   
   ## Minor / non-blocking
   
   - `AUTOMATIC_OFFLOAD_TRIGGER` is a process-wide `public static final` 
sentinel compared by `==` identity across all `ManagedLedgerImpl` instances. 
This is correct (its `Position` value is never consumed; per-instance state 
lives in the `AtomicBoolean`s) and the renamed name + comment make the intent 
clear.
   
   Overall: addressing #1 before merge would be the main ask, since it 
undermines the core guarantee the PR exists to provide; #2–#4 are 
quality/robustness improvements.
   
   <sub>🤖 Generated with a local Claude Code review.</sub>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to