lhotari commented on PR #25793:
URL: https://github.com/apache/pulsar/pull/25793#issuecomment-4606298536
The following are findings from a local **Claude Code** review of this PR
(human-reviewed before posting). Sharing them here in case they're useful —
please treat them as review input, not blockers to be applied verbatim.
## Summary
The change does what it claims: automatic offload triggers (from
`ledgerClosed` and factory init) are funneled through the
`AUTOMATIC_OFFLOAD_TRIGGER` identity sentinel and gated by two `AtomicBoolean`s
so at most one automatic offload runs with at most one pending rerun coalesced
behind it. Explicit/manual callers keep their previous
`CompletableFuture<Position>` semantics, and the `getOffloadThresholds()`
extraction + `getOffloadPoliciesIfAppendable()` cleanup correctly remove the
duplicate `getOffloadPolicies()` calls. Intent matches #25859 well.
There is one concurrency race worth addressing, plus a few
robustness/quality notes.
## 1. (Bug) Lost-trigger race between `finishAutomaticOffload` and
`maybeOffloadInBackground`
The two flags are read/written in a non-atomic check-then-act order:
```
// finishAutomaticOffload (offload finishing) //
maybeOffloadInBackground (new trigger arrives)
A: automaticOffloadInProgress.set(false); C: if
(!inProgress.compareAndSet(false,true)) {
B: if (rerunRequested.getAndSet(false)) { rerun } D:
rerunRequested.set(true); return; }
```
Consider the interleaving `C(fail) → A → B(reads false) → D`:
1. Producer `C`: CAS fails because `inProgress` is still `true` (offload
still running) → producer will take the `rerunRequested.set(true)` branch.
2. Consumer `A`: `inProgress = false`.
3. Consumer `B`: `getAndSet(false)` reads `false` (producer hasn't reached
`D` yet) → **no rerun scheduled**.
4. Producer `D`: `rerunRequested = true`.
Final state: `inProgress = false`, `rerunRequested = true`, **no active
runner**. The coalesced follow-up pass is dropped; it only self-heals on the
*next* external automatic trigger (next `ledgerClosed`), whose
`finishAutomaticOffload` then observes the stale `rerunRequested` and does a
(now likely redundant) pass.
Impact is bounded for a continuously-rolling topic, but for a topic that
goes idle right after the race, threshold-eligible ledgers can remain
un-offloaded until the next write/rollover. This breaks the stated invariant
from #25859 ("After the current automatic offload completes, one follow-up pass
should run if any trigger arrived meanwhile").
The added tests are deterministic and exercise only the happy path, so none
reproduces this interleaving. Suggested direction: a single state machine (e.g.
an `AtomicInteger`/enum state `IDLE → RUNNING → RUNNING_PENDING`) or a
re-check-and-re-acquire loop in `finishAutomaticOffload` after clearing
`inProgress`, validated with a stress / `invocationCount` concurrency test.
## 2. (Quality) Automatic offload failures are now logged at WARN where they
were previously silenced
Previously every automatic trigger shared the already-completed
`NULL_OFFLOAD_PROMISE`, so `completeExceptionally(...)` in `maybeOffload` (e.g.
"offloadPolicies is NULL", thresholds `< 0`) was a no-op and produced no log.
Now each automatic run gets a fresh completion that drives
`log.warn(...).log("Failed to automatically offload ledgers")`. This is
arguably an improvement in visibility, but it is a behavior change: races where
the offloader is reconfigured/disabled between scheduling and execution will
now emit WARN noise that didn't exist before. Worth a conscious decision
(possibly DEBUG, or filtering the "disabled mid-flight" case).
## 3. (Quality) Any path that fails to complete the internal completion
future strands `automaticOffloadInProgress = true`
`automaticOffloadInProgress` is only cleared via `finishAutomaticOffload`,
which only runs when `automaticOffloadCompletion` completes. If
`executor.execute(...)` throws (e.g. `RejectedExecutionException` on a
shutting-down executor), the completion is never finished and automatic offload
is **permanently disabled for that managed ledger**. The old code had no such
latch — a rejected execute simply dropped one trigger. Low severity (mainly
shutdown), but a `try/finally` or completing the internal future on the
rejection path would make it robust.
## 4. (Quality) Tests rely on timing and don't cover the concurrency race
`automaticOffloadTriggersAreCoalescedWhileOffloadInProgress` uses
`Thread.sleep(300)` plus an `offloadPolicyCalls < baseline + 5` heuristic —
fragile under load and not a tight assertion.
`automaticOffloadRunsAgainForCoalescedTrigger` is the valuable one (verifies
the follow-up pass picks up the second ledger), but it is deterministic and
cannot catch finding #1. Consider an explicit concurrency test (many threads
hammering `maybeOffloadInBackground` while an offload is in flight) to guard
the coalescing invariant.
## Minor / non-blocking
- `AUTOMATIC_OFFLOAD_TRIGGER` is a process-wide `public static final`
sentinel compared by `==` identity across all `ManagedLedgerImpl` instances.
This is correct (its `Position` value is never consumed; per-instance state
lives in the `AtomicBoolean`s) and the renamed name + comment make the intent
clear.
Overall: addressing #1 before merge would be the main ask, since it
undermines the core guarantee the PR exists to provide; #2–#4 are
quality/robustness improvements.
<sub>🤖 Generated with a local Claude Code review.</sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]