lhotari commented on issue #25294:
URL: https://github.com/apache/pulsar/issues/25294#issuecomment-4698703441
Thanks for the very detailed report — the offset/heap details made this much
easier to narrow down.
I believe this is the same class of stuck-reader bug that was just fixed in
#25998 ("Fix compacted read could be stuck forever or message loss due to
cursor mark delete"), merged on June 12.
**Why it matches**
The `__change_events` topic-policy reader is created with
`readCompacted(true)`, as a non-durable `Exclusive` subscription, and it
acknowledges **cumulatively on every read**. That's exactly the combination
#25998 guards against:
When the reader reconnects to the new owner broker after the unload, its
grouped cumulative ack (default `acknowledgmentGroupTime` = 100ms) for a
message id whose ledger has already been trimmed/compacted away gets flushed to
the new owner. The broker calls `cursor.asyncMarkDelete(...)` on that position,
which advances the cursor's **read position to the next valid managed-ledger
position — past the compaction horizon**. From then on the read bypasses the
compacted ledger and goes straight to the (empty) managed cursor, so
`hasMessageAvailable()` stays `true` while `readNext()` never returns. The
reader is stuck at that offset forever, and topic-policy initialization (hence
topic loading) for the namespace never completes. That lines up with your steps
6–12: the reconnect/resubscribe succeeds, then the reader is stuck at offset
`x` below the compaction horizon `z`, while healthy readers sit at `y = z`.
#25998 fixes this on the broker side: for a compacted, non-durable,
single-active-consumer subscription it now ignores a cumulative ack whose
ledger no longer exists in the managed ledger, so the mark-delete can't jump
the read position past the horizon.
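
To make the trigger and the guard concrete, here is a deliberately simplified toy model. It is not Pulsar code: `ToyCursor`, its fields, and the ledger-level granularity are all assumptions made for illustration, and the real logic works on full positions inside the managed ledger and cursor.

```java
import java.util.TreeSet;

/** Toy model only: none of these names are Pulsar APIs. */
final class ToyCursor {
    private final TreeSet<Long> liveLedgers = new TreeSet<>(); // ledgers not yet trimmed
    private final long horizonLedger;                  // ledger holding the compaction horizon
    private final boolean ignoreAcksForRemovedLedgers; // the guard under discussion
    private long readLedger;                           // ledger the next read would start from

    ToyCursor(long readLedger, long horizonLedger, boolean guard, long... live) {
        this.readLedger = readLedger;
        this.horizonLedger = horizonLedger;
        this.ignoreAcksForRemovedLedgers = guard;
        for (long id : live) {
            liveLedgers.add(id);
        }
    }

    /** Applies a cumulative ack; returns false when the ack is dropped. */
    boolean cumulativeAck(long ackedLedger) {
        if (!liveLedgers.contains(ackedLedger)) {
            if (ignoreAcksForRemovedLedgers) {
                return false; // guarded: stale ack is ignored, cursor untouched
            }
            // Unguarded: snap forward to the next ledger that still exists.
            Long next = liveLedgers.ceiling(ackedLedger);
            if (next != null) {
                readLedger = Math.max(readLedger, next);
            }
            return true;
        }
        readLedger = Math.max(readLedger, ackedLedger);
        return true;
    }

    /** Reads at or below the horizon are served from the compacted ledger. */
    boolean servedFromCompactedLedger() {
        return readLedger <= horizonLedger;
    }
}
```

In this model, with ledger 10 trimmed and only ledger 20 still live, `new ToyCursor(10, 10, false, 20).cumulativeAck(10)` moves the read position to ledger 20, beyond the horizon, so later reads skip the compacted data. With the guard enabled, the same ack is dropped and reads keep coming from the compacted ledger.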
**One thing to confirm:** was a ledger on `__change_events` trimmed/removed
(retention or compaction) around the two ownership changes? #25998 specifically
addresses the "ack a position in an already-removed ledger" trigger. If no
ledger was removed, the stall may have a different cause and we'd want to dig
further.
**Versions:** per the [release
policy](https://pulsar.apache.org/contribute/release-policy/), 3.0.x is
end-of-support and won't receive this fix. The fix is included in the 4.0.12
(LTS), 4.2.3, and 5.0.0-M1 releases — please upgrade to the latest 4.0.x LTS to
pick it up.
**On the `initPolicesCache` timeout you proposed:** good suggestion, and
it's still worth doing as defense-in-depth — #25998 fixes this particular
stuck-reader trigger but doesn't bound the policy-cache read loop, so any
*other* future stuck-reader cause could still hang loading. Note there's
already a 60s `topicLoadTimeoutSeconds` on the topic-load future, but its
timeout handler only logs — it doesn't evict the poisoned per-namespace entry
from `policyCacheInitMap` or close the stuck reader, which is why the whole
namespace stays unavailable until a restart. I'll submit a separate PR for that
(timeout + cleanup of the poisoned cache entry / stuck reader, plus metrics).
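
As a rough illustration of the "timeout + cleanup" idea, and not the actual patch, the sketch below bounds an init future and evicts the map entry and closes the reader on failure. `PolicyCacheInitSketch`, `initMap`, `startReadLoop`, and `closeReader` are hypothetical names invented for this example. Only the `java.util.concurrent` calls are real APIs (`orTimeout` needs Java 9+).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch; these names are illustrative, not Pulsar's code. */
final class PolicyCacheInitSketch {
    /** Stand-in for a per-namespace "init in progress" map. */
    final ConcurrentHashMap<String, CompletableFuture<Void>> initMap =
            new ConcurrentHashMap<>();

    CompletableFuture<Void> init(String namespace,
                                 Supplier<CompletableFuture<Void>> startReadLoop,
                                 Runnable closeReader,
                                 long timeout, TimeUnit unit) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        CompletableFuture<Void> existing = initMap.putIfAbsent(namespace, result);
        if (existing != null) {
            return existing; // another caller is already initializing
        }
        try {
            startReadLoop.get()
                    .orTimeout(timeout, unit) // fails with TimeoutException if stuck
                    .whenComplete((ignored, err) -> {
                        if (err == null) {
                            result.complete(null);
                            return;
                        }
                        // Clean up first, so callers that observe the failure
                        // can retry against a fresh entry.
                        initMap.remove(namespace, result);
                        closeReader.run();
                        result.completeExceptionally(err);
                    });
        } catch (RuntimeException e) {
            // The read loop could not even be started: do not leave a poisoned entry.
            initMap.remove(namespace, result);
            result.completeExceptionally(e);
        }
        return result;
    }
}
```

The cleanup runs before `result` is completed exceptionally, so a caller that sees the failure and retries immediately finds no stale entry in the map. Metrics are left out of this sketch.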