Re: [I] [Bug] topic unavailable because topic policy cache loading reader is stuck [pulsar]

via GitHub Sat, 13 Jun 2026 06:48:47 -0700


lhotari commented on issue #25294:
URL: https://github.com/apache/pulsar/issues/25294#issuecomment-4698703441


   Thanks for the very detailed report — the offset/heap details made this much 
easier to narrow down.
   
   I believe this is the same class of stuck-reader bug that was just fixed in 
#25998 ("Fix compacted read could be stuck forever or message loss due to 
cursor mark delete"), merged on June 12.
   
   **Why it matches**
   
   The `__change_events` topic-policy reader is created with 
`readCompacted(true)`, as a non-durable `Exclusive` subscription, and it 
acknowledges **cumulatively on every read**. That's exactly the combination 
#25998 guards against:
   
   When the reader reconnects to the new owner broker after the unload, its 
grouped cumulative ack (default `acknowledgmentGroupTime` = 100ms) for a 
message id whose ledger has already been trimmed/compacted away gets flushed to 
the new owner. The broker calls `cursor.asyncMarkDelete(...)` on that position, 
which advances the cursor's **read position to the next valid managed-ledger 
position — past the compaction horizon**. From then on the read bypasses the 
compacted ledger and goes straight to the (empty) managed cursor, so 
`hasMessageAvailable()` stays `true` while `readNext()` never returns. The 
reader is stuck at that offset forever, and topic-policy initialization (hence 
topic loading) for the namespace never completes. That lines up with your steps 
6–12: the reconnect/resubscribe succeeds, then the reader is stuck at offset 
`x` below the compaction horizon `z`, while healthy readers sit at `y = z`.
   
   #25998 fixes this on the broker side: for a compacted, non-durable, 
single-active-consumer subscription it now ignores a cumulative ack whose 
ledger no longer exists in the managed ledger, so the mark-delete can't jump 
the read position past the horizon.
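
To make the mechanism concrete, here is a toy model of the two behaviours described above. It is only a sketch and not the code from #25998: the class `CumulativeAckGuardSketch` and its methods are hypothetical, positions are reduced to `(ledgerId, entryId)` pairs, and the subscription-type checks (compacted, non-durable, single-active-consumer) are left out.

```java
import java.util.TreeSet;

/** Toy model only; class and method names are hypothetical, not Pulsar's. */
final class CumulativeAckGuardSketch {
    // Ledger ids that still exist in the modelled managed ledger.
    private final TreeSet<Long> liveLedgers = new TreeSet<>();
    // Modelled cursor read position as (ledgerId, entryId).
    private long readLedgerId;
    private long readEntryId;

    CumulativeAckGuardSketch(long readLedgerId, long readEntryId, long... liveLedgerIds) {
        this.readLedgerId = readLedgerId;
        this.readEntryId = readEntryId;
        for (long id : liveLedgerIds) {
            liveLedgers.add(id);
        }
    }

    /** Unguarded: an ack in a removed ledger snaps the read position to the next live ledger. */
    void markDeleteUnguarded(long ackLedgerId, long ackEntryId) {
        if (liveLedgers.contains(ackLedgerId)) {
            moveReadPosition(ackLedgerId, ackEntryId + 1);
            return;
        }
        Long next = liveLedgers.ceiling(ackLedgerId);
        if (next != null) {
            moveReadPosition(next, 0);
        }
    }

    /** Guarded: an ack in a removed ledger is ignored; returns whether the ack was applied. */
    boolean markDeleteGuarded(long ackLedgerId, long ackEntryId) {
        if (!liveLedgers.contains(ackLedgerId)) {
            return false;
        }
        moveReadPosition(ackLedgerId, ackEntryId + 1);
        return true;
    }

    private void moveReadPosition(long ledgerId, long entryId) {
        this.readLedgerId = ledgerId;
        this.readEntryId = entryId;
    }

    String readPosition() {
        return readLedgerId + ":" + readEntryId;
    }

    public static void main(String[] args) {
        // Ledger 3 has been trimmed; only ledgers 10 and 11 remain.
        CumulativeAckGuardSketch unguarded = new CumulativeAckGuardSketch(3, 0, 10, 11);
        unguarded.markDeleteUnguarded(3, 7);
        System.out.println("unguarded: " + unguarded.readPosition()); // 10:0, jumped forward

        CumulativeAckGuardSketch guarded = new CumulativeAckGuardSketch(3, 0, 10, 11);
        boolean applied = guarded.markDeleteGuarded(3, 7);
        System.out.println("guarded: " + guarded.readPosition() + " applied=" + applied); // 3:0 applied=false
    }
}
```

In the unguarded case the stale ack moves the modelled read position into the first surviving ledger, which is the jump that leaves the compacted read stuck. In the guarded case the stale ack is dropped and the read position is unchanged.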
   
   **One thing to confirm:** was a ledger on `__change_events` trimmed/removed 
(retention or compaction) around the two ownership changes? #25998 specifically 
addresses the "ack a position in an already-removed ledger" trigger. If no 
ledger was removed, the stall may have a different cause and we'd want to dig 
further.
   
   **Versions:** per the [release 
policy](https://pulsar.apache.org/contribute/release-policy/), 3.0.x is 
end-of-support and won't receive this fix. The fix is included in the 4.0.12 
(LTS), 4.2.3, and 5.0.0-M1 releases — please upgrade to the latest 4.0.x LTS to 
pick it up.
   
   **On the `initPolicesCache` timeout you proposed:** good suggestion, and 
it's still worth doing as defense-in-depth — #25998 fixes this particular 
stuck-reader trigger but doesn't bound the policy-cache read loop, so any 
*other* future stuck-reader cause could still hang loading. Note there's 
already a 60s `topicLoadTimeoutSeconds` on the topic-load future, but its 
timeout handler only logs — it doesn't evict the poisoned per-namespace entry 
from `policyCacheInitMap` or close the stuck reader, which is why the whole 
namespace stays unavailable until a restart. I'll submit a separate PR for that 
(timeout + cleanup of the poisoned cache entry / stuck reader, plus metrics).
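
As a rough illustration of the "timeout + cleanup" idea, the sketch below bounds the cache-init read and, on failure, evicts the per-namespace entry and closes the reader. It is not the planned PR: `PolicyCacheInitSketch`, `PolicyReader`, `readUntilCaughtUp`, and `isTracked` are hypothetical names, the map only stands in for `policyCacheInitMap`, and metrics are omitted. `CompletableFuture.orTimeout` requires Java 9 or later.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Sketch only; all names here are hypothetical stand-ins, not Pulsar's API. */
final class PolicyCacheInitSketch {

    /** Hypothetical stand-in for the system-topic reader that fills the policy cache. */
    interface PolicyReader {
        CompletableFuture<Void> readUntilCaughtUp();
        void close();
    }

    // Stand-in for the per-namespace init map: namespace -> init future.
    private final ConcurrentHashMap<String, CompletableFuture<Void>> initMap =
            new ConcurrentHashMap<>();

    CompletableFuture<Void> initPolicyCache(String namespace,
                                            Supplier<PolicyReader> readerFactory,
                                            long timeoutMillis) {
        CompletableFuture<Void> init = new CompletableFuture<>();
        CompletableFuture<Void> existing = initMap.putIfAbsent(namespace, init);
        if (existing != null) {
            return existing; // another caller already owns the initialization
        }
        PolicyReader reader = readerFactory.get();
        reader.readUntilCaughtUp()
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                .whenComplete((ignored, err) -> {
                    if (err == null) {
                        init.complete(null);
                        return;
                    }
                    // Evict the poisoned entry and release the reader first,
                    // so a later load attempt can start from scratch.
                    initMap.remove(namespace, init);
                    reader.close();
                    init.completeExceptionally(err);
                });
        return init;
    }

    boolean isTracked(String namespace) {
        return initMap.containsKey(namespace);
    }

    public static void main(String[] args) {
        PolicyCacheInitSketch sketch = new PolicyCacheInitSketch();
        PolicyReader stuck = new PolicyReader() {
            @Override public CompletableFuture<Void> readUntilCaughtUp() {
                return new CompletableFuture<>(); // never completes: a stuck reader
            }
            @Override public void close() {
                System.out.println("reader closed");
            }
        };
        try {
            sketch.initPolicyCache("tenant/ns", () -> stuck, 100).join();
        } catch (CompletionException e) {
            System.out.println("init failed: " + e.getCause().getClass().getSimpleName());
        }
        System.out.println("still tracked: " + sketch.isTracked("tenant/ns"));
    }
}
```

The ordering inside the failure branch is deliberate: the entry is removed and the reader closed before the returned future fails, so a caller that retries on failure does not race with the cleanup.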
   


