Thanks for bringing up a real problem and driving the work to solve it.

I'd suggest analyzing 3 alternative designs before deciding on the solution.

Alternative 1:
Look into a design that achieves the same outcome of allowing the 
subscription cursor to advance, but without making copies of the messages. 
Instead, the broker could create another subscription to track the slow or 
hot keys. The design could be very similar to diverting to the overflow 
managed ledger, but there would be no need to duplicate the data, which 
avoids the situation where different failure modes cause unnecessary 
complications.

Alternative 2:
Optimize the existing replay queue solution and improve the scalability of 
individualDeletedMessages so that it scales to 1,000,000 ack holes and 
beyond. This would be the simplest solution and would cover most use cases. 
Keeping the solution simple has multiple benefits; for example, backlog 
management doesn't change.

Together with the PIP-430 broker cache (available since 4.1.0), the replay 
queue solution already avoids most unnecessary BK reads when the cache is 
sufficiently tuned for high-scale use cases. If cache hit rates turn out to 
be a problem, the PIP-430 cache could be improved further.

Alternative 3:
The client-side application could detect a hot key on its own, route those 
messages to a separate topic, and acknowledge the original messages.
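To illustrate Alternative 3, here is a minimal sketch of the detection logic 
a client application could run (a hypothetical sliding-window counter I'm 
using as an example; the window size, threshold, and the diversion step in 
the comments are assumptions on my part, not part of the PIP or the Pulsar 
client API):

```python
from collections import Counter, deque

class HotKeyDetector:
    """Flags a key as 'hot' when it appears more than `threshold`
    times within the last `window` observed messages.
    (Illustrative only; both parameters would need tuning per workload.)
    """

    def __init__(self, window=1000, threshold=100):
        self.window = window
        self.threshold = threshold
        self.recent = deque()    # keys of the last `window` messages
        self.counts = Counter()  # per-key counts within the window

    def observe(self, key):
        """Record one message for `key`; return True if the key is now hot."""
        self.recent.append(key)
        self.counts[key] += 1
        if len(self.recent) > self.window:
            old = self.recent.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        return self.counts[key] > self.threshold

# A consumer loop could then divert hot-key messages, e.g. (pseudocode,
# using hypothetical producer/consumer handles):
#   if detector.observe(msg.partition_key()):
#       overflow_producer.send(msg.data(), partition_key=msg.partition_key())
#       consumer.acknowledge(msg)  # unblock the original subscription
```

The point is that the detection and rerouting can live entirely in 
application code, with no broker-side changes.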

Regarding Alternative 2, I believe that individualDeletedMessages can already 
scale to 1,000,000 ack holes and beyond when the broker is properly configured. 
It could be tested with this type of configuration:

managedLedgerMaxUnackedRangesToPersist=1000000
managedLedgerMaxBatchDeletedIndexToPersist=1000000
managedLedgerPersistIndividualAckAsLongArray=true
managedCursorInfoCompressionType=LZ4
managedLedgerInfoCompressionType=LZ4

(The last config is unrelated, but it makes sense to also switch to using 
compression.)

I hope you can analyze these alternatives before we make a decision on 
solving the hot (or slow) key problem. Thank you for focusing on this 
problem!

-Lari

On 2026/05/07 05:18:35 xiangying meng wrote:
> Hi all,
> 
> I'd like to propose PIP-474: Key_Shared Hot Key Overflow Mechanism.
> 
> Key_Shared is Pulsar's only built-in solution for parallel consumption
> with per-key ordering. But it has a critical production issue: a
> single stuck consumer can starve ALL other keys across ALL partitions
> within minutes, due to the containsStickyKeyHash ordering check
> flooding the Replay queue.
> 
> This becomes especially urgent as AI inference workloads adopt MQ as
> their transport layer — slow consumption (seconds per request) plus
> strict per-key ordering is exactly what Key_Shared is designed for,
> yet the hot-key starvation bug makes it unusable in production.
> 
> PIP-474 proposes diverting hot-key messages to an independent Overflow
> ManagedLedger, unblocking Normal Read and mark-delete advancement
> while preserving at-least-once delivery and per-key ordering. Zero
> overhead when no hot keys are present.
> 
> PIP: https://github.com/apache/pulsar/pull/25706
> 
> Feedback welcome.
> 
> Thanks, Xiangying Meng
> 
