lhotari commented on PR #25706: URL: https://github.com/apache/pulsar/pull/25706#issuecomment-4406555710
Responding to the previous comment here.

> Thank you for your feedback. I'll keep my response brief to avoid making the thread too long for others to review and share their opinions.
>
> **Alt 2** does not solve any of the core problems; it only delays their onset. The argument "if it survives the peak, it's enough" does not hold for sustained hot key scenarios. In production, we cannot bet that "sustained slow-consumption hot keys will never occur," especially for critical business workloads. This would undermine users' confidence in Pulsar.

As a general thought, I think building a system that covers 100% of all possible use cases would add complexity, impact reliability, and increase the risk of incidents. For the minority of use cases that aren't covered by the current solution, I believe we can find ways to mitigate the possible issues. That's why I'd lean toward prioritizing improvements to the existing architecture over adding the solution proposed in this PR.

As I mentioned earlier, it's already possible to scale the individual ack holes solution to 1M entries, which would help mitigate the issues further. Storing individual acks could also be improved: the current algorithm of storing everything each time could be changed to an incremental model where only the new acks are stored. Roaring Bitmap is very efficient at performing an AND over multiple separate bitmaps, so this should work nicely. There's nothing preventing us from implementing a solution where there wouldn't be a practical limit on the number of ack holes (say, when the limit is 10M or 100M).

There's already an accepted PIP for this: [PIP-381: Handle large PositionInfo state](https://github.com/apache/pulsar/blob/master/pip/pip-381.md). The PIP-381 implementation wasn't merged because of a previous alternative implementation, [PR 9292](https://github.com/apache/pulsar/pull/9292), which is now the preferred implementation.

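
To make the incremental model mentioned above more concrete, here is a minimal sketch. It is not Pulsar code: `IncrementalAckSketch` and its methods are hypothetical names, `java.util.BitSet` stands in for Roaring Bitmap, and an in-memory list stands in for BookKeeper. It also assumes that each stored delta holds only the newly acked positions, so recovery merges the deltas with a union; a different bitmap representation could call for a different merge operation.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch: persist only the acks added since the previous flush. */
final class IncrementalAckSketch {
    // Stand-in for durable storage (an assumption); each element is one stored delta.
    private final List<long[]> storedDeltas = new ArrayList<>();
    // Acks that are already covered by the stored deltas.
    private final BitSet persisted = new BitSet();
    // All acks seen so far, persisted or not.
    private final BitSet acked = new BitSet();

    void ack(int entryOffset) {
        acked.set(entryOffset);
    }

    /** Stores only the new acks as one delta and returns how many were stored. */
    int flush() {
        BitSet delta = (BitSet) acked.clone();
        delta.andNot(persisted); // keep only acks that were not stored before
        if (delta.isEmpty()) {
            return 0; // nothing new, so nothing is written
        }
        storedDeltas.add(delta.toLongArray());
        persisted.or(delta);
        return delta.cardinality();
    }

    /** Rebuilds the full ack state by merging every stored delta. */
    BitSet recover() {
        BitSet result = new BitSet();
        for (long[] words : storedDeltas) {
            result.or(BitSet.valueOf(words));
        }
        return result;
    }

    int storedDeltaCount() {
        return storedDeltas.size();
    }
}
```

In this sketch the write cost of a flush depends on the number of new acks rather than on the total number of ack holes, which is the property the incremental model is after. A real implementation would also need compaction of old deltas, which is left out here.
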
One notable detail is that PR 9292 is disabled by default in Pulsar 4.0 (`managedLedgerPersistIndividualAckAsLongArray=false`). In Pulsar 4.0, it needs to be enabled separately with `managedLedgerPersistIndividualAckAsLongArray=true`. The existing PR 9292 implementation and the PIP-381 design could be further improved to support a "limitless" implementation that incrementally stores the individual acks into BookKeeper.

On the client application side, for use cases with hot keys, I think there could be ways to identify which keys tend to be hot. In those cases, the application could route the hot keys to separate topics already at produce time. The same could also be done in a Pulsar Function (or another application) that splits hot keys to another topic, although that wouldn't be as efficient as splitting at produce time.

As I mentioned earlier, on the client side there are also multiple ways to scale consumers so they can handle hot/slow keys while still processing other keys. The [virtual threads MessageListener example](https://github.com/lhotari/dss25-mastering-key-ordered-demo/blob/740dd8d1b350b8201266fb62d0456f44806299ba/pulsar-listener/src/main/java/com/github/lhotari/dss25/listener/PulsarListenerApp.java#L145-L188) is one nice possibility.

> **Alt 1** and **Overflow ML** both solve the core problems. Alt 1's cost is storage amplification (the auxiliary cursor's mark-delete cannot advance → entire ledgers are retained) + longer broker restart recovery time (the auxiliary cursor must replay from far behind). Overflow ML's cost is a secondary BK write for hot key data.

This is already optimized so that already acknowledged messages are skipped even when a cursor replays from far behind.

> **Alt 3** is the weakest: the victim (stuck consumer) cannot self-rescue.
>
> I still lean toward **Overflow ML** because it addresses all the core problems. Its cost (writing hot key data to a secondary BK ledger) can be managed through disk capacity planning and expansion.
> Hot keys may persist for a long time, but their data volume is typically a small fraction of total traffic. Meanwhile, its backlog and consumption progress metrics remain clean and straightforward for operators.
>
> Alt 1's storage amplification can also be addressed via disk expansion, but the longer broker restart recovery time is a harder trade-off. Alt 1 could also expose additional metrics for backlog and consumption progress, but that seems more complex and less user-friendly. Overall, Overflow ML provides cleaner operational visibility.
>
> Looking forward to hearing more voices and feedback.

I agree that it feels cleaner as a plan at first thought. The main reason for my pushback is that adding complexity in a distributed system usually impacts reliability by introducing new failure modes. That's why I'm pushing back, so that we also look into other solutions that could improve the existing architecture and still meet the main goal of this PIP.

First, we should agree on the problem statement and address the problems specifically:

- hot or slow keys
  - will cause head-of-line blocking for other keys assigned to the consumer, due to either serial processing in the consumer or the consumer eventually running out of permits
  - this will result in "ack holes", which increase the individual ack size
  - the number of "ack holes" gets amplified by head-of-line blocking, since the keys blocked behind a hot or slow key will also remain unacked
  - when the persistent limit for individual acks is reached, message delivery will eventually stop for all consumers

By taking a look at each problem and listing the possible alternatives, we could make improvements that result in a completely different solution than the "Overflow ML" one. There could be better ways to address the root causes of the problem with minimal added complexity.

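
The amplification effect in the problem statement above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a simulation of the broker: `AckHoleSketch`, the key layout (`position % 4`), the assignment of keys 0 and 2 to one consumer, and the assumption that key 0 is stuck from the first message are all made up for illustration.

```java
/** Hypothetical toy model of how head-of-line blocking multiplies ack holes. */
final class AckHoleSketch {

    /** Counts maximal runs of unacked entries that lie before the last acked entry. */
    static int countAckHoles(boolean[] acked) {
        int lastAcked = -1;
        for (int i = 0; i < acked.length; i++) {
            if (acked[i]) {
                lastAcked = i;
            }
        }
        int holes = 0;
        boolean inHole = false;
        for (int i = 0; i <= lastAcked; i++) {
            if (!acked[i] && !inHole) {
                holes++; // a new run of unacked entries starts here
            }
            inHole = !acked[i];
        }
        return holes;
    }

    /**
     * Message at position p has key p % 4. Keys 0 and 2 go to consumer A, keys 1 and 3
     * to consumer B. Key 0 is stuck. With serial processing, consumer A acks nothing;
     * otherwise only key 0 stays unacked.
     */
    static boolean[] simulate(int messages, boolean serialConsumer) {
        boolean[] acked = new boolean[messages];
        for (int pos = 0; pos < messages; pos++) {
            int key = pos % 4;
            boolean onConsumerA = key == 0 || key == 2;
            boolean stuckKey = key == 0;
            boolean blocked = serialConsumer ? onConsumerA : stuckKey;
            acked[pos] = !blocked;
        }
        return acked;
    }

    public static void main(String[] args) {
        // Serial consumer: key 2 is blocked behind key 0, so 6 holes in 12 messages.
        System.out.println(countAckHoles(simulate(12, true)));
        // Per-key concurrency: only key 0 remains unacked, so 3 holes.
        System.out.println(countAckHoles(simulate(12, false)));
    }
}
```

In this model the hole count doubles purely because an unrelated key is blocked behind the stuck one, which is why addressing head-of-line blocking also reduces pressure on individual ack storage.
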
Besides solving the hot/slow key problems for Key_Shared, there could be broader benefits if we improve individual ack handling and address head-of-line blocking issues that may also affect other subscription types (Shared, Exclusive, Failover).

Head-of-line blocking is also a real issue for Shared subscriptions. In extreme cases, users currently work around it for Shared subscriptions by setting the receiver queue size to 0 or 1. A very small receiver queue size kills performance and doesn't work efficiently with workloads that mix fast and slow consumers, or with workloads containing messages whose processing durations vary widely.
