Re: [I] [improve][broker] Speed up PendingAcksMap ack tracking with primitive storage and 50%+ lower memory [pulsar]

via GitHub Sun, 14 Jun 2026 01:52:51 -0700


void-ptr974 commented on issue #26027:
URL: https://github.com/apache/pulsar/issues/26027#issuecomment-4701242613


   Benchmark notes from the current draft PR: 
https://github.com/apache/pulsar/pull/26024
   
   These numbers are intended to compare the current nested-map layout 
(`oldProduction`) with the draft primitive layout (`production`) in the same 
local run. They should be read as local data-structure cost, not end-to-end 
broker latency.
   
   The benchmark datasets are tied to existing defaults:
   
   - `maxUnackedMessagesPerConsumer = 50000`
   - `maxUnackedMessagesPerSubscription = 200000`
   - `receiverQueueSize = 1000`
   
   The `50k entries / N ledgers` cases represent one consumer near the default 
per-consumer unacked limit, with the pending window distributed across 
different numbers of managed-ledger ledgers.
   
   ### JMH Time
   
   Steady dispatch + ack cycle:
   
   | Dataset | Old | New | Change |
   |---|---:|---:|---:|
   | 50k entries / 1 ledger | 204.8 ns/op | 192.9 ns/op | -5.8% |
   | 50k entries / 5 ledgers | 181.1 ns/op | 129.5 ns/op | -28.5% |
   | 50k entries / 10 ledgers | 189.3 ns/op | 172.8 ns/op | -8.7% |
   | 50k entries / 20 ledgers | 265.9 ns/op | 163.7 ns/op | -38.4% |
   
   Dispatch + ack cycle with periodic partial ack:
   
   | Dataset | Old | New | Change |
   |---|---:|---:|---:|
   | 50k entries / 1 ledger | 216.2 ns/op | 158.4 ns/op | -26.7% |
   | 50k entries / 5 ledgers | 185.3 ns/op | 126.1 ns/op | -32.0% |
   | 50k entries / 10 ledgers | 178.5 ns/op | 131.5 ns/op | -26.3% |
   | 50k entries / 20 ledgers | 180.1 ns/op | 122.1 ns/op | -32.2% |
   
   `removeAllUpTo` across the mark-delete range:
   
   | Dataset | Old | New | Change |
   |---|---:|---:|---:|
   | 50k entries / 1 ledger | 788.7 us/op | 1903.1 us/op | +141.3% |
   | 50k entries / 5 ledgers | 572.1 us/op | 430.7 us/op | -24.7% |
   | 50k entries / 10 ledgers | 533.4 us/op | 204.1 us/op | -61.7% |
   | 50k entries / 20 ledgers | 487.6 us/op | 122.5 us/op | -74.9% |
   
   Same-ledger small-prefix cleanup:
   
   | Dataset | Old | New | Change |
   |---|---:|---:|---:|
   | 50k entries / 1 ledger | 43.2 us/op | 63.1 us/op | +46.2% |
   | 50k entries / 5 ledgers | 41.0 us/op | 62.0 us/op | +51.3% |
   | 50k entries / 10 ledgers | 37.9 us/op | 59.8 us/op | +58.0% |
   | 50k entries / 20 ledgers | 36.8 us/op | 64.4 us/op | +74.8% |
   
   The main trade-off is the same-ledger prefix cleanup path. The primitive map 
has no equivalent of the inner `TreeMap.headMap(...).clear()` fast path, so the 
draft adds a per-ledger `BitSet` index to avoid scanning the whole primitive map 
for common prefix-cleanup cases. It still does not beat the old inner `TreeMap`: 
same-ledger small-prefix cleanup is about 46-75% slower in every dataset above, 
and single-ledger `removeAllUpTo` is about 141% slower. The index keeps that 
regression bounded while preserving the gains on common exact-key operations 
and cross-ledger cleanup.
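
   The two cleanup styles described above can be sketched roughly as follows. 
This is an illustration under stated assumptions, not the draft PR's code: the 
class and method names are hypothetical, a `HashMap` stands in for the unordered 
primitive map, and bit `i` of the `BitSet` is assumed to mean that entry 
`baseEntryId + i` is present.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: names and layout are assumptions, not the draft PR's code.
final class PrefixCleanupSketch {

    // Ordered inner map: drop every entryId <= upToEntryId through a head view.
    static int clearPrefixOrdered(TreeMap<Long, Integer> entries, long upToEntryId) {
        NavigableMap<Long, Integer> head = entries.headMap(upToEntryId, true);
        int removed = head.size();
        head.clear();
        return removed;
    }

    // Unordered store plus index: bit i set means (baseEntryId + i) is present.
    // The HashMap is a stand-in for an unordered primitive map.
    static int clearPrefixIndexed(Map<Long, Integer> store, BitSet present,
                                  long baseEntryId, long upToEntryId) {
        long offset = upToEntryId - baseEntryId;
        if (offset < 0) {
            return 0; // the bound lies before this bucket's first tracked entry
        }
        int last = (int) Math.min(offset, Integer.MAX_VALUE - 1);
        int removed = 0;
        // Visit only live entries inside the prefix instead of scanning the whole store.
        for (int i = present.nextSetBit(0); i >= 0 && i <= last; i = present.nextSetBit(i + 1)) {
            store.remove(baseEntryId + i);
            removed++;
        }
        present.clear(0, last + 1);
        return removed;
    }

    public static void main(String[] args) {
        TreeMap<Long, Integer> ordered = new TreeMap<>();
        Map<Long, Integer> store = new HashMap<>();
        BitSet present = new BitSet();
        long base = 100;
        for (int i = 0; i < 10; i++) {
            ordered.put(base + i, 1);
            store.put(base + i, 1);
            present.set(i);
        }
        System.out.println(clearPrefixOrdered(ordered, 104));
        System.out.println(clearPrefixIndexed(store, present, base, 104));
    }
}
```

   The indexed variant still pays one exact-key removal per live entry in the 
prefix, which is one plausible reason it trails the ordered view in the 
small-prefix benchmark.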
   
   ### JMH Allocation
   
   Allocation was measured with JMH GC profiler using `gc.alloc.rate.norm` 
(`B/op`).
   
   Focused hot operations:
   
   | Operation | Dataset | Old | New |
   |---|---:|---:|---:|
   | `addOrReplace` | 50k / 1 ledger | 47.939 B/op | 0.000225 B/op |
   | `removeAndAddRemaining` | 50k / 1 ledger | 87.940 B/op | 0.000544 B/op |
   | `updateRemainingUnacked` | 50k / 1 ledger | 47.940 B/op | 0.000257 B/op |
   
   Values below `0.001 B/op` are effectively noise. This is the expected 
allocation effect of removing boxed/object per-entry updates from the hot 
operations.
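
   To make the allocation difference concrete, here is a minimal sketch of an 
in-place primitive update. It is not the draft's data structure: the class name, 
the fixed capacity, the hash constant, and the `Long.MIN_VALUE` sentinel are all 
assumptions chosen to keep the example short. The point is only that an insert 
or overwrite writes into `long[]`/`int[]` slots, whereas a boxed map such as a 
`TreeMap<Long, Integer>` typically allocates key, value, or node objects on 
similar updates.

```java
import java.util.Arrays;

// Illustrative sketch only; not the draft PR's data structure.
final class LongIntTableSketch {
    private static final long EMPTY = Long.MIN_VALUE; // assumption: never a real key
    private final long[] keys;
    private final int[] values;
    private final int mask;
    private int size;

    LongIntTableSketch(int capacity) {
        if (capacity < 2 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two >= 2");
        }
        keys = new long[capacity];
        values = new int[capacity];
        mask = capacity - 1;
        Arrays.fill(keys, EMPTY);
    }

    // Insert or overwrite in place: no boxing and no per-entry node object.
    void put(long key, int value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        int slot = slotFor(key);
        while (keys[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) & mask; // linear probing
        }
        if (keys[slot] == EMPTY) {
            // One slot always stays empty so probes terminate; this sketch never grows.
            if (size == mask) {
                throw new IllegalStateException("table full");
            }
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    int getOrDefault(long key, int fallback) {
        int slot = slotFor(key);
        while (keys[slot] != EMPTY) {
            if (keys[slot] == key) {
                return values[slot];
            }
            slot = (slot + 1) & mask;
        }
        return fallback;
    }

    int size() {
        return size;
    }

    private int slotFor(long key) {
        long mixed = key * 0x9E3779B97F4A7C15L; // multiplicative hash to spread keys
        return ((int) (mixed >>> 32)) & mask;
    }

    public static void main(String[] args) {
        LongIntTableSketch table = new LongIntTableSketch(16);
        table.put(7L, 3);
        table.put(7L, 2); // overwrite reuses the same slot
        System.out.println(table.getOrDefault(7L, -1) + " " + table.size());
    }
}
```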
   
   Rolling consumer window allocation:
   
   | Scenario | Dataset | Old | New | Change |
   |---|---:|---:|---:|---:|
   | dispatch + ack | 50k / 1 ledger | 108.244 B/op | 111.634 B/op | +3.1% |
   | dispatch + ack | 50k / 5 ledgers | 111.705 B/op | 80.141 B/op | -28.3% |
   | dispatch + ack | 50k / 10 ledgers | 111.410 B/op | 80.148 B/op | -28.1% |
   | dispatch + ack | 50k / 20 ledgers | 110.818 B/op | 80.155 B/op | -27.7% |
   | dispatch + ack + partial ack | 50k / 1 ledger | 108.456 B/op | 113.002 B/op | +4.2% |
   | dispatch + ack + partial ack | 50k / 5 ledgers | 114.686 B/op | 80.138 B/op | -30.1% |
   | dispatch + ack + partial ack | 50k / 10 ledgers | 114.371 B/op | 80.150 B/op | -29.9% |
   | dispatch + ack + partial ack | 50k / 20 ledgers | 113.740 B/op | 80.158 B/op | -29.5% |
   
   The cleanup benchmark allocation numbers are not listed as remove-only 
allocation because that benchmark rebuilds the dataset at invocation setup; the 
reported allocation includes population cost.
   
   ### JOL Footprint
   
   JOL was used to measure the retained object graph for the pending-ack map 
structure. This is not total consumer memory usage; it isolates the data 
structure changed by the draft. The `New` column includes the primitive map and 
the BitSet index.
   
   | Scenario | Old | New | Change |
   |---|---:|---:|---:|
   | 100 entries / 1 ledger | 8.9 KiB | 4.6 KiB | -48.3% |
   | 1k entries / 1 ledger | 86.3 KiB | 34.4 KiB | -60.2% |
   | 25k entries / 1 ledger | 2.10 MiB | 1.06 MiB | -49.4% |
   | 50k entries / 1 ledger | 4.20 MiB | 2.13 MiB | -49.4% |
   | 50k entries / 5 ledgers | 4.20 MiB | 1.33 MiB | -68.3% |
   | 50k entries / 10 ledgers | 4.20 MiB | 1.33 MiB | -68.3% |
   | 50k entries / 20 ledgers | 4.20 MiB | 1.33 MiB | -68.3% |
   | 200k entries / 100 ledgers | 16.80 MiB | 6.66 MiB | -60.4% |
   | 200k entries / 500 ledgers | 16.84 MiB | 8.37 MiB | -50.3% |
   
   Sparse many-ledger cases are less favorable because each non-empty ledger 
bucket has fixed primitive-map/index overhead:
   
   | Scenario | Old | New | Change |
   |---|---:|---:|---:|
   | 1k entries / 100 ledgers | 97.1 KiB | 68.2 KiB | -29.8% |
   | 1k entries / 500 ledgers | 140.8 KiB | 340.1 KiB | +141.5% |
   | 1k entries / 1000 ledgers | 195.5 KiB | 679.9 KiB | +247.7% |
   
   This is expected for a bucketed primitive structure: dense or moderately 
distributed pending windows benefit, while one-entry-per-ledger distributions 
can pay more fixed bucket overhead.
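
   As a rough check on the fixed-overhead reading, the `New` column of the 
sparse table can be divided by the ledger count. The helper below is only 
arithmetic on the reported numbers; the class name is hypothetical, and the 
result is an upper bound on per-bucket overhead because it also includes the 
entries themselves and the outer container.

```java
// Back-of-the-envelope arithmetic on the reported JOL numbers; not a measurement.
final class BucketOverheadEstimate {

    // Bytes attributable to each non-empty ledger bucket, given a footprint in KiB.
    static double bytesPerBucket(double footprintKiB, int ledgers) {
        return footprintKiB * 1024.0 / ledgers;
    }

    public static void main(String[] args) {
        // {New footprint in KiB, ledger count}, all with 1k entries.
        double[][] rows = {{68.2, 100}, {340.1, 500}, {679.9, 1000}};
        for (double[] row : rows) {
            int ledgers = (int) row[1];
            System.out.printf("%4d ledgers: ~%.0f B per bucket%n",
                    ledgers, bytesPerBucket(row[0], ledgers));
        }
    }
}
```

   All three rows come out at roughly 700 B per bucket even though the average 
occupancy falls from 10 entries per ledger to 1, which is consistent with (but 
does not prove) a minimum per-bucket allocation dominating these cases.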
   
   ### Retained Capacity
   
   The primitive arrays do not shrink while a ledger bucket remains non-empty. 
The impact depends on how entries are removed:
   
   | Scenario | Old | New | Notes |
   |---|---:|---:|---|
   | peak 50k / 1 ledger, then clear all | 216 B | 216 B | bucket is removed |
   | peak 50k / 1 ledger, keep 1 entry | 416 B | 2.13 MiB | worst retained-capacity shape |
   | peak 50k / 1 ledger, keep 1k entries | 86.3 KiB | 2.13 MiB | same large bucket remains |
   | peak 50k / 5 ledgers, keep 1k in last ledger | 86.3 KiB | 272.4 KiB | previous ledgers are released |
   | peak 50k / 20 ledgers, keep 1k in last ledger | 86.3 KiB | 68.4 KiB | smaller boundary bucket remains |
   
   In normal mark-delete progress, completed ledger buckets are removed and 
their arrays are released. The retained-capacity risk is mainly when a very 
large ledger bucket drains down to a small tail but remains non-empty for a 
long time.
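
   The bucket lifecycle described above can be sketched as follows. This is a 
simplified model, not the draft's implementation: the class and method names 
are hypothetical, a `BitSet` stands in for a bucket's primitive arrays, and 
entry IDs are narrowed to non-negative `int` values for brevity. Earlier ledgers 
are dropped wholesale, while the boundary ledger keeps its storage until it is 
completely empty.

```java
import java.util.BitSet;
import java.util.TreeMap;

// Simplified model of per-ledger buckets; names and types are assumptions.
final class LedgerBucketsSketch {
    // ledgerId -> bucket. The BitSet stands in for a bucket's primitive arrays.
    private final TreeMap<Long, BitSet> buckets = new TreeMap<>();

    void add(long ledgerId, int entryId) {
        if (entryId < 0) {
            throw new IllegalArgumentException("entryId must be non-negative");
        }
        buckets.computeIfAbsent(ledgerId, id -> new BitSet()).set(entryId);
    }

    // Removes every entry positioned at or before (ledgerId, entryId).
    void removeAllUpTo(long ledgerId, int entryId) {
        // Complete earlier ledgers: drop the buckets, releasing their storage.
        buckets.headMap(ledgerId, false).clear();
        BitSet boundary = buckets.get(ledgerId);
        if (boundary == null || entryId < 0) {
            return;
        }
        int toExclusive = entryId == Integer.MAX_VALUE ? entryId : entryId + 1;
        boundary.clear(0, toExclusive);
        // The bucket keeps its capacity until it is fully drained.
        if (boundary.isEmpty()) {
            buckets.remove(ledgerId);
        }
    }

    int bucketCount() {
        return buckets.size();
    }

    int entryCount() {
        int total = 0;
        for (BitSet bucket : buckets.values()) {
            total += bucket.cardinality();
        }
        return total;
    }

    public static void main(String[] args) {
        LedgerBucketsSketch pending = new LedgerBucketsSketch();
        for (long ledger = 1; ledger <= 3; ledger++) {
            for (int entry = 0; entry < 3; entry++) {
                pending.add(ledger, entry);
            }
        }
        pending.removeAllUpTo(3, 0);
        System.out.println(pending.bucketCount() + " bucket(s), " + pending.entryCount() + " entries");
    }
}
```

   In this model, the single-ledger rows in the table correspond to the 
boundary bucket never being dropped, so its peak capacity stays allocated.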
   
   ### Summary
   
   - The common exact-key update operations become effectively allocation-free.
   - Rolling dispatch/ack allocation improves by about 28-30% when the default 
50k pending window spans 5+ ledgers.
   - Retained object footprint drops by about 49-68% for dense or moderately 
distributed default-pressure datasets.
   - Cross-ledger `removeAllUpTo` improves as more complete ledger buckets can 
be removed cheaply (about 25-75% faster at 5-20 ledgers), but the single-ledger 
case is about 141% slower.
   - Same-ledger prefix cleanup is the main performance trade-off; the BitSet 
index is included to contain that regression without making the inner map 
ordered again.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

