void-ptr974 commented on issue #26027: URL: https://github.com/apache/pulsar/issues/26027#issuecomment-4701242613
Benchmark notes from the current draft PR: https://github.com/apache/pulsar/pull/26024

These numbers are intended to compare the current nested-map layout (`oldProduction`) with the draft primitive layout (`production`) in the same local run. They should be read as local data-structure cost, not end-to-end broker latency.

The benchmark datasets are tied to existing defaults:

- `maxUnackedMessagesPerConsumer = 50000`
- `maxUnackedMessagesPerSubscription = 200000`
- `receiverQueueSize = 1000`

The `50k entries / N ledgers` cases represent one consumer near the default per-consumer unacked limit, with the pending window distributed across different numbers of managed-ledger ledgers.

### JMH Time

Steady dispatch + ack cycle:

| Dataset | Old | New | Change |
|---|---:|---:|---:|
| 50k entries / 1 ledger | 204.8 ns/op | 192.9 ns/op | -5.8% |
| 50k entries / 5 ledgers | 181.1 ns/op | 129.5 ns/op | -28.5% |
| 50k entries / 10 ledgers | 189.3 ns/op | 172.8 ns/op | -8.7% |
| 50k entries / 20 ledgers | 265.9 ns/op | 163.7 ns/op | -38.4% |

Dispatch + ack cycle with periodic partial ack:

| Dataset | Old | New | Change |
|---|---:|---:|---:|
| 50k entries / 1 ledger | 216.2 ns/op | 158.4 ns/op | -26.7% |
| 50k entries / 5 ledgers | 185.3 ns/op | 126.1 ns/op | -32.0% |
| 50k entries / 10 ledgers | 178.5 ns/op | 131.5 ns/op | -26.3% |
| 50k entries / 20 ledgers | 180.1 ns/op | 122.1 ns/op | -32.2% |

`removeAllUpTo` across the mark-delete range:

| Dataset | Old | New | Change |
|---|---:|---:|---:|
| 50k entries / 1 ledger | 788.7 us/op | 1903.1 us/op | +141.3% |
| 50k entries / 5 ledgers | 572.1 us/op | 430.7 us/op | -24.7% |
| 50k entries / 10 ledgers | 533.4 us/op | 204.1 us/op | -61.7% |
| 50k entries / 20 ledgers | 487.6 us/op | 122.5 us/op | -74.9% |

Same-ledger small-prefix cleanup:

| Dataset | Old | New | Change |
|---|---:|---:|---:|
| 50k entries / 1 ledger | 43.2 us/op | 63.1 us/op | +46.2% |
| 50k entries / 5 ledgers | 41.0 us/op | 62.0 us/op | +51.3% |
| 50k entries / 10 ledgers | 37.9 us/op | 59.8 us/op | +58.0% |
| 50k entries / 20 ledgers | 36.8 us/op | 64.4 us/op | +74.8% |

The main trade-off is the same-ledger prefix cleanup path. The primitive map removes the inner `TreeMap.headMap(...).clear()` fast path, so the draft adds a per-ledger `BitSet` index to avoid scanning the whole primitive map for common prefix-cleanup cases. It still does not beat the old inner `TreeMap` in the single-ledger prefix case, but it keeps the regression bounded while preserving the gains on common exact-key operations and cross-ledger cleanup.

### JMH Allocation

Allocation was measured with the JMH GC profiler using `gc.alloc.rate.norm` (`B/op`).

Focused hot operations:

| Operation | Dataset | Old | New |
|---|---:|---:|---:|
| `addOrReplace` | 50k / 1 ledger | 47.939 B/op | 0.000225 B/op |
| `removeAndAddRemaining` | 50k / 1 ledger | 87.940 B/op | 0.000544 B/op |
| `updateRemainingUnacked` | 50k / 1 ledger | 47.940 B/op | 0.000257 B/op |

Values below `0.001 B/op` are effectively noise. This is the expected allocation effect of removing boxed/object per-entry updates from the hot operations.
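
As a rough illustration of the two ideas described above (packing the per-entry value into a primitive `long`, and keeping a per-ledger `BitSet` so prefix cleanup does not scan the whole unordered map), here is a minimal sketch. The class `LedgerBucketSketch`, its method names, and the `batchSize`/`remainingUnacked` field pair are hypothetical and are not taken from the draft PR. A boxed `HashMap` stands in for the primitive map only to keep the example self-contained, so this sketch does not reproduce the allocation numbers.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-ledger bucket for illustration; not the class used in the draft PR. */
final class LedgerBucketSketch {
    // Boxed stand-in for the unordered primitive map (entryId -> packed value).
    private final Map<Long, Long> entries = new HashMap<>();
    // Bit i is set while entry (baseEntryId + i) is pending in this bucket.
    private final BitSet present = new BitSet();
    private final long baseEntryId;

    LedgerBucketSketch(long baseEntryId) {
        this.baseEntryId = baseEntryId;
    }

    /** Packs two ints into one long so a value update needs no new value object. */
    static long pack(int batchSize, int remainingUnacked) {
        return ((long) batchSize << 32) | (remainingUnacked & 0xFFFFFFFFL);
    }

    static int batchSize(long packed) {
        return (int) (packed >>> 32);
    }

    static int remainingUnacked(long packed) {
        return (int) packed;
    }

    /** Adds or replaces an entry; assumes entryId >= baseEntryId with an int-sized offset. */
    void addOrReplace(long entryId, int batchSize, int remainingUnacked) {
        entries.put(entryId, pack(batchSize, remainingUnacked));
        present.set(Math.toIntExact(entryId - baseEntryId));
    }

    /** Removes every entry with entryId <= maxEntryId and returns how many were removed. */
    int removeUpTo(long maxEntryId) {
        if (maxEntryId < baseEntryId) {
            return 0;
        }
        int limit = (int) Math.min(maxEntryId - baseEntryId, Integer.MAX_VALUE - 1);
        int removed = 0;
        // Walk only the set bits in the prefix instead of every key in the map.
        for (int i = present.nextSetBit(0); i >= 0 && i <= limit; i = present.nextSetBit(i + 1)) {
            entries.remove(baseEntryId + i);
            removed++;
        }
        present.clear(0, limit + 1);
        return removed;
    }

    int size() {
        return entries.size();
    }

    boolean contains(long entryId) {
        return entries.containsKey(entryId);
    }
}
```

In this sketch the cost of `removeUpTo` follows the number of set bits in the prefix rather than the bucket size, which is the property the index is meant to provide. Each removal is still an individual map operation, which is consistent with an ordered `headMap(...).clear()` remaining faster for the same-ledger case.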

Rolling consumer window allocation:

| Scenario | Dataset | Old | New | Change |
|---|---:|---:|---:|---:|
| dispatch + ack | 50k / 1 ledger | 108.244 B/op | 111.634 B/op | +3.1% |
| dispatch + ack | 50k / 5 ledgers | 111.705 B/op | 80.141 B/op | -28.3% |
| dispatch + ack | 50k / 10 ledgers | 111.410 B/op | 80.148 B/op | -28.1% |
| dispatch + ack | 50k / 20 ledgers | 110.818 B/op | 80.155 B/op | -27.7% |
| dispatch + ack + partial ack | 50k / 1 ledger | 108.456 B/op | 113.002 B/op | +4.2% |
| dispatch + ack + partial ack | 50k / 5 ledgers | 114.686 B/op | 80.138 B/op | -30.1% |
| dispatch + ack + partial ack | 50k / 10 ledgers | 114.371 B/op | 80.150 B/op | -29.9% |
| dispatch + ack + partial ack | 50k / 20 ledgers | 113.740 B/op | 80.158 B/op | -29.5% |

The cleanup benchmark allocation numbers are not listed as remove-only allocation because that benchmark rebuilds the dataset at invocation setup; the reported allocation includes population cost.

### JOL Footprint

JOL was used to measure the retained object graph for the pending-ack map structure. This is not total consumer memory usage; it isolates the data structure changed by the draft. The `New` column includes the primitive map and the BitSet index.

| Scenario | Old | New | Change |
|---|---:|---:|---:|
| 1k entries / 1 ledger | 86.3 KiB | 34.4 KiB | -60.2% |
| 100 entries / 1 ledger | 8.9 KiB | 4.6 KiB | -48.3% |
| 25k entries / 1 ledger | 2.10 MiB | 1.06 MiB | -49.4% |
| 50k entries / 1 ledger | 4.20 MiB | 2.13 MiB | -49.4% |
| 50k entries / 5 ledgers | 4.20 MiB | 1.33 MiB | -68.3% |
| 50k entries / 10 ledgers | 4.20 MiB | 1.33 MiB | -68.3% |
| 50k entries / 20 ledgers | 4.20 MiB | 1.33 MiB | -68.3% |
| 200k entries / 100 ledgers | 16.80 MiB | 6.66 MiB | -60.4% |
| 200k entries / 500 ledgers | 16.84 MiB | 8.37 MiB | -50.3% |

Sparse many-ledger cases are less favorable because each non-empty ledger bucket has fixed primitive-map/index overhead:

| Scenario | Old | New | Change |
|---|---:|---:|---:|
| 1k entries / 100 ledgers | 97.1 KiB | 68.2 KiB | -29.8% |
| 1k entries / 500 ledgers | 140.8 KiB | 340.1 KiB | +141.5% |
| 1k entries / 1000 ledgers | 195.5 KiB | 679.9 KiB | +247.7% |

This is expected for a bucketed primitive structure: dense or moderately distributed pending windows benefit, while one-entry-per-ledger distributions can pay more fixed bucket overhead.

### Retained Capacity

The primitive arrays do not shrink while a ledger bucket remains non-empty. The impact depends on how entries are removed:

| Scenario | Old | New | Notes |
|---|---:|---:|---|
| peak 50k / 1 ledger, then clear all | 216 B | 216 B | bucket is removed |
| peak 50k / 1 ledger, keep 1 entry | 416 B | 2.13 MiB | worst retained-capacity shape |
| peak 50k / 1 ledger, keep 1k entries | 86.3 KiB | 2.13 MiB | same large bucket remains |
| peak 50k / 5 ledgers, keep 1k in last ledger | 86.3 KiB | 272.4 KiB | previous ledgers are released |
| peak 50k / 20 ledgers, keep 1k in last ledger | 86.3 KiB | 68.4 KiB | smaller boundary bucket remains |

In normal mark-delete progress, completed ledger buckets are removed and their arrays are released.
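
The retained-capacity behaviour in the table above can be modelled with a small sketch, assuming a grow-only array per ledger bucket and a parent map that drops a bucket once it is empty. `RetainedCapacitySketch`, its initial capacity of 16, and the doubling growth policy are hypothetical choices for illustration; the draft's actual sizing is not stated in this comment.

```java
import java.util.Arrays;
import java.util.TreeMap;

/** Hypothetical model of grow-only ledger buckets; not the draft PR's data structure. */
final class RetainedCapacitySketch {
    /** One ledger's bucket: a grow-only primitive array plus a live-entry count. */
    private static final class Bucket {
        long[] slots = new long[16];
        int size;

        void add(long entryId) {
            if (size == slots.length) {
                // Capacity doubles on demand and is never reduced afterwards.
                slots = Arrays.copyOf(slots, slots.length * 2);
            }
            slots[size++] = entryId;
        }

        void removeLast() {
            size--; // the array keeps its peak capacity while the bucket lives
        }
    }

    private final TreeMap<Long, Bucket> buckets = new TreeMap<>();

    void add(long ledgerId, long entryId) {
        buckets.computeIfAbsent(ledgerId, id -> new Bucket()).add(entryId);
    }

    /** Removes one entry; drops the whole bucket (and its array) once it is empty. */
    void removeOne(long ledgerId) {
        Bucket bucket = buckets.get(ledgerId);
        if (bucket == null) {
            return;
        }
        bucket.removeLast();
        if (bucket.size == 0) {
            buckets.remove(ledgerId);
        }
    }

    int liveEntries() {
        int total = 0;
        for (Bucket bucket : buckets.values()) {
            total += bucket.size;
        }
        return total;
    }

    /** Total array slots still referenced, regardless of how many are in use. */
    long retainedSlots() {
        long total = 0;
        for (Bucket bucket : buckets.values()) {
            total += bucket.slots.length;
        }
        return total;
    }
}
```

Filling one ledger with 50k entries and draining it to a single entry leaves the full peak-sized array referenced in this model, while removing the last entry releases it, which mirrors the "keep 1 entry" and "clear all" rows.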

The retained-capacity risk is mainly when a very large ledger bucket drains down to a small tail but remains non-empty for a long time.

### Summary

- The common exact-key update operations become effectively allocation-free.
- Rolling dispatch/ack allocation improves by about 28-30% when the default 50k pending window spans 5+ ledgers.
- Retained object footprint drops by about 49-68% for dense or moderately distributed default-pressure datasets.
- Cross-ledger `removeAllUpTo` improves as more complete ledger buckets can be removed cheaply.
- Same-ledger prefix cleanup is the main performance trade-off; the BitSet index is included to contain that regression without making the inner map ordered again.
