nodece opened a new pull request, #26010:
URL: https://github.com/apache/pulsar/pull/26010

   ### Motivation
   
   `TripleLongPriorityQueue` is the core data structure for Pulsar's delayed 
message delivery, storing `(deliverAt, ledgerId, entryId)` tuples in a min-heap 
backed by `SegmentedLongArray` (direct memory, 16MB segments). CPU profiling 
shows `siftUp`/`siftDown` consuming ~37% of hot thread time during delayed 
message processing.
   
   The original implementation has two performance issues:
   
   1. **Redundant readLong calls**: `compare(tupleIdx1, tupleIdx2)` reads 6 
longs per comparison, but the current node's values are already known from the 
prior iteration — they're re-read from `SegmentedLongArray` unnecessarily.
   2. **Swap-based sift**: Each heap layer swap performs 6 readLong + 6 
writeLong, but a hole-based approach only needs 3 writeLong per layer.
   
   Each `readLong` on `SegmentedLongArray` incurs 3 layers of indirection 
(division + `ArrayList.get()` + `ByteBuf.getLong()`), so reducing readLong 
calls directly translates to throughput gains.
   
   ### Modifications
   
   Rewrote `siftUp` and `siftDown` using the hole-based algorithm with 
register-cached values:
   
   - **`siftUp(tupleIdx, n1, n2, n3)`**: Passes the inserted element's values 
as parameters. Each layer only reads the parent's 3 longs (was: 6 longs for 
self + parent), then writes the parent down. The displaced value lives in local 
variables and is written once at the final position.
   - **`siftDown(tupleIdx, val0, val1, val2)`**: Pop reads the last tuple and 
passes it directly. Each layer reads left + right child's 3 longs each (was: 6 
longs for self + left + right), promotes the smaller child, and writes the 
displaced value at the leaf.
   - **`add()`**: Writes tuple to array directly, passes cached values to 
`siftUp`.
   - **`pop()`**: Reads last tuple, passes to `siftDown(0, n1, n2, n3)`.
   - Removed `compare()`, `swap()`, `put()` methods.
   - `>>> 1` instead of `/ 2` for unsigned shift.
   
   Per-layer readLong reduction:
   
   | Operation | Before | After |
   |-----------|--------|-------|
   | siftUp per layer | 12 | 3 |
   | siftDown per layer | 24 | 6 |
   
   Benchmark results on `SegmentedLongArray`:
   
   | Dataset size | Before (AVG) | After (AVG) | Improvement |
   |---|---|---|---|
   | 50K | 38,690 us | 34,220 us | **11.5%** |
   | 500K | 525,517 us | 461,174 us | **12.2%** |
   | 2M | 2,136,791 us | 1,704,437 us | **20.2%** |
   
   ### Further optimization
   
   The remaining performance gap is dominated by `SegmentedLongArray`'s 
per-readLong indirection cost. Replacing `SegmentedLongArray` with a flat 
`long[]` eliminates the segment lookup entirely — each `readLong` becomes a 
single array load with hardware prefetcher support. Benchmark comparison shows 
`long[]`-backed implementation is **~3x faster** than `SegmentedLongArray` at 
the same heap operations. This could be considered as a follow-up if the direct 
memory trade-off (GC pressure vs. off-heap) is acceptable for the delayed 
delivery use case.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to