Dandandan opened a new pull request, #23162:
URL: https://github.com/apache/datafusion/pull/23162
## Which issue does this PR close?
<!-- No dedicated issue; this is a self-contained performance improvement to
the
sort-preserving merge hot path. Happy to file a tracking issue if preferred.
-->
- N/A (performance)
## Rationale for this change
The primary cost of `SortPreservingMergeStream` — the loser-tree k-way merge
behind `SortPreservingMergeExec` and `SortExec`'s streaming/spill merge — is
the
per-row comparison loop, **not** the output `interleave`. Profiling the
`sort_preserving_merge` bench on a single primitive sort column shows ~60% of
time in the merge loop + cursor comparison and only ~7% in `interleave`.
Two inefficiencies on that hot path:
1. The single-column primitive/array cursor comparison was **not being
inlined**
(it appeared as a separate ~21% self-time symbol), and every comparison
bounds-checked the underlying `ScalarBuffer` twice.
2. `maybe_poll_stream` was called on **every** output row, even though it is
a
no-op whenever the winner's cursor is still live (the common case).
## What changes are included in this PR?
- Inline the lightweight primitive/array cursor comparisons. `RowValues` is
left
**out-of-line on purpose**: its variable-length byte comparison is heavy
and
force-inlining it regresses the multi-column path.
- Skip the per-row `maybe_poll_stream` call unless the winner's cursor is
actually exhausted and needs a fresh `RecordBatch`.
- Cache the current (and previous) value of a primitive cursor, refreshed
once
per `advance()` via a new `CursorValues::set_offset` hook (default no-op;
`ArrayValues` forwards to the inner cursor). The hot
`compare`/`eq_to_previous`
then read cached fields instead of indexing the buffer, so the
per-comparison
bounds checks disappear. The one remaining buffer access (in `set_offset`)
is
dominated by the `offset < len` guard in `advance`, so the optimizer elides
that bounds check too — the buffer length is checked **once per row**, not
per
comparison. **No `unsafe` is used**; verified in the disassembly that no
`cursor.rs` `panic_bounds_check` remains on the hot path.
Benchmarks (`cargo bench --bench sort_preserving_merge`, 3 partitions × 1M
rows,
M1 Pro):
| case | change |
| --- | --- |
| single u64 column | ~−15% |
| multiple u64 columns | ~−6% |
| single large string column | ~−5% |
## Are these changes tested?
Covered by the existing tests in `sorts/merge.rs`, `sorts/cursor.rs`, and
`sorts/sort_preserving_merge.rs` (merge ordering, nulls, stable sort,
round-robin tie-breaker). The new cache invariant is additionally guarded by
a
`debug_assert!`, which is exercised by those tests when run with debug
assertions. Behavior-preserving, so no new tests are added.
## Are there any user-facing changes?
No.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]