Dandandan opened a new pull request, #23162:
URL: https://github.com/apache/datafusion/pull/23162

   ## Which issue does this PR close?
   
   <!-- No dedicated issue; this is a self-contained performance improvement to 
the
   sort-preserving merge hot path. Happy to file a tracking issue if preferred. 
-->
   
   - N/A (performance)
   
   ## Rationale for this change
   
   The primary cost of `SortPreservingMergeStream` — the loser-tree k-way merge
   behind `SortPreservingMergeExec` and `SortExec`'s streaming/spill merge — is 
the
   per-row comparison loop, **not** the output `interleave`. Profiling the
   `sort_preserving_merge` bench on a single primitive sort column shows ~60% of
   time in the merge loop + cursor comparison and only ~7% in `interleave`.
   
   Two inefficiencies on that hot path:
   
   1. The single-column primitive/array cursor comparison was **not being 
inlined**
      (it appeared as a separate ~21% self-time symbol), and every comparison
      bounds-checked the underlying `ScalarBuffer` twice.
   2. `maybe_poll_stream` was called on **every** output row, even though it is 
a
      no-op whenever the winner's cursor is still live (the common case).
   
   ## What changes are included in this PR?
   
   - Inline the lightweight primitive/array cursor comparisons. `RowValues` is 
left
     **out-of-line on purpose**: its variable-length byte comparison is heavy 
and
     force-inlining it regresses the multi-column path.
   - Skip the per-row `maybe_poll_stream` call unless the winner's cursor is
     actually exhausted and needs a fresh `RecordBatch`.
   - Cache the current (and previous) value of a primitive cursor, refreshed 
once
     per `advance()` via a new `CursorValues::set_offset` hook (default no-op;
     `ArrayValues` forwards to the inner cursor). The hot 
`compare`/`eq_to_previous`
     then read cached fields instead of indexing the buffer, so the 
per-comparison
     bounds checks disappear. The one remaining buffer access (in `set_offset`) 
is
     dominated by the `offset < len` guard in `advance`, so the optimizer elides
     that bounds check too — the buffer length is checked **once per row**, not 
per
     comparison. **No `unsafe` is used**; verified in the disassembly that no
     `cursor.rs` `panic_bounds_check` remains on the hot path.
   
   Benchmarks (`cargo bench --bench sort_preserving_merge`, 3 partitions × 1M 
rows,
   M1 Pro):
   
   | case | change |
   | --- | --- |
   | single u64 column | ~−15% |
   | multiple u64 columns | ~−6% |
   | single large string column | ~−5% |
   
   ## Are these changes tested?
   
   Covered by the existing tests in `sorts/merge.rs`, `sorts/cursor.rs`, and
   `sorts/sort_preserving_merge.rs` (merge ordering, nulls, stable sort,
   round-robin tie-breaker). The new cache invariant is additionally guarded by 
a
   `debug_assert!`, which is exercised by those tests when run with debug
   assertions. Behavior-preserving, so no new tests are added.
   
   ## Are there any user-facing changes?
   
   No.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to