Dandandan opened a new pull request, #23221:
URL: https://github.com/apache/datafusion/pull/23221
## Which issue does this PR close?
Follow-up to #23202 (coalesce single-column sort runs). No specific issue.
## Rationale for this change
#23202 coalesces buffered runs into fewer, larger runs to cut merge fan-in,
but
**only for single-column sorts**. Multi-column runs were deliberately left
one
run per batch, because coalescing them and sorting the larger run with
`lexsort_to_indices` *regresses*: `lexsort`'s comparator walks the key
columns
one at a time with per-comparison dynamic dispatch (and heap touches for
strings/dictionaries), and that cost grows super-linearly with run size.
The fix is to change *how* a multi-column run is sorted, not just its size:
for
keys whose leading column is an expensive (variable-length or dictionary)
comparison, encode the key once into the Arrow **row format** and argsort the
row indices with a cheap `memcmp`. This is the same comparison the streaming
merge already uses (`RowCursorStream`), so it is a natural fit — and it
removes
exactly the comparator cost that made coalescing regress. With that in place
we
can coalesce **all** multi-column runs.
A microbenchmark (`sort_indices`, added here) isolates the index-computation
cost — `lexsort_to_indices` vs. row-encode+argsort — by key shape and run
size
(speedup = lexsort / rowsort; >1 means row format is faster):
| key shape | 1k | 64k | 1M |
|---|---|---|---|
| i64 (single) | 0.16× | 0.17× | 0.15× |
| utf8 high (single) | 0.22× | 0.31× | 0.66× |
| utf8 tuple | 1.86× | 2.46× | 3.38× |
| dict tuple | 2.06× | 2.66× | 2.72× |
| utf8 view tuple | 0.67× | 1.10× | 1.48× |
| mixed tuple (i64-leading) | 0.49× | 0.67× | 0.78× |
This is what motivates the gate: row format wins big for string/dict-leading
keys (and more as runs grow), but **loses** for single-column keys and for
primitive-leading keys (where `lexsort` short-circuits on the cheap first
column). So the row-format path is gated to multi-column keys with an
expensive
leading column; everything else keeps `lexsort`.
End-to-end (`cargo bench --bench sort`, the `SortExec` cases, vs. current
main):
| benchmark | speedup |
|---|---|
| sort utf8 tuple 100k / 1M | 1.34× / 1.19× |
| sort utf8 view tuple 100k / 1M | 1.22× / 1.17× |
| sort utf8 dictionary tuple 100k / 1M | 1.23× / 1.74× |
| sort mixed dictionary tuple 100k / 1M | 1.47× / 1.81× |
| sort mixed tuple 100k / 1M | 1.44× / 1.45× |
| sort mixed tuple with utf8 view 100k / 1M | 1.51× / 1.40× |
| sort i64 100k / 1M | ~unchanged (noise) |
String/dict-leading tuples win via coalesce + row-format sort; the
primitive-leading `mixed tuple` wins ~1.4× purely from coalesce + `lexsort`
on
larger runs (it now also gets coalesced). Single-column sorts are unchanged.
## What changes are included in this PR?
- `sorted_indices()` in `sorts/stream.rs`: computes the sorted index
permutation,
choosing the Arrow row format for multi-column keys with an expensive
leading
column (`use_row_format_sort`) and `lexsort_to_indices` otherwise (single
column, primitive-leading, or `fetch`/top-k).
- `IncrementalSortIterator` (the single point every in-memory sort run flows
through) now calls `sorted_indices()`. No merge/cursor changes.
- `in_mem_sort_stream` now coalesces **all** runs, not just single-column
ones.
- New microbenchmark `datafusion/core/benches/sort_indices.rs`.
## Are these changes tested?
Existing `sorts::` unit tests pass (NULLs, sort options, multi-column merge,
spill). Correctness of the row-format ordering is covered by the existing
multi-column sort tests; the new microbenchmark documents the perf rationale.
## Are there any user-facing changes?
No (internal performance change only).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]