[I] Revisit ExternalSorter sort strategy and sort_in_place_threshold_bytes [datafusion]

via GitHub Fri, 10 Apr 2026 13:18:32 -0700


mbutrovich opened a new issue, #21543:
URL: https://github.com/apache/datafusion/issues/21543

## Context

`ExternalSorter` branches on `sort_in_place_threshold_bytes` (default 1MB)
in
[`in_mem_sort_stream()`](https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs#L689-L721):

- **Below 1MB**: concatenate all buffered batches into one `RecordBatch`,
sort in place
- **Above 1MB**: sort each batch individually, then streaming-merge them

This threshold was introduced in May 2023 by @tustvold in #6163 ("Adaptive
in-memory sort") with the comment: *"This is a very rough heuristic and likely
could be refined further."* It was later extracted to a config option by @alamb
in #7130 with the same 1MB default. The default hasn't changed since, though
the surrounding sort architecture has evolved significantly: multi-level merge
(#15700), chunked sort output (#19494), `IncrementalSortIterator` (#20314), and
`PartialSortExec` (#9125).

## Problem

The sort-each-batch-then-merge path dominates real workloads because typical
in-memory buffer sizes exceed 1MB. In this path, each batch (often 1024–8192
rows) is sorted individually via `lexsort_to_indices` and then merged via
`StreamingMergeBuilder`. This means:

1. **Per-batch sort kernels can't amortize overhead.** Row-format sorting
(e.g., MSD radix sort on `RowConverter` output, apache/arrow-rs#9683) is 2–3x
faster than `lexsort_to_indices` at 32K+ rows, but at 1K–8K rows the
`RowConverter` encoding cost dominates. The sort-then-merge path never gives
these kernels enough rows to benefit. #21525 attempted to integrate the radix
sort kernel into `ExternalSorter` and saw no improvement for this reason.

2. **The concat path is gated on memory, not row count.** The 1MB threshold
is a memory proxy, but the actual concern is the temporary 2x memory spike from
`concat_batches` (plus `RowConverter` allocation on top). A row-count or
batch-count heuristic might be a better fit.

3. **The `sort` benchmark doesn't exercise the merge path.** The benchmark
produces 8 partitions of ~12 batches at 1024 rows each. In the `sort
partitioned` variant, each partition's ~12K rows (~100KB for integers) falls
well below the 1MB threshold, so it always takes the concat-and-sort-in-place
path. This means benchmark results don't reflect the sort-then-merge path that
dominates at larger data sizes.
## Possible directions

These aren't mutually exclusive:

- **Raise or rethink the threshold.** The 1MB limit was chosen
conservatively. With `IncrementalSortIterator` (added in #20314) now yielding
sorted output in chunks, the peak memory of the concat-and-sort path may be
more manageable than it was in 2023. Could we raise it, or gate on row count
instead?

- **Coalesce batches before sorting in the merge path.** When the merge path
is taken, we could concatenate small batches into larger ones (e.g., 32K–64K
rows) before sorting, giving row-format kernels enough rows to amortize
encoding. This also reduces the merge fan-in, which is related to #7181
(cascaded merge for large fan-in). This doesn't require concatenating the
entire buffer — just local coalescing.

- **Incremental `Rows` encoding.** An arrow-rs API to append to a `Rows`
buffer across batches would let `ExternalSorter` amortize `RowConverter`
encoding as data arrives, rather than paying it all at sort time. This is a
longer-term arrow-rs change.

- **Fix the benchmark.** Increase `BATCH_SIZE` and/or `INPUT_SIZE` in
`benches/sort.rs` so that `sort partitioned` exercises the sort-then-merge path.

## Related issues

- #7181 — Cascaded merge to reduce fan-in when merging many small sorted
batches
- #19481 — Operators should respect `batch_size` instead of `Emit::All`
(large batches flowing into `ExternalSorter` cause memory pressure)
- #19679 — Spill batches in chunks to avoid requiring full sort memory
upfront

## References

- #6163 — Adaptive in-memory sort (introduced the 1MB threshold)
- #7130 — Extracted threshold to config
- #21525 — Radix sort integration attempt (benchmarks show no improvement
due to small batch sizes)
- apache/arrow-rs#9683 — MSD radix sort kernel (2–3x faster at 32K+ rows)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Revisit ExternalSorter sort strategy and sort_in_place_threshold_bytes [datafusion]

Reply via email to