mbutrovich opened a new issue, #21543:
URL: https://github.com/apache/datafusion/issues/21543

   ## Context
   
   `ExternalSorter` branches on `sort_in_place_threshold_bytes` (default 1MB) 
in 
[`in_mem_sort_stream()`](https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs#L689-L721):
   
   - **Below 1MB**: concatenate all buffered batches into one `RecordBatch`, 
sort in place
   - **Above 1MB**: sort each batch individually, then streaming-merge them
   
   This threshold was introduced in May 2023 by @tustvold in #6163 ("Adaptive 
in-memory sort") with the comment: *"This is a very rough heuristic and likely 
could be refined further."* It was later extracted to a config option by @alamb 
in #7130 with the same 1MB default. The default hasn't changed since, though 
the surrounding sort architecture has evolved significantly: multi-level merge 
(#15700), chunked sort output (#19494), `IncrementalSortIterator` (#20314), and 
`PartialSortExec` (#9125).
   
   ## Problem
   
   The sort-each-batch-then-merge path dominates real workloads because typical 
in-memory buffer sizes exceed 1MB. In this path, each batch (often 1024–8192 
rows) is sorted individually via `lexsort_to_indices` and then merged via 
`StreamingMergeBuilder`. This means:
   
   1. **Per-batch sort kernels can't amortize overhead.** Row-format sorting 
(e.g., MSD radix sort on `RowConverter` output, apache/arrow-rs#9683) is 2–3x 
faster than `lexsort_to_indices` at 32K+ rows, but at 1K–8K rows the 
`RowConverter` encoding cost dominates. The sort-then-merge path never gives 
these kernels enough rows to benefit. #21525 attempted to integrate the radix 
sort kernel into `ExternalSorter` and saw no improvement for this reason.
   
   2. **The concat path is gated on memory, not row count.** The 1MB threshold 
is a memory proxy, but the actual concern is the temporary 2x memory spike from 
`concat_batches` (plus `RowConverter` allocation on top). A row-count or 
batch-count heuristic might be a better fit.
   
   3. **The `sort` benchmark doesn't exercise the merge path.** The benchmark 
produces 8 partitions of ~12 batches at 1024 rows each. In the `sort 
partitioned` variant, each partition's ~12K rows (~100KB for integers) falls 
well below the 1MB threshold, so it always takes the concat-and-sort-in-place 
path. This means benchmark results don't reflect the sort-then-merge path that 
dominates at larger data sizes.
   ## Possible directions
   
   These aren't mutually exclusive:
   
   - **Raise or rethink the threshold.** The 1MB limit was chosen 
conservatively. With `IncrementalSortIterator` (added in #20314) now yielding 
sorted output in chunks, the peak memory of the concat-and-sort path may be 
more manageable than it was in 2023. Could we raise it, or gate on row count 
instead?
   
   - **Coalesce batches before sorting in the merge path.** When the merge path 
is taken, we could concatenate small batches into larger ones (e.g., 32K–64K 
rows) before sorting, giving row-format kernels enough rows to amortize 
encoding. This also reduces the merge fan-in, which is related to #7181 
(cascaded merge for large fan-in). This doesn't require concatenating the 
entire buffer — just local coalescing.
   
   - **Incremental `Rows` encoding.** An arrow-rs API to append to a `Rows` 
buffer across batches would let `ExternalSorter` amortize `RowConverter` 
encoding as data arrives, rather than paying it all at sort time. This is a 
longer-term arrow-rs change.
   
   - **Fix the benchmark.** Increase `BATCH_SIZE` and/or `INPUT_SIZE` in 
`benches/sort.rs` so that `sort partitioned` exercises the sort-then-merge path.
   
   ## Related issues
   
   - #7181 — Cascaded merge to reduce fan-in when merging many small sorted 
batches
   - #19481 — Operators should respect `batch_size` instead of `Emit::All` 
(large batches flowing into `ExternalSorter` cause memory pressure)
   - #19679 — Spill batches in chunks to avoid requiring full sort memory 
upfront
   
   ## References
   
   - #6163 — Adaptive in-memory sort (introduced the 1MB threshold)
   - #7130 — Extracted threshold to config
   - #21525 — Radix sort integration attempt (benchmarks show no improvement 
due to small batch sizes)
   - apache/arrow-rs#9683 — MSD radix sort kernel (2–3x faster at 32K+ rows)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to