mbutrovich opened a new issue, #21543: URL: https://github.com/apache/datafusion/issues/21543
## Context `ExternalSorter` branches on `sort_in_place_threshold_bytes` (default 1MB) in [`in_mem_sort_stream()`](https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs#L689-L721): - **Below 1MB**: concatenate all buffered batches into one `RecordBatch`, sort in place - **Above 1MB**: sort each batch individually, then streaming-merge them This threshold was introduced in May 2023 by @tustvold in #6163 ("Adaptive in-memory sort") with the comment: *"This is a very rough heuristic and likely could be refined further."* It was later extracted to a config option by @alamb in #7130 with the same 1MB default. The default hasn't changed since, though the surrounding sort architecture has evolved significantly: multi-level merge (#15700), chunked sort output (#19494), `IncrementalSortIterator` (#20314), and `PartialSortExec` (#9125). ## Problem The sort-each-batch-then-merge path dominates real workloads because typical in-memory buffer sizes exceed 1MB. In this path, each batch (often 1024–8192 rows) is sorted individually via `lexsort_to_indices` and then merged via `StreamingMergeBuilder`. This means: 1. **Per-batch sort kernels can't amortize overhead.** Row-format sorting (e.g., MSD radix sort on `RowConverter` output, apache/arrow-rs#9683) is 2–3x faster than `lexsort_to_indices` at 32K+ rows, but at 1K–8K rows the `RowConverter` encoding cost dominates. The sort-then-merge path never gives these kernels enough rows to benefit. #21525 attempted to integrate the radix sort kernel into `ExternalSorter` and saw no improvement for this reason. 2. **The concat path is gated on memory, not row count.** The 1MB threshold is a memory proxy, but the actual concern is the temporary 2x memory spike from `concat_batches` (plus `RowConverter` allocation on top). A row-count or batch-count heuristic might be a better fit. 3. **The `sort` benchmark doesn't exercise the merge path.** The benchmark produces 8 partitions of ~12 batches at 1024 rows each. In the `sort partitioned` variant, each partition's ~12K rows (~100KB for integers) falls well below the 1MB threshold, so it always takes the concat-and-sort-in-place path. This means benchmark results don't reflect the sort-then-merge path that dominates at larger data sizes. ## Possible directions These aren't mutually exclusive: - **Raise or rethink the threshold.** The 1MB limit was chosen conservatively. With `IncrementalSortIterator` (added in #20314) now yielding sorted output in chunks, the peak memory of the concat-and-sort path may be more manageable than it was in 2023. Could we raise it, or gate on row count instead? - **Coalesce batches before sorting in the merge path.** When the merge path is taken, we could concatenate small batches into larger ones (e.g., 32K–64K rows) before sorting, giving row-format kernels enough rows to amortize encoding. This also reduces the merge fan-in, which is related to #7181 (cascaded merge for large fan-in). This doesn't require concatenating the entire buffer — just local coalescing. - **Incremental `Rows` encoding.** An arrow-rs API to append to a `Rows` buffer across batches would let `ExternalSorter` amortize `RowConverter` encoding as data arrives, rather than paying it all at sort time. This is a longer-term arrow-rs change. - **Fix the benchmark.** Increase `BATCH_SIZE` and/or `INPUT_SIZE` in `benches/sort.rs` so that `sort partitioned` exercises the sort-then-merge path. ## Related issues - #7181 — Cascaded merge to reduce fan-in when merging many small sorted batches - #19481 — Operators should respect `batch_size` instead of `Emit::All` (large batches flowing into `ExternalSorter` cause memory pressure) - #19679 — Spill batches in chunks to avoid requiring full sort memory upfront ## References - #6163 — Adaptive in-memory sort (introduced the 1MB threshold) - #7130 — Extracted threshold to config - #21525 — Radix sort integration attempt (benchmarks show no improvement due to small batch sizes) - apache/arrow-rs#9683 — MSD radix sort kernel (2–3x faster at 32K+ rows) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
