mbutrovich commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4261195635
## Benchmark Analysis: `externalsorter` vs `main`

Ran `cargo bench --bench sort -p datafusion` on both branches.

### Improvements (numeric, StringView, single-column string)

| Benchmark | externalsorter | main | Speedup |
|:---|---:|---:|---:|
| sort i64 100k | 2.40 ms | 3.88 ms | 1.61x |
| sort f64 100k | 2.61 ms | 3.94 ms | 1.51x |
| sort merge i64 1M | 26.6 ms | 38.5 ms | 1.45x |
| sort i64 1M | 33.8 ms | 45.5 ms | 1.35x |
| sort f64 1M | 35.8 ms | 46.4 ms | 1.30x |
| sort utf8 high cardinality 1M | 59.7 ms | 74.0 ms | 1.24x |
| sort utf8 view tuple 1M | 8.1 ms | 9.1 ms | 1.13x |

### Regressions (multi-column StringArray and Dictionary)

| Benchmark | externalsorter | main | Slowdown |
|:---|---:|---:|---:|
| sort utf8 tuple 100k (3x StringArray) | 23.7 ms | 9.5 ms | 2.5x |
| sort utf8 tuple 1M (3x StringArray) | 272 ms | 111 ms | 2.5x |
| sort mixed dictionary tuple 1M (3x dict + i64) | 283 ms | 108 ms | 2.6x |
| sort utf8 dictionary tuple 1M (3x dict) | 92.4 ms | 78.5 ms | 1.18x |

### Root cause

The coalesce-then-sort pipeline does: coalesce (copy all column data) -> `lexsort_to_indices` -> `take` (random-access scatter-gather) -> chunk back to `batch_size`. With multiple StringArray or Dictionary columns at 32K rows, the `take` scatter-gathers across ~1.9 MB of string heap data, exceeding L2 cache.

Schemas that **don't** regress help pinpoint the cause:

- **Single StringArray** (e.g. `sort utf8 high cardinality 1M`, 1.24x faster): one ~640KB string buffer fits in L2 during `take`
- **StringViewArray** (e.g. `sort utf8 view tuple 1M`, 1.13x faster): `take` copies fixed-size 16-byte view structs, with no heap random access
- **Fixed-width** (i64, f64): coalesce is a memcpy and `take` is sequential, so both are cache-friendly

### Benchmark batch size

The sort benchmark uses `BATCH_SIZE = 1024`, while DataFusion's default is 8192. This makes the coalesce expand 32x (1024 -> 32768) instead of 4x (8192 -> 32768), amplifying the copy cost.
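The scatter-gather behavior described above can be illustrated with a self-contained sketch. This is a simplified model of Arrow's offsets-plus-data string layout, not DataFusion's actual code: each `take` index dereferences into a shared byte heap at an arbitrary offset, which is the random-access pattern that blows out L2 once the heap is large enough.

```rust
// Simplified model of Arrow's variable-length string layout:
// `offsets[i]..offsets[i+1]` delimits row i's bytes inside one shared heap.
struct StringColumn {
    offsets: Vec<usize>,
    data: Vec<u8>,
}

impl StringColumn {
    fn from_strs(rows: &[&str]) -> Self {
        let mut offsets = vec![0];
        let mut data = Vec::new();
        for r in rows {
            data.extend_from_slice(r.as_bytes());
            offsets.push(data.len());
        }
        StringColumn { offsets, data }
    }

    fn value(&self, i: usize) -> &str {
        std::str::from_utf8(&self.data[self.offsets[i]..self.offsets[i + 1]]).unwrap()
    }

    // `take`: gather rows in `indices` order. Each index jumps to an
    // arbitrary offset in the shared heap -- random access that starts
    // missing cache once the heap outgrows L2, which is the regression
    // pattern for multi-column StringArray schemas.
    fn take(&self, indices: &[usize]) -> StringColumn {
        let rows: Vec<&str> = indices.iter().map(|&i| self.value(i)).collect();
        StringColumn::from_strs(&rows)
    }
}

fn main() {
    let col = StringColumn::from_strs(&["banana", "apple", "cherry"]);
    // Indices as a lexicographic sort of the values would produce them.
    let sorted = col.take(&[1, 0, 2]);
    assert_eq!(sorted.value(0), "apple");
    assert_eq!(sorted.value(1), "banana");
    assert_eq!(sorted.value(2), "cherry");
    println!("ok");
}
```

A fixed-width column, by contrast, is gathered with one bounded-stride read per row, and a StringViewArray gathers only the 16-byte view structs, which is why those schemas improve rather than regress.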
Small batches are a valid scenario though: filters, joins, and sources with pushdown can all produce sub-8192 batches.

### Proposed fix

Arrow's `BatchCoalescer` has a `with_biggest_coalesce_batch_size(Some(limit))` option: batches larger than `limit` bypass coalescing and pass through directly. For schemas with non-view variable-length columns (StringArray, BinaryArray, DictionaryArray), set this to `batch_size`. This way:

- Full-size batches (>= 8192 rows) pass through without string copying and are sorted directly as individual runs
- Small batches still coalesce to reduce fan-in
- Fixed-width and StringView schemas keep the full 32K coalesce target unchanged
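The schema-dependent limit above could be decided by a small classifier. The sketch below uses a hypothetical, simplified `SimpleType` enum as a stand-in for Arrow's `DataType` (an assumption for illustration); in real code the returned limit would be handed to `BatchCoalescer::with_biggest_coalesce_batch_size`.

```rust
// Hypothetical, simplified stand-in for Arrow's DataType -- just enough
// variants to express the schema classes discussed above.
#[derive(Clone, Copy, PartialEq, Debug)]
enum SimpleType {
    Int64,
    Float64,
    Utf8,       // StringArray: variable-length heap, regresses under `take`
    Binary,     // BinaryArray: same offsets-plus-heap layout as Utf8
    Dictionary, // DictionaryArray: values buffer scatter-gathered on `take`
    Utf8View,   // StringViewArray: fixed 16-byte views, cache-friendly
}

// Proposed rule: if any column is a non-view variable-length type, cap the
// coalesce target at `batch_size` so full-size batches bypass coalescing;
// otherwise leave the limit unset and keep the full 32K coalesce target.
fn biggest_coalesce_limit(schema: &[SimpleType], batch_size: usize) -> Option<usize> {
    let has_var_len = schema.iter().any(|t| {
        matches!(t, SimpleType::Utf8 | SimpleType::Binary | SimpleType::Dictionary)
    });
    if has_var_len { Some(batch_size) } else { None }
}

fn main() {
    // 3x StringArray schema (the regressing case): cap at batch_size.
    assert_eq!(biggest_coalesce_limit(&[SimpleType::Utf8; 3], 8192), Some(8192));
    // Fixed-width + StringView schema: no cap, full coalesce target.
    assert_eq!(
        biggest_coalesce_limit(&[SimpleType::Int64, SimpleType::Utf8View], 8192),
        None
    );
    println!("ok");
}
```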
