mbutrovich commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4261195635
## Benchmark Analysis: `externalsorter` vs `main`

Ran `cargo bench --bench sort -p datafusion` on both branches.

### Improvements (numeric, StringView, single-column string)

| Benchmark | externalsorter | main | Speedup |
|:---|---:|---:|---:|
| sort i64 100k | 2.40 ms | 3.88 ms | 1.61x |
| sort f64 100k | 2.61 ms | 3.94 ms | 1.51x |
| sort merge i64 1M | 26.6 ms | 38.5 ms | 1.45x |
| sort i64 1M | 33.8 ms | 45.5 ms | 1.35x |
| sort f64 1M | 35.8 ms | 46.4 ms | 1.30x |
| sort utf8 high cardinality 1M | 59.7 ms | 74.0 ms | 1.24x |
| sort utf8 view tuple 1M | 8.1 ms | 9.1 ms | 1.13x |

### Regressions (multi-column StringArray and Dictionary)

| Benchmark | externalsorter | main | Slowdown |
|:---|---:|---:|---:|
| sort utf8 tuple 100k (3x StringArray) | 23.7 ms | 9.5 ms | 2.5x |
| sort utf8 tuple 1M (3x StringArray) | 272 ms | 111 ms | 2.5x |
| sort mixed dictionary tuple 1M (3x dict + i64) | 283 ms | 108 ms | 2.6x |
| sort utf8 dictionary tuple 1M (3x dict) | 92.4 ms | 78.5 ms | 1.18x |

### Root cause

The coalesce-then-sort pipeline does: coalesce (copy all column data) -> `lexsort_to_indices` -> `take` (random-access scatter-gather) -> chunk back to `batch_size`. With multiple StringArray or Dictionary columns at 32K rows, the `take` scatter-gathers across ~1.9 MB of string heap data, exceeding L2 cache.

Schemas that **don't** regress help pinpoint the cause:

- **Single StringArray** (e.g. `sort utf8 high cardinality 1M`, 1.24x faster): one ~640KB string buffer fits in L2 during `take`
- **StringViewArray** (e.g. `sort utf8 view tuple 1M`, 1.13x faster): `take` copies fixed-size 16-byte view structs, with no heap random access
- **Fixed-width** (i64, f64): coalesce is a memcpy and `take` is sequential, so both are cache-friendly

### Benchmark batch size

The sort benchmark uses `BATCH_SIZE = 1024`, while DataFusion's default is 8192. This makes the coalesce expand 32x (1024 -> 32768) instead of 4x (8192 -> 32768), amplifying the copy cost.
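The scatter-gather behavior described above can be illustrated with a self-contained sketch. This is a simplified model of Arrow's offsets-plus-data string layout, not DataFusion's actual code: each `take` index dereferences into a shared byte heap at an arbitrary offset, which is the random-access pattern that blows out L2 once the heap is large enough.

```rust
// Simplified model of Arrow's variable-length string layout:
// `offsets[i]..offsets[i+1]` delimits row i's bytes inside one shared heap.
struct StringColumn {
    offsets: Vec<usize>,
    data: Vec<u8>,
}

impl StringColumn {
    fn from_strs(rows: &[&str]) -> Self {
        let mut offsets = vec![0];
        let mut data = Vec::new();
        for r in rows {
            data.extend_from_slice(r.as_bytes());
            offsets.push(data.len());
        }
        StringColumn { offsets, data }
    }

    fn value(&self, i: usize) -> &str {
        std::str::from_utf8(&self.data[self.offsets[i]..self.offsets[i + 1]]).unwrap()
    }

    // `take`: gather rows in `indices` order. Each index jumps to an
    // arbitrary offset in the shared heap -- random access that starts
    // missing cache once the heap outgrows L2, which is the regression
    // pattern for multi-column StringArray schemas.
    fn take(&self, indices: &[usize]) -> StringColumn {
        let rows: Vec<&str> = indices.iter().map(|&i| self.value(i)).collect();
        StringColumn::from_strs(&rows)
    }
}

fn main() {
    let col = StringColumn::from_strs(&["banana", "apple", "cherry"]);
    // Indices as a lexicographic sort of the values would produce them.
    let sorted = col.take(&[1, 0, 2]);
    assert_eq!(sorted.value(0), "apple");
    assert_eq!(sorted.value(1), "banana");
    assert_eq!(sorted.value(2), "cherry");
    println!("ok");
}
```

A fixed-width column, by contrast, is gathered with one bounded-stride read per row, and a StringViewArray gathers only the 16-byte view structs, which is why those schemas improve rather than regress.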
Small batches are a valid scenario though: filters, joins, and sources with pushdown can all produce sub-8192 batches.

### Proposed fix

Arrow's `BatchCoalescer` has a `with_biggest_coalesce_batch_size(Some(limit))` option: batches larger than `limit` bypass coalescing and pass through directly. For schemas with non-view variable-length columns (StringArray, BinaryArray, DictionaryArray), set this to `batch_size`. This way:

- Full-size batches (>= 8192 rows) pass through without string copying and are sorted directly as individual runs
- Small batches still coalesce to reduce fan-in
- Fixed-width and StringView schemas keep the full 32K coalesce target unchanged
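The schema-dependent limit above could be decided by a small classifier. The sketch below uses a hypothetical, simplified `SimpleType` enum as a stand-in for Arrow's `DataType` (an assumption for illustration); in real code the returned limit would be handed to `BatchCoalescer::with_biggest_coalesce_batch_size`.

```rust
// Hypothetical, simplified stand-in for Arrow's DataType -- just enough
// variants to express the schema classes discussed above.
#[derive(Clone, Copy, PartialEq, Debug)]
enum SimpleType {
    Int64,
    Float64,
    Utf8,       // StringArray: variable-length heap, regresses under `take`
    Binary,     // BinaryArray: same offsets-plus-heap layout as Utf8
    Dictionary, // DictionaryArray: values buffer scatter-gathered on `take`
    Utf8View,   // StringViewArray: fixed 16-byte views, cache-friendly
}

// Proposed rule: if any column is a non-view variable-length type, cap the
// coalesce target at `batch_size` so full-size batches bypass coalescing;
// otherwise leave the limit unset and keep the full 32K coalesce target.
fn biggest_coalesce_limit(schema: &[SimpleType], batch_size: usize) -> Option<usize> {
    let has_var_len = schema.iter().any(|t| {
        matches!(t, SimpleType::Utf8 | SimpleType::Binary | SimpleType::Dictionary)
    });
    if has_var_len { Some(batch_size) } else { None }
}

fn main() {
    // 3x StringArray schema (the regressing case): cap at batch_size.
    assert_eq!(biggest_coalesce_limit(&[SimpleType::Utf8; 3], 8192), Some(8192));
    // Fixed-width + StringView schema: no cap, full coalesce target.
    assert_eq!(
        biggest_coalesce_limit(&[SimpleType::Int64, SimpleType::Utf8View], 8192),
        None
    );
    println!("ok");
}
```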
