mbutrovich commented on issue #21543:
URL: https://github.com/apache/datafusion/issues/21543#issuecomment-4264674154

   I worked on several implementations tonight, summarizing findings (thanks 
Claude):
   
   > ## ExternalSorter optimization — full summary of learnings
   > 
   > ### The problem
   > 
   > ExternalSorter sorts each incoming batch individually, producing one 
sorted run per batch. At scale (TPC-H SF10, 60M rows), this creates ~7300 
sorted runs with high merge fan-in. Coalescing batches before sorting reduces 
fan-in and gave 1.2-1.5x TPC-H speedups — but introduced regressions for 
multi-column string sorts.
   > 
   > ### Iterations
   > 
   > **externalsorter (PR #21629)**: BatchCoalescer coalesces all columns to 
32K → lexsort → take.
   > - Fixed-width: 1.3-1.6x faster. StringView: 1.1-1.2x faster. TPC-H: 
1.08-1.51x faster.
   > - Multi-column StringArray: 2.5x slower. Dictionary: 2.6x slower.
   > - Root cause: `take` scatter-gathers across ~1.9MB of string heap at 32K 
rows, exceeding L2 cache.
   > 
   > **externalsorter2**: Key-only extraction + interleave. Extract sort-key 
columns (promote to StringView), concat keys only, sort, translate indices to 
(batch_idx, row_idx) pairs, `interleave_record_batch` on original batches.
   > - Fixed single-column regressions vs es1.
   > - Multi-column still slow: `interleave_record_batch` is expensive for 
StringArray (same scatter-gather) and DictionaryArray (dictionary merging + key 
remapping per chunk).
   > - Learning: existing benchmarks sort on ALL columns, so key-only 
extraction provided zero benefit for them. Need wide-schema benchmarks to 
measure the real benefit.
   > 
   > **externalsorter3**: Internal StringView representation. Convert ALL 
columns StringArray→StringView on input, concat, sort, take on views (16-byte 
copies), convert back at output boundary.
   > - Uniformly better than es2 — removed interleave overhead, direct take 
instead.
   > - Big win: `sort utf8 low cardinality` 1.78x faster than main (StringView 
prefix comparisons for short inlined strings).
   > - Multi-column tuples still 1.5-2.6x slower than main.
   > 
   > **externalsorter4**: Key-only extraction + RowConverter sort + hybrid 
reconstruction. Extract key columns, concat, sort via RowConverter (for 
multi-column varlen) or lexsort (for single-col/fixed-width), take keys from 
concat batch, interleave values from originals.
   > - RowConverter cut `sort utf8 tuple` from 180→144ms (22% vs es3). `sort 
utf8 view tuple` from 156→123ms (21%).
   > - Fixed-width unchanged at 1.31x faster than main.
   > - Multi-column strings now 1.22-1.30x slower than main (down from 2.5x at 
start).
   > 
   > ### Key technical findings
   > 
   > **1. `take` on StringArray at large batch sizes is cache-hostile**
   > 
   > `take` with a random permutation scatter-gathers across the string 
offset/data buffers. At 32K rows × 3 StringArray columns ≈ 1.9MB — exceeds L2 
cache. At 1024 rows × 3 columns ≈ 30KB — fits in L1. This is the fundamental 
source of the multi-column string regression.
   > 
   > **2. Multi-column sort regression scales superlinearly with column count**
   > 
   > | Benchmark | columns | main (ms) | es3 (ms) |
   > |:---|---:|---:|---:|
   > | sort utf8 view high card | 1 | 62.3 | 58.8 (faster) |
   > | sort utf8 view tuple | 3 | 100.6 | 156.1 (slower) |
   > 
   > 1 column wins, 3 columns loses badly. The cause: `lexsort_to_indices` with 
multiple columns does cascading comparisons. With low-cardinality first 
columns, most comparisons cascade to column 2 and 3. At 32K rows the sort 
algorithm's random access pattern crosses cache boundaries on every comparison 
across 3 separate column arrays.
   > 
   > **3. RowConverter encoding solves the multi-column cache problem**
   > 
   > RowConverter encodes all sort key columns into one contiguous buffer of 
binary-comparable rows. The sort operates on this single buffer instead of 
jumping between separate column arrays. Cache prefetcher works because access 
stays within one memory region. arrow-rs#9683 benchmarks confirm: 
`lexsort_rows` is 1.3-2.5x faster than `lexsort_to_indices` for multi-column 
string/dictionary schemas at 32K rows.
   > 
   > **4. But RowConverter is slower for single-column and fixed-width sorts**
   > 
   > Encoding overhead exceeds the benefit when `lexsort_to_indices` can use 
SIMD comparisons on native column data. Schema-based kernel selection is 
necessary: RowConverter for multi-column varlen, lexsort for everything else.
   > 
   > **5. GC on StringView concat batches doesn't help**
   > 
   > We hypothesized that compacting 32 backing buffers to 1 would improve sort 
locality. Benchmarks showed no improvement — the cache problem is the total 
working set size (960KB for 32K × 3 columns), not the number of buffers.
   > 
   > **6. StringView promotion helps for short strings, not long strings**
   > 
   > StringView inlines strings ≤12 bytes in the 16-byte view struct — no 
pointer chase needed. For strings >12 bytes, comparison still chases a pointer, 
making StringView no better than StringArray. The benchmark's low-cardinality 
values ("value0"–"value99", 6-8 chars) benefit from inlining. High-cardinality 
values (20-char random strings) don't.
   > 
   > **7. Non-key columns should never be converted or concatenated**
   > 
   > Converting/concatenating value columns that aren't sort keys is pure 
overhead. The key-only approach (es2/es4) avoids this. For wide schemas (small 
key + large value), the savings are significant.
   > 
   > **8. `interleave_record_batch` is expensive for DictionaryArray**
   > 
   > `interleave_dictionaries` merges dictionary values across all input 
batches and recomputes key mappings per chunk. For 32 input batches × 3 
dictionary columns × 4 chunks per run, this is a lot of dictionary merging work 
that `take` on a single batch doesn't need.
   > 
   > **9. Dictionary encoding is increasingly irrelevant**
   > 
   > Comet unpacks dictionaries to StringArray anyway. DataFusion native uses 
StringView from Parquet. Dictionary sort performance doesn't represent real 
workloads.
   > 
   > **10. The benchmark's BATCH_SIZE=1024 is pessimistic**
   > 
   > With 1024-row input batches, per-batch sort on main is very cheap (30KB 
working set fits in L1). Coalescing accumulates 32 batches per run, adding 
significant per-run overhead (32 key extractions, 32-batch concat). At 
DataFusion's default 8192-row batches, only 4 batches per run — much less 
overhead, and main's per-batch sort has a larger working set (less of a cache 
advantage).
   > 
   > **11. The merge uses RowConverter anyway**
   > 
   > The streaming merge (StreamingMergeBuilder) already encodes sort keys via 
RowConverter for merge comparisons. So using RowConverter during the sort step 
doesn't add new encoding work to the total pipeline — it moves the encoding 
earlier (sort time vs merge time).
   > 
   > **12. Arrow-rs improvements that would help**
   > 
   > - `concat_batches` for StringView producing 1 backing buffer directly 
(avoid N-buffer scatter)
   > - Efficient Dictionary → StringView conversion (convert dictionary values 
only, not every row)
   > - `lexsort_rows` / `lexsort_radix` from arrow-rs#9683 (purpose-built sort 
on row-encoded data)
   > 
   > ### Final benchmark results (sort 1M, all iterations vs main)
   > 
   > | Benchmark | main | es1 | es2 | es3 | es4 |
   > |:---|---:|---:|---:|---:|---:|
   > | sort i64 | 45.5 | 33.8 | 37.4 | 34.6 | 34.8 |
   > | sort f64 | 46.4 | 35.8 | 39.7 | 36.7 | — |
   > | sort utf8 low card | 67.4 | 64.6 | 57.7 | 37.9 | — |
   > | sort utf8 high card | 74.0 | 59.7 | 73.6 | 66.6 | — |
   > | sort utf8 view low card | 34.9 | 25.6 | 29.0 | 26.3 | — |
   > | sort utf8 view high card | 62.3 | 65.1 | 66.6 | 58.8 | — |
   > | sort utf8 dict | 49.3 | 31.8 | 35.1 | 31.9 | — |
   > | sort mixed tuple w/ view | 93.9 | 85.9 | 90.1 | 85.2 | — |
   > | sort mixed tuple | 102.3 | 105.7 | 112.5 | 105.8 | — |
   > | sort utf8 dict tuple | 78.5 | 92.4 | 96.5 | 92.0 | 91.3 |
   > | sort utf8 tuple | 111.0 | 272.1 | 187.6 | 180.2 | 144.5 |
   > | sort utf8 view tuple | 100.6 | 163.7 | 163.9 | 156.1 | 123.2 |
   > | sort mixed dict tuple | 108.2 | 282.6 | 288.1 | 282.9 | 282.6 |
   > 
   > ### Approach summary
   > 
   > | Iteration | Key idea | What improved | What didn't |
   > |:---|:---|:---|:---|
   > | es1 | Coalesce all columns, lexsort, take | Fixed-width, single string, 
TPC-H | Multi-col string/dict (cache thrashing on take) |
   > | es2 | Key-only extract, interleave values | Single-col sorts | Multi-col 
(interleave expensive for string/dict) |
   > | es3 | Internal StringView, full-batch concat+take | Low-card strings, 
uniformly better than es2 | Multi-col tuple (working set exceeds L2) |
   > | es4 | Key-only + RowConverter sort + hybrid reconstruct | Multi-col 
string tuple (22% improvement) | Dict (not using RowConverter), residual 
overhead |
   > 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to