damahua opened a new pull request, #21325:
URL: https://github.com/apache/datafusion/pull/21325

   ## Which issue does this PR close?
   
   Related to #19444 (sort spill StringView gc) and #20500 (repartition 
StringView gc). This PR extends the same fix to hash aggregation and sort-merge 
join spill paths, and adds BinaryViewArray support to the sort operator.
   
   ## Rationale
   
   After operations like `take` or `slice`, `StringViewArray` and 
`BinaryViewArray` retain shared references to **all** original data buffers. 
When these batches are written to spill files individually, the IPC writer must 
include every referenced buffer for every batch, causing massive write 
amplification.
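   The retention is easy to reproduce directly against arrow-rs (a minimal sketch; the row count and string length are illustrative, not taken from the PR):

   ```rust
   use arrow::array::StringViewArray;

   fn main() {
       // Values longer than 12 bytes are stored in shared data buffers
       // rather than inline in the u128 view structs.
       let long = "x".repeat(64);
       let array: StringViewArray = (0..1024).map(|_| Some(long.as_str())).collect();

       // A slice of 8 rows still references ALL of the original data buffers.
       let sliced = array.slice(0, 8);
       let before: usize = sliced.data_buffers().iter().map(|b| b.len()).sum();

       // gc() copies only the bytes the remaining views actually point to.
       let compacted = sliced.gc();
       let after: usize = compacted.data_buffers().iter().map(|b| b.len()).sum();

       // Only ~512 B are reachable, yet ~64 KiB were retained before gc.
       assert!(after < before);
       println!("retained before gc: {before} B, after gc: {after} B");
   }
   ```

   Writing the un-gc'd slice to an IPC spill file would serialize the full retained buffers, which is exactly the amplification described above.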
   
   The sort operator already had a fix for this (`organize_stringview_arrays` 
in `sort.rs`), but the **hash aggregation** and **sort-merge join** spill paths 
were missing it. Additionally, the sort operator's fix only handled 
`StringViewArray`, not `BinaryViewArray`.
   
   ### Hash aggregation spill path
   
   In `row_hash.rs`, `IncrementalSortIterator` produces output chunks via 
`take_record_batch`. Each chunk shares the same StringView data buffers as the 
parent emitted batch. Without `gc()`, spilling N chunks writes N copies of all 
shared buffers.
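   The amplification factor is simply the number of chunks sharing a buffer. A back-of-the-envelope model (all numbers hypothetical, not from the benchmark below):

   ```rust
   /// Bytes written to the spill file for one parent batch whose chunks all
   /// share `shared_bytes` of StringView data buffers (simplified model that
   /// ignores per-chunk view/metadata overhead).
   fn spill_write_bytes(shared_bytes: u64, n_chunks: u64, with_gc: bool) -> u64 {
       if with_gc {
           // Each chunk carries only its own ~1/n slice of the data, so the
           // total written is roughly the original buffer size.
           shared_bytes
       } else {
           // Every chunk's IPC frame embeds all referenced buffers in full.
           n_chunks * shared_bytes
       }
   }

   fn main() {
       // e.g. an 8 MB parent batch emitted as 8 chunks
       let baseline = spill_write_bytes(8 << 20, 8, false);
       let with_gc = spill_write_bytes(8 << 20, 8, true);
       assert_eq!(baseline / with_gc, 8); // 8x write amplification without gc
       println!("baseline: {baseline} B, with gc: {with_gc} B");
   }
   ```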
   
   ### Sort-merge join spill path
   
   In `bitwise_stream.rs`, `inner_key_buffer` contains sliced batches that 
share StringView data buffers with the original unsliced batches.
   
   ## Changes
   
   1. **New `gc_view_arrays()` utility** in `spill/mod.rs` — compacts both 
`StringViewArray` and `BinaryViewArray` in a `RecordBatch`, returning the batch 
unchanged (no allocation) when no view-type columns exist
   2. **Hash aggregation spill** (`row_hash.rs`) — gc each 
`IncrementalSortIterator` output batch before writing to the spill file
   3. **Sort-merge join spill** (`bitwise_stream.rs`) — gc sliced 
`inner_key_buffer` batches before spilling
   4. **Sort operator** (`sort.rs`) — extended existing 
`organize_stringview_arrays` to also handle `BinaryViewArray`
   
   ## A/B Benchmark Results
   
   **Workload:** `SELECT group_key, COUNT(*), SUM(value) FROM t GROUP BY 
group_key`
   - 100,000 rows, 50,000 unique groups (high cardinality → forces spilling)
   - `group_key`: `Utf8View` (StringViewArray) with 50+ byte non-inline strings
   - Memory pool: 20 MB (FairSpillPool), single partition, batch_size=8192
   - Same source commit; only the gc patch differs. N=3 runs on deterministic data (zero variance across runs).
   
   | Metric | Baseline (no gc) | With gc | Change |
   |--------|-----------------|---------|--------|
   | **Spilled bytes** | **39.50 MB** | **7.90 MB** | **-80.0%** |
   | Output bytes | 103.3 MB | 35.5 MB | -65.6% |
   | Query time | 321 ± 11 ms | 324 ± 9 ms | ~same |
   | Peak memory | 19.82 MB | 19.82 MB | same |
   | Spill count | 2 | 2 | same |
   
   With a tighter 8 MB pool: **baseline OOMs** (`ResourcesExhausted: Failed to 
allocate additional 10.4 MB for GroupedHashAggregateStream`) because inflated 
StringView buffers cause the sort memory estimate to exceed the pool. 
**Optimized completes successfully** (5 spills, 20.9 MB spilled).
   
   ## Tests
   
   - All 1,299 existing `datafusion-physical-plan` lib tests pass, aside from one pre-existing zstd feature failure
   - All 29 `memory_limit` integration tests pass
   - All 56 `sort_merge_join` tests pass
   - Added 4 new tests:
     - `test_gc_view_arrays_reduces_spill_size` — verifies gc compacts taken 
StringView/BinaryView batches
     - `test_gc_view_arrays_write_amplification` — demonstrates 8.5× write 
amplification without gc
     - `test_gc_view_arrays_noop_for_non_view_types` — verifies no overhead for 
non-view types
     - `bench_stringview_aggregate_spill` — end-to-end benchmark with EXPLAIN 
ANALYZE metrics


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

