andygrove opened a new pull request, #3234:
URL: https://github.com/apache/datafusion-comet/pull/3234

   ## Summary
   
   This PR adds batch coalescing before shuffle writes to reduce per-batch 
overhead and improve vectorization efficiency. When enabled, small columnar 
batches are combined until they reach the target batch size before being 
processed by the shuffle writer.
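
To illustrate the idea described above, here is a minimal sketch of batch coalescing in plain Python. It is not Comet's implementation and not DataFusion's `CoalesceBatchesExec`; the `coalesce_batches` function, the `target_rows` parameter, and the list-of-integers stand-in for a columnar batch are all hypothetical. The sketch also chooses to emit batches of exactly `target_rows` rows, which may differ from how the real operator handles oversized buffers.

```python
from typing import Iterable, Iterator, List

# Stand-in for a columnar batch: each list element represents one row.
Batch = List[int]


def coalesce_batches(batches: Iterable[Batch], target_rows: int) -> Iterator[Batch]:
    """Buffer small batches and emit a combined batch each time the buffer
    holds at least `target_rows` rows; flush any remainder at the end."""
    if target_rows <= 0:
        raise ValueError("target_rows must be positive")
    buffer: Batch = []
    for batch in batches:
        buffer.extend(batch)
        # Emit full-sized batches while enough rows are buffered.
        while len(buffer) >= target_rows:
            yield buffer[:target_rows]
            buffer = buffer[target_rows:]
    if buffer:
        # Final partial batch: the input ran out before the target was reached.
        yield buffer


small = [[1, 2], [3], [4, 5, 6], [7], [8, 9]]
combined = list(coalesce_batches(small, target_rows=4))
print(combined)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

Five small input batches become three larger ones, so a downstream consumer such as a shuffle writer pays its fixed per-batch cost three times instead of five, while row order and row count are unchanged.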
   
   **Key changes:**
   - Added `spark.comet.shuffle.resizeBatches.input` config to enable 
coalescing batches before shuffle write
   - Added `spark.comet.shuffle.resizeBatches.output` config for coalescing 
after shuffle read  
   - Native planner wraps shuffle input with DataFusion's `CoalesceBatchesExec` 
when input coalescing is enabled
   - Added `CometBatchCoalescer` Scala class for output-side batch coalescing
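
As a rough sketch of how the two settings above might be passed to a job, the snippet below renders them as `--conf key=value` arguments for `spark-submit`. The config keys come from this PR; the `to_spark_submit_args` helper is hypothetical, and treating the `output` key as a boolean flag is an assumption, since the PR text only shows `input=true`.

```python
from typing import Dict, List

# Keys are the ones named in this PR. Only `...input=true` appears in the
# PR text; the value used for the output key is an assumption.
comet_shuffle_conf: Dict[str, str] = {
    "spark.comet.shuffle.resizeBatches.input": "true",
    "spark.comet.shuffle.resizeBatches.output": "true",  # assumed boolean
}


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render settings as `--conf key=value` pairs for spark-submit."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_spark_submit_args(comet_shuffle_conf)))
```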
   
   **Performance benefits observed in TPC-H Q18 benchmarks:**
   - 10.9% overall query time improvement
   - Significantly reduced GC pressure (e.g., Stage 26: GC time dropped from 
3,602 ms to 56 ms, roughly a 98% reduction)
   - Better vectorization efficiency for downstream operators
   
   ## Test plan
   
   - [ ] Verify existing unit tests pass
   - [ ] Run TPC-H Q18 benchmark with 
`spark.comet.shuffle.resizeBatches.input=true`
   - [ ] Verify GC metrics improve with the optimization enabled
   - [ ] Test with various batch sizes to ensure correct behavior
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

