Dandandan opened a new pull request, #20741: URL: https://github.com/apache/datafusion/pull/20741
## Summary In hash repartitioning, input batches are split into sub-batches per output partition. With many partitions, these sub-batches can be very small (e.g. 8192 rows / 16 partitions = ~512 rows per partition). Previously each small sub-batch was materialized via `take_arrays` and sent immediately through the channel. This change: - Adds `hash_partition_indices` to `BatchPartitioner` that computes hash indices without materializing sub-batches - Uses Arrow's `BatchCoalescer` per output partition to accumulate rows via `push_batch_with_indices` until `target_batch_size` is reached - Extracts `send_batch` helper for the channel send logic - Flushes remaining buffered batches when input is exhausted **Benefits:** - Reduces channel traffic — fewer, larger batches instead of many small ones - Avoids intermediate `take_arrays` materialization (deferred to `BatchCoalescer`) - Output batches are properly sized to `target_batch_size` ## Test plan - All 41 repartition unit tests pass - All 6 repartition SQL logic tests pass - Coalesce tests pass - 0-column schema edge case handled (falls back to direct send) 🤖 Generated with [Claude Code](https://claude.com/claude-code) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
