Dandandan opened a new pull request, #21550:
URL: https://github.com/apache/datafusion/pull/21550

   ## Which issue does this PR close?
   
   N/A - performance improvement
   
   ## Rationale for this change
   
   For hash repartitioning, each input batch is split into `num_partitions` 
small sub-batches. Previously, these small batches were sent individually 
through channels and coalesced on the output side (downstream) in 
`PerPartitionStream`. This caused unnecessary channel traffic and contention.
   
   ## What changes are included in this PR?
   
   - **Upstream coalescing for hash repartition**: Small partitioned 
sub-batches are now coalesced on the input side (`pull_from_input`) using 
`LimitedBatchCoalescer` before sending through channels. This means fewer, 
larger batches flow through the channels.
   - **Removed downstream coalescing from `PerPartitionStream`**: The 
`batch_coalescer` field, `poll_next_and_coalesce` method, and related logic are 
removed. `PerPartitionStream` is now a simple pass-through.
   - **Extracted `send_to_channel` helper**: The batch-sending logic (memory 
reservation, spilling, channel send) is extracted into a reusable method to 
avoid duplication.
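
The effect of moving coalescing to the input side can be illustrated with a minimal, standard-library-only sketch. It does not use DataFusion's `LimitedBatchCoalescer` or `RecordBatch`. The `PartitionBuffer` type, the `repartition` function, the `Vec<u64>` row representation, and the use of `std::sync::mpsc` channels are all hypothetical stand-ins chosen for illustration, and memory reservation and spilling are left out.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for a per-partition coalescer: buffers rows and
/// yields a batch each time exactly `batch_size` rows have accumulated.
struct PartitionBuffer {
    rows: Vec<u64>,
    batch_size: usize,
}

impl PartitionBuffer {
    fn new(batch_size: usize) -> Self {
        Self { rows: Vec::with_capacity(batch_size), batch_size }
    }

    /// Appends a sub-batch and returns every full batch completed by it.
    fn push(&mut self, sub_batch: Vec<u64>) -> Vec<Vec<u64>> {
        let mut full = Vec::new();
        for row in sub_batch {
            self.rows.push(row);
            if self.rows.len() == self.batch_size {
                let next = Vec::with_capacity(self.batch_size);
                full.push(std::mem::replace(&mut self.rows, next));
            }
        }
        full
    }

    /// Returns the remaining partial batch, if any, at end of input.
    fn finish(&mut self) -> Option<Vec<u64>> {
        if self.rows.is_empty() { None } else { Some(std::mem::take(&mut self.rows)) }
    }
}

/// Maps a key to an output partition by hashing it.
fn partition_of(key: u64, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % num_partitions as u64) as usize
}

/// Hash-repartitions `input` and returns, per output partition, the batches
/// that travelled through that partition's channel (one entry per send).
fn repartition(
    input: &[Vec<u64>],
    num_partitions: usize,
    batch_size: usize,
    coalesce_upstream: bool,
) -> Vec<Vec<Vec<u64>>> {
    let (senders, receivers): (Vec<Sender<Vec<u64>>>, Vec<Receiver<Vec<u64>>>) =
        (0..num_partitions).map(|_| channel()).unzip();
    let mut buffers: Vec<PartitionBuffer> =
        (0..num_partitions).map(|_| PartitionBuffer::new(batch_size)).collect();

    for batch in input {
        // Split the input batch into one sub-batch per output partition.
        let mut sub_batches = vec![Vec::new(); num_partitions];
        for &key in batch {
            sub_batches[partition_of(key, num_partitions)].push(key);
        }
        for (p, sub_batch) in sub_batches.into_iter().enumerate() {
            if sub_batch.is_empty() {
                continue;
            }
            if coalesce_upstream {
                // Buffer on the input side; send only completed batches.
                for full in buffers[p].push(sub_batch) {
                    senders[p].send(full).unwrap();
                }
            } else {
                // Send each small sub-batch through the channel as-is.
                senders[p].send(sub_batch).unwrap();
            }
        }
    }
    // Flush partial batches once the input is exhausted.
    for (p, buffer) in buffers.iter_mut().enumerate() {
        if let Some(rest) = buffer.finish() {
            senders[p].send(rest).unwrap();
        }
    }
    drop(senders); // Close the channels so the receivers terminate.
    receivers.into_iter().map(|rx| rx.iter().collect()).collect()
}
```

Calling `repartition` twice on the same input, once with `coalesce_upstream` set to `false` and once with `true`, delivers the same rows in the same order to each partition. As long as no sub-batch exceeds `batch_size`, the coalesced run never performs more channel sends than the uncoalesced one.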
   
   ## Are these changes tested?
   
   Yes - all 41 existing repartition tests pass. The 
`test_repartition_with_coalescing` test was renamed to 
`test_hash_repartition_with_upstream_coalescing` and updated to validate the 
new upstream coalescing behavior.
   
   ## Are there any user-facing changes?
   
   No functional changes. Output batch sizes may differ from before: batches 
are still coalesced up to `batch_size`, but on the input side before the 
channel send rather than downstream in `PerPartitionStream`.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

