Dandandan opened a new pull request, #21551:
URL: https://github.com/apache/datafusion/pull/21551

   ## Which issue does this PR close?
   
   - N/A (performance improvement)
   
   ## Rationale for this change
   
   In `RepartitionExec`'s round-robin mode, batches are currently sent to 
partitions in strict sequential order. If a downstream consumer is slow, data 
piles up in that channel's buffer while other channels may be empty and idle. 
This causes unnecessary buffering and suboptimal throughput.
   
   ## What changes are included in this PR?
   
   - Added `is_empty()` method to `DistributionSender` to check if a channel's 
buffer is currently empty
   - Modified `pull_from_input` in `RepartitionExec`: in round-robin mode, 
before sending to the next partition in sequence, check if that channel has 
buffered data. If so, scan for an empty channel and send there instead. If no 
empty channel is found, fall back to the original partition.
   
   This makes round-robin repartitioning adaptive to varying consumer speeds 
while preserving the total number of rows emitted across all output partitions.
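
The selection step described in this section can be sketched roughly as follows. This is a minimal illustration and not the code in the PR: `Channel` is a hypothetical stand-in for `DistributionSender`, `pick_partition` is an invented helper name, and the scan order (starting just after the round-robin target and wrapping around) is an assumption, since the description only says to "scan for an empty channel".

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a per-partition output channel.
/// The real `DistributionSender` is not modelled here.
struct Channel {
    buffer: VecDeque<u64>,
}

impl Channel {
    /// Mirrors the idea of the `is_empty()` check: true when nothing is buffered.
    fn is_empty(&self) -> bool {
        self.buffer.is_empty()
    }
}

/// Chooses the partition to send the next batch to.
/// `next` is the partition that strict round-robin would pick.
fn pick_partition(channels: &[Channel], next: usize) -> usize {
    // Fast path: the round-robin target has no backlog, so use it.
    if channels[next].is_empty() {
        return next;
    }
    let n = channels.len();
    // Assumed scan order: the partitions after `next`, wrapping around.
    (1..n)
        .map(|offset| (next + offset) % n)
        .find(|&i| channels[i].is_empty())
        // No empty channel found: fall back to the original partition.
        .unwrap_or(next)
}
```

With every channel empty, or every channel backlogged, this reduces to plain round-robin; it deviates only when the target is backlogged and some other channel is idle.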
   
   ## Are these changes tested?
   
   Yes, existing round-robin repartition tests are updated to validate total 
row counts across all partitions (rather than exact per-partition counts, which 
are now non-deterministic due to the adaptive behavior).
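
As an illustration of the order-independent check described above, a test can compare the summed row count instead of each partition's count. The helpers below are hypothetical and do not come from the DataFusion test suite; they only show the shape of the invariant.

```rust
/// Hypothetical helper: sums the row counts observed on each output partition.
fn total_rows(rows_per_partition: &[usize]) -> usize {
    rows_per_partition.iter().sum()
}

/// Asserts the invariant that holds regardless of how batches were routed:
/// the output partitions together contain exactly the input rows.
fn assert_rows_preserved(input_rows: usize, rows_per_partition: &[usize]) {
    assert_eq!(
        total_rows(rows_per_partition),
        input_rows,
        "repartitioning must neither drop nor duplicate rows"
    );
}
```

Both an even split and a skewed one pass this check, which is why it stays stable under the adaptive routing.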
   
   ## Are there any user-facing changes?
   
   No API changes. Repartitioned data may be distributed differently across 
output partitions compared to before, but total row counts are preserved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

