[GitHub] [arrow-datafusion] crepererum opened a new pull request, #4867: refactor: improve repartition buffering

GitBox Tue, 10 Jan 2023 03:47:03 -0800


crepererum opened a new pull request, #4867:
URL: https://github.com/apache/arrow-datafusion/pull/4867


   # Which issue does this PR close?
   Closes #4865.
   
   # Rationale for this change
   The repartition operation used an unbounded buffer. This is not required in all cases and is even counterproductive: it drives the input nodes to completion while potentially starving the output nodes, and it fills the buffers up to the memory limit (at which point the operation just bails out).
   
   # What changes are included in this PR?
   A somewhat more sophisticated channel construct (distribution channels) that is unbounded only as long as at least one channel is empty. In practice (i.e. for any reasonable repartition config) this will NOT lead to unbounded memory usage, since virtually all partitions should eventually receive some data.
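
The gating rule described above can be sketched roughly as follows. This is a minimal, single-threaded model and not the code in this PR: the type `DistributionChannels` and the methods `gate_open`, `try_send`, and `recv` are hypothetical names chosen for illustration, and a real implementation would presumably suspend the sender asynchronously rather than return an error.

```rust
use std::collections::VecDeque;

/// Hypothetical, synchronous model of a set of "distribution channels".
/// Sending is allowed only while at least one channel is empty.
pub struct DistributionChannels<T> {
    queues: Vec<VecDeque<T>>,
    /// Number of queues that currently hold no items.
    empty_channels: usize,
}

impl<T> DistributionChannels<T> {
    /// Create `n` channels; all of them start out empty.
    pub fn new(n: usize) -> Self {
        Self {
            queues: (0..n).map(|_| VecDeque::new()).collect(),
            empty_channels: n,
        }
    }

    /// The gate is open while at least one channel is empty.
    pub fn gate_open(&self) -> bool {
        self.empty_channels > 0
    }

    /// Buffer `item` in `channel` if the gate is open. When the gate is
    /// closed the item is handed back, which stands in for "sender must wait".
    pub fn try_send(&mut self, channel: usize, item: T) -> Result<(), T> {
        if !self.gate_open() {
            return Err(item);
        }
        let queue = &mut self.queues[channel];
        if queue.is_empty() {
            // This channel is about to become non-empty.
            self.empty_channels -= 1;
        }
        queue.push_back(item);
        Ok(())
    }

    /// Take the oldest item from `channel`. Draining a channel completely
    /// makes it empty again, which reopens the gate for all senders.
    pub fn recv(&mut self, channel: usize) -> Option<T> {
        let queue = &mut self.queues[channel];
        let item = queue.pop_front()?;
        if queue.is_empty() {
            self.empty_channels += 1;
        }
        Some(item)
    }
}
```

In this model, a channel can keep accepting items without bound while some other channel is still starved. Once every channel holds at least one item, further sends are rejected until a receiver drains one channel completely.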
   
   # Are these changes tested?
   - all existing tests pass
   - extensive tests for the distribution construct
   
   # Are there any user-facing changes?
   Improved scheduling for the repartition operation.


