crepererum opened a new pull request, #4867: URL: https://github.com/apache/arrow-datafusion/pull/4867
# Which issue does this PR close?

Closes #4865.

# Rationale for this change

The repartition operation had an unbounded buffer. This is not required in all cases and is even counterproductive: it drives the input nodes to completion while potentially starving the output nodes, and it fills the buffers up to the memory limit, at which point it simply bails out.

# What changes are included in this PR?

A somewhat more sophisticated channel construct (distribution channels) that is unbounded only as long as at least one channel is empty. In practice (i.e. for any reasonable repartition config) this will NOT lead to unbounded memory usage, since virtually all partitions should eventually receive some data.

# Are these changes tested?

- all existing tests pass
- extensive tests for the distribution construct

# Are there any user-facing changes?

Improved scheduling for the repartition operation.
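
To illustrate the gating rule described above (sending is unrestricted only while at least one channel is empty), here is a minimal synchronous sketch. It is not the code from this PR: the actual implementation is asynchronous, and the names `DistributionChannels`, `gate_open`, `try_send`, and `recv` are hypothetical, chosen only for this example.

```rust
use std::collections::VecDeque;

/// Hypothetical, simplified model of a set of channels that share one gate.
/// Assumption: one FIFO queue per output partition, no async wake-ups.
struct DistributionChannels<T> {
    queues: Vec<VecDeque<T>>,
    /// Number of queues that currently hold no items.
    empty_channels: usize,
}

impl<T> DistributionChannels<T> {
    fn new(n: usize) -> Self {
        Self {
            queues: (0..n).map(|_| VecDeque::new()).collect(),
            empty_channels: n,
        }
    }

    /// The gate is open while at least one channel is empty.
    fn gate_open(&self) -> bool {
        self.empty_channels > 0
    }

    /// Enqueue `item` if the gate is open; otherwise hand the item back.
    fn try_send(&mut self, channel: usize, item: T) -> Result<(), T> {
        if !self.gate_open() {
            return Err(item);
        }
        if self.queues[channel].is_empty() {
            // This channel stops being empty.
            self.empty_channels -= 1;
        }
        self.queues[channel].push_back(item);
        Ok(())
    }

    /// Dequeue the oldest item; draining a channel re-opens the gate.
    fn recv(&mut self, channel: usize) -> Option<T> {
        let item = self.queues[channel].pop_front()?;
        if self.queues[channel].is_empty() {
            self.empty_channels += 1;
        }
        Some(item)
    }
}

fn main() {
    let mut ch = DistributionChannels::new(2);
    // Channel 1 is still empty, so channel 0 may buffer freely.
    assert!(ch.try_send(0, 1).is_ok());
    assert!(ch.try_send(0, 2).is_ok());
    // Once every channel holds data, the gate closes for all senders.
    assert!(ch.try_send(1, 3).is_ok());
    assert_eq!(ch.try_send(0, 4), Err(4));
    // Draining channel 1 makes it empty again and re-opens the gate.
    assert_eq!(ch.recv(1), Some(3));
    assert!(ch.try_send(0, 4).is_ok());
}
```

In this model a sender that finds the gate closed gets its item back; an asynchronous version would presumably suspend the sender instead and wake it when a receiver drains a channel.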
