alamb commented on PR #6929: URL: https://github.com/apache/arrow-datafusion/pull/6929#issuecomment-1636443457
> If it turns out that bounding memory usage inevitably reduces performance in a non-negligible way, I propose we introduce a configuration flag to control this. We can make the high-performance/unbounded behavior the default, but one should still be able to choose the lower-performance/bounded version for memory-conscious use cases.

I don't think we should ever use unbounded memory if we can avoid it. In this case, if the producer goes faster than the consumer, it will just buffer a huge amount of data (and, e.g., will eventually OOM with TPCH SF100 or SF1000).

I like @Dandandan's suggestion to introduce more buffering.

Perhaps we could extend the existing `DistributionSender` to hold a small queue (2 or 3 entries, for example) rather than just a single `Option<>`, so that it is possible to start fetching the next input immediately:

https://github.com/apache/arrow-datafusion/blob/d316702722e6c301fdb23a9698f7ec415ef548e9/datafusion/core/src/physical_plan/repartition/distributor_channels.rs#L180-L182
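
To make the idea of a small queue in place of a single `Option<>` slot concrete, here is a rough, synchronous sketch. `BoundedQueue`, `try_send`, and `try_recv` are hypothetical names for illustration only; the real `DistributionSender` is async and coordinates wakers across channels, which this deliberately leaves out.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded buffer, NOT DataFusion's actual `DistributionSender`.
/// With `capacity == 1` it behaves like a single `Option<T>` slot; a capacity
/// of 2 or 3 lets the producer run slightly ahead while memory stays bounded.
pub struct BoundedQueue<T> {
    items: VecDeque<T>,
    capacity: usize,
}

impl<T> BoundedQueue<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Self {
            items: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Enqueue `item`, or hand it back as `Err(item)` when the queue is full.
    /// An async version would register a waker and suspend here instead.
    pub fn try_send(&mut self, item: T) -> Result<(), T> {
        if self.items.len() >= self.capacity {
            return Err(item);
        }
        self.items.push_back(item);
        Ok(())
    }

    /// Dequeue the oldest item, if any, freeing one slot for the producer.
    pub fn try_recv(&mut self) -> Option<T> {
        self.items.pop_front()
    }

    /// Number of items currently buffered (never exceeds `capacity`).
    pub fn len(&self) -> usize {
        self.items.len()
    }

    pub fn is_empty(&self) -> bool {
        self.items.is_empty()
    }
}
```

The point of the sketch is that the buffered amount is capped at `capacity` batches per channel, so a fast producer gets backpressure instead of growing memory without limit, while still being able to fetch the next input before the consumer has drained the previous one.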
