EmilyMatt commented on PR #1511: URL: https://github.com/apache/datafusion-comet/pull/1511#issuecomment-2718819688
@andygrove Indeed! I think this is a much better implementation, but if I understand correctly it can still hold on to a large amount of memory: no RecordBatch can be dropped until every PartitionIterator has finished with it, so data skew can leave us keeping a lot of idle memory. I do think this will probably balance out on average. More generally, there is a trade-off between copying the data (which degrades performance but can use much less memory, e.g. by using take on each record batch to partition it and then letting each partition have its own row count) and having the fastest shuffle possible, which in turn may cause undesired memory spikes in real-life scenarios.
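To make the trade-off concrete, here is a toy sketch in plain Rust. It is not Comet's or Arrow's actual code: `Batch`, `PartitionView`, `partition_by_reference` and `partition_by_copy` are hypothetical names, a `Vec<i64>` stands in for a RecordBatch, `Rc` stands in for Arrow's reference-counted buffers, and `value mod num_partitions` stands in for the real partitioning hash.

```rust
use std::rc::Rc;

/// Toy stand-in for a RecordBatch: one column of values (assumption, not Arrow).
type Batch = Vec<i64>;

/// Reference style: a partition keeps the whole batch alive plus its row indices.
struct PartitionView {
    batch: Rc<Batch>,
    rows: Vec<usize>,
}

impl PartitionView {
    /// Materialize this partition's values by reading through the shared batch.
    fn values(&self) -> Vec<i64> {
        self.rows.iter().map(|&r| self.batch[r]).collect()
    }
}

/// No data is copied, but every view pins the full batch until it is dropped.
fn partition_by_reference(batch: Rc<Batch>, num_partitions: usize) -> Vec<PartitionView> {
    let mut views: Vec<PartitionView> = (0..num_partitions)
        .map(|_| PartitionView { batch: Rc::clone(&batch), rows: Vec::new() })
        .collect();
    for (row, &value) in batch.iter().enumerate() {
        let p = value.rem_euclid(num_partitions as i64) as usize;
        views[p].rows.push(row);
    }
    views
}

/// Copy style (similar in spirit to a per-partition take): each partition owns
/// its rows, so the input batch can be freed as soon as this function returns.
fn partition_by_copy(batch: &Batch, num_partitions: usize) -> Vec<Batch> {
    let mut parts = vec![Vec::new(); num_partitions];
    for &value in batch {
        let p = value.rem_euclid(num_partitions as i64) as usize;
        parts[p].push(value);
    }
    parts
}
```

In the reference version, a single slow partition that still holds its `PartitionView` keeps the entire batch allocated, which is the skew problem. In the copy version, the extra work is paid up front and each partition's memory is released independently.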