Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]

via GitHub Wed, 12 Mar 2025 11:53:09 -0700


EmilyMatt commented on PR #1511:
URL: https://github.com/apache/datafusion-comet/pull/1511#issuecomment-2718819688


   @andygrove Indeed! I think this is a much better implementation, but if I 
understand correctly it still retains a large amount of memory: no 
RecordBatch can be dropped until every PartitionIterator is finished with 
it, so data skew can leave us holding a lot of idle memory.
   I do think that on average this will probably balance itself out.
   More generally, there is a trade-off between copying the data (which 
degrades performance but can use much less memory, e.g., by using take on 
each record batch to partition it and then letting each partition have its 
own row count) and having the fastest shuffle possible, which in turn may 
cause undesired memory spikes in real-life scenarios.
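
   To make the retention concern concrete, here is a minimal sketch using only 
the Rust standard library. It is a toy model, not the PR's code and not the 
Arrow API: `Batch`, `IndexedPartition`, `CopiedPartition`, `buffer_by_index` 
and `buffer_by_copy` are hypothetical names, an `Rc<Vec<i64>>` stands in for a 
RecordBatch, and the copy path only approximates what take would do.

```rust
use std::rc::{Rc, Weak};

/// Toy stand-in for a RecordBatch: a single shared column of i64 values.
type Batch = Rc<Vec<i64>>;

/// Index-based buffering (interleave-style): a partition keeps a reference to
/// each input batch plus the row indices it owns inside that batch.
struct IndexedPartition {
    slices: Vec<(Batch, Vec<usize>)>,
}

/// Copy-based buffering (take-style): a partition owns a copy of its rows.
struct CopiedPartition {
    rows: Vec<i64>,
}

/// Hypothetical partitioner: value modulo the number of partitions.
fn partition_of(value: i64, num_partitions: usize) -> usize {
    value.rem_euclid(num_partitions as i64) as usize
}

/// Record row indices per partition; every partition that received at least
/// one row keeps the whole batch alive through its `Rc` clone.
fn buffer_by_index(batch: &Batch, parts: &mut [IndexedPartition]) {
    let n = parts.len();
    let mut indices: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (row, value) in batch.iter().enumerate() {
        indices[partition_of(*value, n)].push(row);
    }
    for (part, rows) in parts.iter_mut().zip(indices) {
        if !rows.is_empty() {
            part.slices.push((Rc::clone(batch), rows));
        }
    }
}

/// Copy each row into its partition; the input batch is not referenced after
/// this returns.
fn buffer_by_copy(batch: &Batch, parts: &mut [CopiedPartition]) {
    let n = parts.len();
    for value in batch.iter() {
        parts[partition_of(*value, n)].rows.push(*value);
    }
}

fn main() {
    // Skewed input: seven rows hash to partition 0, a single row to partition 1.
    let batch: Batch = Rc::new(vec![0, 2, 4, 6, 8, 10, 12, 1]);
    let alive: Weak<Vec<i64>> = Rc::downgrade(&batch);
    let mut indexed: Vec<IndexedPartition> =
        (0..2).map(|_| IndexedPartition { slices: Vec::new() }).collect();
    buffer_by_index(&batch, &mut indexed);
    drop(batch);

    // Flushing the hot partition does not free the batch, because the cold
    // partition still points at its one row.
    indexed[0].slices.clear();
    println!("indexed, partition 0 flushed: alive = {}", alive.upgrade().is_some()); // true
    indexed[1].slices.clear();
    println!("indexed, all flushed: alive = {}", alive.upgrade().is_some()); // false

    // With copying, the batch can be freed right after it is partitioned.
    let batch: Batch = Rc::new(vec![0, 2, 4, 6, 8, 10, 12, 1]);
    let alive: Weak<Vec<i64>> = Rc::downgrade(&batch);
    let mut copied: Vec<CopiedPartition> =
        (0..2).map(|_| CopiedPartition { rows: Vec::new() }).collect();
    buffer_by_copy(&batch, &mut copied);
    drop(batch);
    println!("copied: alive = {}", alive.upgrade().is_some()); // false
    println!("copied rows: {} and {}", copied[0].rows.len(), copied[1].rows.len()); // 7 and 1
}
```

   In this model a single row in a slow partition pins the whole batch, which 
is the skew case described above, while the copy path pays for the extra 
writes up front and releases the input immediately.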


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


