andygrove commented on issue #3882: URL: https://github.com/apache/datafusion-comet/issues/3882#issuecomment-4209248288
My current thinking is that there are two fixes needed for this issue:

1. Shuffle block size should be based on batch size in bytes rather than on number of rows. For batches with a small number of columns, 8192 rows produces small blocks where the overhead of the serialization format really shows, and compression is less effective.
2. Switch to the IPC stream format so that the schema is encoded once per partition rather than once per block.
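The size-based cut in point 1 could be sketched roughly as below. This is only an illustration of the batching policy, not Comet's actual code: `plan_blocks`, the target constant, and the sample sizes are all hypothetical.

```rust
// Hypothetical sketch: cut shuffle blocks by accumulated encoded byte
// size instead of a fixed row count. Not Comet's real API.

/// Group batches (given by their encoded byte sizes) into blocks whose
/// total size stays close to `target_block_bytes`. Returns, per block,
/// the indices of the batches it contains.
fn plan_blocks(batch_sizes: &[usize], target_block_bytes: usize) -> Vec<Vec<usize>> {
    let mut blocks = Vec::new();
    let mut current = Vec::new();
    let mut current_bytes = 0usize;
    for (i, &size) in batch_sizes.iter().enumerate() {
        // Start a new block when adding this batch would overflow the
        // target, but never emit an empty block.
        if current_bytes + size > target_block_bytes && !current.is_empty() {
            blocks.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(i);
        current_bytes += size;
    }
    if !current.is_empty() {
        blocks.push(current);
    }
    blocks
}

fn main() {
    // Six batches of varying encoded size, 1 MB target block.
    let sizes = [400_000, 300_000, 500_000, 900_000, 100_000, 200_000];
    let blocks = plan_blocks(&sizes, 1_000_000);
    println!("{:?}", blocks); // [[0, 1], [2], [3, 4], [5]]
}
```

With a fixed row count, narrow batches would produce many tiny blocks; cutting on accumulated bytes keeps each block near the target regardless of column count, which also gives the compressor larger, more compressible buffers.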
