andygrove opened a new issue, #3882: URL: https://github.com/apache/datafusion-comet/issues/3882
### Describe the bug

The current shuffle format writes each batch using the Arrow IPC Stream format, with a single batch per stream instance, which means that the schema is encoded for every batch. There may also be overhead in creating a new compression codec for each batch.

In one example with the default batch size, we have seen Comet shuffle files that are 50% larger than Spark shuffle files, with overall query performance 10% slower than Spark. After doubling the batch size, Comet shuffle files were only 8% larger than Spark's and performance was 15% faster than Spark. Increasing the batch size consistently improves performance, but at the cost of downstream operators potentially using more memory, although we have not measured this.

### Steps to reproduce

_No response_

### Expected behavior

_No response_

### Additional context

_No response_
