andygrove opened a new issue, #3882: URL: https://github.com/apache/datafusion-comet/issues/3882
### Describe the bug

The current shuffle format writes each batch using the Arrow IPC Stream format, with a single batch per stream instance, which means that the schema is encoded for every batch. There may also be overhead in creating a new compression codec for each batch.

In one example with the default batch size, we have seen Comet shuffle files that are 50% larger than Spark shuffle files, with overall query performance 10% slower than Spark. After doubling the batch size, Comet shuffle files were only 8% larger than Spark's and performance was 15% faster than Spark. Increasing the batch size consistently improves performance, but at the cost of downstream operators potentially using more memory, although we have not measured this.

### Steps to reproduce

_No response_

### Expected behavior

_No response_

### Additional context

_No response_
