andygrove commented on issue #3882: URL: https://github.com/apache/datafusion-comet/issues/3882#issuecomment-4209248288
My current thinking is that there are two fixes needed for this issue:

1. Shuffle block size should be based on batch size in bytes rather than on number of rows. For batches with a small number of columns, 8192 rows produces small blocks where the overhead of the serialization format really shows, and compression is less effective.
2. Switch to the IPC stream format so that the schema is encoded once per partition rather than once per block.
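The size-based cut in point 1 could be sketched roughly as below. This is only an illustration of the batching policy, not Comet's actual code: `plan_blocks`, the target constant, and the sample sizes are all hypothetical.

```rust
// Hypothetical sketch: cut shuffle blocks by accumulated encoded byte
// size instead of a fixed row count. Not Comet's real API.

/// Group batches (given by their encoded byte sizes) into blocks whose
/// total size stays close to `target_block_bytes`. Returns, per block,
/// the indices of the batches it contains.
fn plan_blocks(batch_sizes: &[usize], target_block_bytes: usize) -> Vec<Vec<usize>> {
    let mut blocks = Vec::new();
    let mut current = Vec::new();
    let mut current_bytes = 0usize;
    for (i, &size) in batch_sizes.iter().enumerate() {
        // Start a new block when adding this batch would overflow the
        // target, but never emit an empty block.
        if current_bytes + size > target_block_bytes && !current.is_empty() {
            blocks.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(i);
        current_bytes += size;
    }
    if !current.is_empty() {
        blocks.push(current);
    }
    blocks
}

fn main() {
    // Six batches of varying encoded size, 1 MB target block.
    let sizes = [400_000, 300_000, 500_000, 900_000, 100_000, 200_000];
    let blocks = plan_blocks(&sizes, 1_000_000);
    println!("{:?}", blocks); // [[0, 1], [2], [3, 4], [5]]
}
```

With a fixed row count, narrow batches would produce many tiny blocks; cutting on accumulated bytes keeps each block near the target regardless of column count, which also gives the compressor larger, more compressible buffers.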
