[I] Optimize native shuffle to write schema once per output partition instead of once per batch [datafusion-comet]

via GitHub Wed, 17 Dec 2025 07:16:16 -0800


andygrove opened a new issue, #2928:
URL: https://github.com/apache/datafusion-comet/issues/2928


   ### What is the problem the feature request solves?
   
   Native shuffle currently encodes the name of the compression codec and the 
IPC schema once per batch. Storing the schema per batch was originally 
required because the schema could vary between batches due to dictionary 
encoding, but this is no longer the case.
   
   ### Describe the potential solution
   
   Write the compression codec name and schema once per output partition. This 
can be implemented in `ShuffleBlockWriter::try_new`. Update 
`ShuffleBlockWriter::write_batch` to no longer write the header per batch.
   
   Make corresponding changes in the reader, `NativeBlockDecoderIterator.scala`.
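
To make the proposed layout concrete, here is a minimal sketch using only the Rust standard library. It is not Comet's actual code or wire format: `PartitionWriter`, `write_chunk`, `read_chunk`, `read_partition`, and the 8-byte little-endian length prefix are all hypothetical names and choices made for illustration, and batches and the schema are treated as opaque byte slices. The idea it shows is the one described above: the header (codec name and schema) is written in `try_new`, and `write_batch` writes only the batch payload.

```rust
use std::io::{self, Read, Write};

/// Hypothetical stand-in for a per-partition shuffle writer.
/// The header (codec name + schema bytes) is written once, at construction.
struct PartitionWriter<W: Write> {
    out: W,
}

impl<W: Write> PartitionWriter<W> {
    /// Writes the header once for the whole output partition.
    fn try_new(mut out: W, codec: &str, schema: &[u8]) -> io::Result<Self> {
        write_chunk(&mut out, codec.as_bytes())?;
        write_chunk(&mut out, schema)?;
        Ok(Self { out })
    }

    /// Writes only the batch payload; no header is repeated here.
    fn write_batch(&mut self, batch: &[u8]) -> io::Result<()> {
        write_chunk(&mut self.out, batch)
    }

    fn into_inner(self) -> W {
        self.out
    }
}

/// Assumed framing for this sketch: u64 little-endian length, then the bytes.
fn write_chunk<W: Write>(out: &mut W, bytes: &[u8]) -> io::Result<()> {
    out.write_all(&(bytes.len() as u64).to_le_bytes())?;
    out.write_all(bytes)
}

/// Reads one length-prefixed chunk. Returns `None` at end of stream.
/// For simplicity, a truncated length prefix is also treated as end of stream.
fn read_chunk<R: Read>(input: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len = [0u8; 8];
    match input.read_exact(&mut len) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let mut buf = vec![0u8; u64::from_le_bytes(len) as usize];
    input.read_exact(&mut buf)?;
    Ok(Some(buf))
}

fn missing_header() -> io::Error {
    io::Error::new(io::ErrorKind::UnexpectedEof, "missing partition header")
}

/// Reader side: parse the header once, then iterate over the batches.
fn read_partition<R: Read>(mut input: R) -> io::Result<(String, Vec<u8>, Vec<Vec<u8>>)> {
    let codec = read_chunk(&mut input)?.ok_or_else(missing_header)?;
    let schema = read_chunk(&mut input)?.ok_or_else(missing_header)?;
    let codec = String::from_utf8(codec)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    let mut batches = Vec::new();
    while let Some(batch) = read_chunk(&mut input)? {
        batches.push(batch);
    }
    Ok((codec, schema, batches))
}
```

With this framing, a partition containing N batches carries the codec name and schema once rather than N times. The real reader lives on the Scala side, so `read_partition` only illustrates the decoding order it would need to follow.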
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

