[I] nit: Hash Partitioning Shuffle Writes Last Batches First [datafusion-comet]

via GitHub Sun, 20 Apr 2025 01:58:08 -0700


EmilyMatt opened a new issue, #1659:
URL: https://github.com/apache/datafusion-comet/issues/1659


   This is a nitpick, but I've noticed that in the new interleaved shuffle, 
when copying the data into the output data file, the in-memory data is written 
to the file first, and only then is the copy performed. Since this happens in 
a shuffle, the block order is not guaranteed in the read stage anyway, but 
writing in this order still removes the partial ordering within the block. 
This can be easily remedied by moving the write to after the copy is done, 
with no performance penalty. Then, if the data was ordered before the shuffle, 
the block will be ordered as well.
   I believe this is also more in line with Spark's shuffle behaviour, where 
the first batches received are the first to be written.
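
   To illustrate the proposed order, here is a minimal sketch. It is not the 
actual Comet code: `write_partition_block` and its parameters are hypothetical 
names, and it assumes the copy reads the partition's earlier batches from a 
previously written source, while the later batches are still buffered in 
memory.

```rust
use std::io::{self, Read, Write};

/// Hypothetical helper (not Comet's API): assemble one partition's block in
/// the output data file, keeping batches in arrival order.
fn write_partition_block<W: Write, R: Read>(
    output: &mut W,
    earlier_batches: &mut R, // assumed: batches written out earlier, to be copied
    in_memory: &[u8],        // assumed: encoded batches still buffered in memory
) -> io::Result<u64> {
    // Copy the earlier batches first, since they arrived first.
    let copied = io::copy(earlier_batches, output)?;
    // Only then append the batches that are still in memory.
    output.write_all(in_memory)?;
    Ok(copied + in_memory.len() as u64)
}
```

   With the two steps in this order, a block whose input was sorted before 
the shuffle stays sorted. Swapping them reproduces the behaviour described 
above, where the last batches end up first in the block.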


