[I] Implement shuffle using Ray object store [datafusion-ray]

via GitHub Sat, 14 Dec 2024 11:40:37 -0800


andygrove opened a new issue, #55:
URL: https://github.com/apache/datafusion-ray/issues/55


   We want to reimplement the shuffle mechanism and have it use the Ray object 
store to store the shuffle output.
   
   Here are some notes on what this may look like.
   
   ShuffleWriterExec and ShuffleReaderExec should be re-implements in Python 
rather than Rust (although they can still call Rust code where needed).
   
   ## Planning
   
   The reader needs to know how to find the output from the writer, so we 
perhaps need to add some extra data to the distributed plan, such as a UUID for 
each query stage.
   
   ## Writer
   
   The new ShuffleWriterExec should execute its child plan to get a stream of 
record batches and iterate over them and repartition them according to the 
shuffle partitioning schema.
   
   The smaller repartitioned batches should be written to the object store 
(although we may want to buffer them in memory first until they are a certain 
size).
   
   When all batches have been processed, we should be able to write a final 
object that contains references to the other objects. 
   
   ## Reader
   
   The reader will need to find the final object stored by the writer so that 
it can then load the other batches, We probably need to remove the objects once 
they have been read.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Implement shuffle using Ray object store [datafusion-ray]

Reply via email to