[I] Fix scalability limitations of current implementation [datafusion-ray]

via GitHub Sat, 16 Nov 2024 12:13:54 -0800


andygrove opened a new issue, #46:
URL: https://github.com/apache/datafusion-ray/issues/46


   Query execution works by building up a tree of futures to execute each 
partition in each query stage.
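
To make the "tree of futures" idea concrete, here is a minimal sketch using only the Python standard library. It is not the datafusion-ray implementation: the `STAGES` plan, `run_partition`, `submit_stage`, and `execute_query` are hypothetical names, and a thread pool stands in for Ray tasks. Each partition of a stage becomes one future that waits on the futures of its child stage.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List, Set, Tuple

# Hypothetical plan: stage id -> (number of partitions, child stage ids).
STAGES: Dict[int, Tuple[int, List[int]]] = {
    0: (2, []),   # leaf stage with two partitions
    1: (2, [0]),  # reads the shuffle output of stage 0
    2: (1, [1]),  # final stage
}

Task = Tuple[int, int]  # (stage id, partition index)


def run_partition(stage_id: int, partition: int,
                  child_futures: List[Future]) -> Set[Task]:
    """Wait for every child partition, then 'execute' this partition.

    The return value is the set of tasks that ran in this subtree, which
    stands in for real query results.
    """
    done: Set[Task] = set()
    for fut in child_futures:
        done |= fut.result()
    done.add((stage_id, partition))
    return done


def submit_stage(pool: ThreadPoolExecutor, stage_id: int) -> List[Future]:
    """Submit child stages first, then one future per partition of this stage."""
    num_partitions, children = STAGES[stage_id]
    child_futures = [f for c in children for f in submit_stage(pool, c)]
    return [pool.submit(run_partition, stage_id, p, child_futures)
            for p in range(num_partitions)]


def execute_query(final_stage: int) -> Set[Task]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = submit_stage(pool, final_stage)
        done: Set[Task] = set()
        for fut in futures:
            done |= fut.result()
        return done
```

Calling `execute_query(2)` runs all five partition tasks in the hypothetical plan, with each stage starting only after its inputs have resolved.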
   
   The root node of each query stage is a `RayShuffleWriterExec`. It executes 
its child plan and collects all of the results into memory, which is already 
not scalable because there could be millions or billions of rows. It then 
concatenates all of the batches into one large batch and returns it. This 
large batch is stored in Ray's object store and fetched by the next query 
stage.
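
The difference between collecting and streaming can be sketched as follows. This is an illustration only, not code from `RayShuffleWriterExec`: plain Python lists stand in for Arrow record batches, and `child_plan`, `collect_and_concat`, and `stream_batches` are hypothetical names.

```python
from typing import Callable, Iterable, Iterator, List

Batch = List[int]  # stand-in for an Arrow record batch


def child_plan(num_batches: int, rows_per_batch: int) -> Iterator[Batch]:
    """Hypothetical child plan that produces batches lazily."""
    for i in range(num_batches):
        start = i * rows_per_batch
        yield list(range(start, start + rows_per_batch))


def collect_and_concat(batches: Iterable[Batch]) -> Batch:
    """Pattern described above: hold every batch, then build one large batch.

    Peak memory grows with the total number of rows in the partition.
    """
    collected = list(batches)  # all batches are resident at once
    combined: Batch = []
    for batch in collected:
        combined.extend(batch)
    return combined


def stream_batches(batches: Iterable[Batch],
                   sink: Callable[[Batch], None]) -> int:
    """Alternative: hand each batch to a sink as soon as it is produced.

    This function keeps only one batch at a time; what the sink retains
    is up to the sink. Returns the number of rows forwarded.
    """
    rows = 0
    for batch in batches:
        sink(batch)
        rows += len(batch)
    return rows
```

In the streaming variant the sink could write to disk or to some other store; what matters is that the writer itself never needs the whole partition in memory.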
   
   The original disk-based shuffle mechanism (that was removed in 
https://github.com/apache/datafusion-ray/pull/19) did not suffer from any of 
these issues because query results were streamed to disk in the writer and then 
streamed back out in the reader. However, this approach assumes that all 
workers have access to the same local file system.
   
   


