Re: [I] Add CatalogProvider API [datafusion-python]

via GitHub Fri, 25 Apr 2025 03:22:14 -0700


tespent commented on issue #1103:
URL: 
https://github.com/apache/datafusion-python/issues/1103#issuecomment-2830018437


   > if you can share, I'd like to learn more about the interplay of the 2 
systems.
   
   @aditanase Sure. I think my basic idea is quite similar to yours. But 
instead of wrapping everything in a ray.data.Datasource, I execute the SQL plan 
inside ray.data.Dataset.map_batches. This enables SQL processing of data from 
another ray.data.Dataset.
   
   > Are you using a similar approach? Are you integrating the DF SQL and 
ray.data at a deeper level? More like smallpond?
   
   There are a lot of details that cannot be described easily here. I am 
working on open-sourcing our code and will share documentation soon (in one or 
two weeks, maybe). I will also submit a REP to the Ray community later.
   
   > This has the advantage of DF's load and processing speed, but once we get 
to joins/shuffles, we switch back to ray.
   
   I think we've hit the same issue. My solution does not support join/shuffle 
for now, but I plan to integrate DF SQL deeper into ray.data by writing a few 
ray.data operators to support shuffle and using optimizer rules to translate 
some operators (like global limit) into ray.data operators.
   
   I am also interested in your ideas and solutions. Would you like to share? 
Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

