tespent commented on issue #1103: URL: https://github.com/apache/datafusion-python/issues/1103#issuecomment-2830018437
> if you can share, I'd like to learn more about the interplay of the 2 systems.

@aditanase Sure. I think my basic idea is quite similar to yours. But instead of wrapping everything into a ray.data.Datasource, I execute the SQL plan inside ray.data.Dataset.map_batches. This enables SQL processing of data that comes from another ray.data.Dataset.

> Are you using a similar approach? Are you integrating the DF SQL and ray.data at a deeper level? More like smallpond?

There are a lot of details that cannot be described easily here. I am working on open-sourcing our code and will share documentation soon (in one or two weeks, maybe). I will also submit a REP to the Ray community later.

> This has the advantage of DF's load and processing speed, but once we get to joins/shuffles, we switch back to ray.

I think we've run into the same issue. My solution does not support join/shuffle for now, but I plan to integrate DF SQL deeper into ray.data by writing a few ray.data operators to support shuffle, and by using optimizer rules to translate some operators (like global limit) into ray.data operators.

I am also interested in your ideas and solutions. Would you like to share? Thanks!