houqp commented on issue #970: URL: https://github.com/apache/arrow-datafusion/issues/970#issuecomment-947946521
Thanks @alamb for adding the diagrams, really helps to visualize the idea :) > We might even be able to do it without any changes to the TableProvider as of now. Question is without the table providers telling the planner that `Scan B` and `Scan C belongs to the same external database instance, how would it be able to decide whether these two scans can be combined into a single push down query? Imagine a case where Scan B is for a postgres database instance managed by AWS RDS, while scan C is for a postgres database instance managed in GCP, the planner need to be able to know they are from two different database instances so it can decide whether the join should happen within datafusion or gets pushed down. On top of that, we also need a way to let the planner know what compute plan nodes are supported by a particular database type. I think the table provider could be a good abstraction to provide this info. > The specifics of what types of subplans could be converted is probably specific to the database being pushed to, so it isn't clear that logic belongs in the main datafusion crate. I feel like this should be defined inside the table provider implementation, which should be maintained as plugins outside of datafusion core. Datafusion core should just maintain the current memory and listing table providers. Or maybe those two can be moved out of the core one day :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
