houqp commented on issue #970: URL: https://github.com/apache/arrow-datafusion/issues/970#issuecomment-947443126
I think with some extension to our existing table provider abstraction, this kind of cross-table compute push down could be achieved within our logical or physical plan optimizer? Following @hu6360567 's logic of splitting the logical plan into sub plans, we could have the plan optimizer group the query plan tree into sub trees by the database instance referenced in the table scans. If a sub tree only reads tables from the same database, then we can safely convert that sub tree into a SQL query and push the query down to that database directly. This can be done by rewriting the sub tree into a single plan node that represents a remote SQL query execution.

To achieve this, we need to extend the table providers to supply the following info:

* database type (hard coded in the table provider code)
* database compute capability (hard coded in the table provider code)
* database logical plan to native query compiler (hard coded in the table provider code)
* database name/identifier (provided as part of a user query)

The database name/identifier and type help the planner decide how to group the plan into sub trees by database instance. The database compute capability helps the planner further filter down which subset of each sub tree can be pushed down. The trimmed-down sub plan then gets passed from the planner to the table provider's native query compiler to resolve the final query; it doesn't need to be SQL and can be any native query supported by the corresponding database type. Lastly, the planner replaces the trimmed-down sub plan with a single remote query execution node that carries the compiled native query. All of the above can be handled during the planning stage.

>> but the biggest change would be allowing datasources to declare if they support certain plans.

> That's why I convert sub plan back to sql (logical plans are distributed rather than physical plans). At planner node, the general planner do not need to know the capability of underlying datasource.

Do we really need to detect data source capability at runtime? Similar to how we declare filter pushdown capability via `TableProvider::supports_filter_pushdown`, we could define these compute capabilities statically in the source code per database type, right?
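To make the idea concrete, here is a minimal sketch of what such a provider extension could look like. Nothing here is existing DataFusion API: the trait, enum, and method names are all hypothetical, and the `datafusion::logical_plan` / `datafusion::error` paths assume the current crate layout.

```rust
// Hypothetical extension alongside TableProvider; it only illustrates the
// four pieces of info listed above, it is not a proposed final design.
use datafusion::error::Result;
use datafusion::logical_plan::LogicalPlan;

/// Which backend a provider talks to (hard coded per provider implementation).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum DatabaseType {
    Postgres,
    MySql,
    ClickHouse,
}

/// What the backend can compute natively (also hard coded per provider),
/// analogous to `supports_filter_pushdown` but for whole plan nodes.
#[derive(Debug, Clone, Default)]
pub struct ComputeCapability {
    pub filters: bool,
    pub projections: bool,
    pub aggregates: bool,
    pub joins: bool,
    pub sorts_and_limits: bool,
}

/// Hypothetical trait a remote-database table provider would also implement.
pub trait RemoteQuerySupport {
    /// Hard coded in the table provider code.
    fn database_type(&self) -> DatabaseType;

    /// Hard coded in the table provider code.
    fn compute_capability(&self) -> ComputeCapability;

    /// Provided as part of the user query (e.g. the catalog/schema the table
    /// was registered under); used to group scans by database instance.
    fn database_instance(&self) -> String;

    /// Compile a trimmed-down sub plan into the backend's native query.
    /// The result does not have to be SQL.
    fn compile_native_query(&self, sub_plan: &LogicalPlan) -> Result<String>;
}
```

With something like this, the planner would group scans by `database_instance()`, trim each sub tree to what `compute_capability()` allows, call `compile_native_query()` on the remainder, and swap that sub tree for a single remote execution node.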

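And a rough, self-contained sketch of the grouping decision itself, using a stand-in `PlanNode` type instead of DataFusion's `LogicalPlan` so it stays version-independent: a sub tree is a pushdown candidate exactly when all scans beneath it resolve to one database instance.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for a logical plan node.
struct PlanNode {
    /// Some(instance) when this node is a table scan on a remote database,
    /// as reported by the provider's database_instance() above.
    scan_instance: Option<String>,
    inputs: Vec<PlanNode>,
}

/// Collect the database instances referenced by all scans under `node`.
fn referenced_instances(node: &PlanNode, out: &mut HashSet<String>) {
    if let Some(instance) = &node.scan_instance {
        out.insert(instance.clone());
    }
    for input in &node.inputs {
        referenced_instances(input, out);
    }
}

/// A sub tree that reads from exactly one database instance can be compiled
/// into a single native query and replaced by one remote-query node.
fn is_pushdown_candidate(node: &PlanNode) -> bool {
    let mut instances = HashSet::new();
    referenced_instances(node, &mut instances);
    instances.len() == 1
}
```

In the optimizer this check would run bottom-up, so the largest candidate sub tree whose operators are also covered by the backend's compute capability is the one that gets rewritten.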