houqp commented on issue #970:
URL: 
https://github.com/apache/arrow-datafusion/issues/970#issuecomment-947443126


   I think with some extension to our existing table provider abstraction, this 
kind of cross table compute push down could be achieved within our logical or 
physical plan optimizer?
   
   Following @hu6360567's logic of splitting logical plans into sub plans, we 
could add a pass to the plan optimizer that groups the query plan tree into sub 
trees by the database instance referenced in table scans. If a sub tree only 
reads tables from the same database, then we can safely convert that sub tree 
into a SQL query and push the query down to that database directly. This can be 
done by rewriting the sub tree into a single plan node that represents a remote 
SQL query execution.
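   
   A minimal sketch of that grouping pass, using a toy plan type rather than 
DataFusion's real `LogicalPlan`. The `Plan` enum, the `Remote` node, and the 
`single_db` and `push_down` functions are all hypothetical names made up for 
illustration:

```rust
// Toy logical plan; DataFusion's real `LogicalPlan` has many more variants.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    // Scan of `table` stored in the database instance `db`.
    Scan { db: String, table: String },
    Filter { predicate: String, input: Box<Plan> },
    Join { left: Box<Plan>, right: Box<Plan> },
    // Hypothetical node: the wrapped sub tree is executed remotely on `db`.
    Remote { db: String, sub_plan: Box<Plan> },
}

// Returns the database a sub tree reads from, or None if it spans several.
fn single_db(plan: &Plan) -> Option<String> {
    match plan {
        Plan::Scan { db, .. } | Plan::Remote { db, .. } => Some(db.clone()),
        Plan::Filter { input, .. } => single_db(input),
        Plan::Join { left, right } => {
            let (l, r) = (single_db(left)?, single_db(right)?);
            if l == r { Some(l) } else { None }
        }
    }
}

// Top-down rewrite: wrap each largest single-database sub tree in `Remote`.
fn push_down(plan: Plan) -> Plan {
    if matches!(plan, Plan::Remote { .. }) {
        return plan; // already rewritten
    }
    if let Some(db) = single_db(&plan) {
        return Plan::Remote { db, sub_plan: Box::new(plan) };
    }
    // The sub tree mixes databases: keep this node local and recurse.
    match plan {
        Plan::Filter { predicate, input } => Plan::Filter {
            predicate,
            input: Box::new(push_down(*input)),
        },
        Plan::Join { left, right } => Plan::Join {
            left: Box::new(push_down(*left)),
            right: Box::new(push_down(*right)),
        },
        other => other,
    }
}
```

   With this sketch, a join of two tables from the same instance collapses into 
one `Remote` node, while a join across two instances stays local and each side 
becomes its own `Remote` node.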
   
   To achieve this, we need to extend the table providers to supply the 
following info:
   
   * database type (hard coded in the table provider code)
   * database compute capability (hard coded in the table provider code) 
   * database logical plan to native query compiler (hard coded in the table 
provider code)
   * database name/identifier (provided as part of a user query)
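   
   As a rough sketch, the four items above could be exposed through a trait 
like the one below. Everything here is an assumption made for illustration: 
`RemoteSource`, `DatabaseType`, `Operator`, `SubPlan`, and `ExampleTable` are 
hypothetical names, not existing DataFusion APIs, and the supported operator 
set is made up:

```rust
// All names below are hypothetical; none of this is DataFusion's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DatabaseType {
    ExampleDb,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Operator {
    Filter,
    Join,
    Window,
}

// Placeholder for a trimmed down logical sub plan.
struct SubPlan {
    table: String,
    predicate: Option<String>,
}

trait RemoteSource {
    // Database type, hard coded in the table provider code.
    fn database_type(&self) -> DatabaseType;
    // Compute capability, hard coded in the table provider code.
    fn supports(&self, op: Operator) -> bool;
    // Logical plan to native query compiler.
    fn compile(&self, plan: &SubPlan) -> String;
    // Database name/identifier, provided as part of a user query.
    fn database_id(&self) -> &str;
}

struct ExampleTable {
    instance: String,
}

impl RemoteSource for ExampleTable {
    fn database_type(&self) -> DatabaseType {
        DatabaseType::ExampleDb
    }

    fn supports(&self, op: Operator) -> bool {
        // Made-up capability set: everything except window functions.
        !matches!(op, Operator::Window)
    }

    fn compile(&self, plan: &SubPlan) -> String {
        match &plan.predicate {
            Some(p) => format!("SELECT * FROM {} WHERE {}", plan.table, p),
            None => format!("SELECT * FROM {}", plan.table),
        }
    }

    fn database_id(&self) -> &str {
        &self.instance
    }
}
```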
   
   Database name/identifier and type help the planner decide how to group the 
plan into sub trees by database instance.
   
   Database compute capability helps the planner narrow down which subset of 
each sub tree can be pushed down.
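   
   To illustrate the trimming step, here is a simplified sketch that treats a 
sub tree as a linear chain of operators ordered from the scan upward. The `Op` 
enum and `split_pushable` are hypothetical names, and a real plan is a tree 
rather than a chain:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op {
    Scan,
    Filter,
    Projection,
    Aggregate,
    Window,
}

// Splits a bottom-up operator chain into (pushed down, kept local).
// Once an unsupported operator is reached, it and everything above it must
// run locally, because those operators consume its output.
fn split_pushable(chain: &[Op], supported: &[Op]) -> (Vec<Op>, Vec<Op>) {
    let cut = chain
        .iter()
        .position(|op| !supported.contains(op))
        .unwrap_or(chain.len());
    (chain[..cut].to_vec(), chain[cut..].to_vec())
}
```

   For a chain of scan, filter, window, projection against a database that 
does not support window functions, only the scan and filter are pushed down; 
the window and the projection above it stay in the local plan.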
   
   The trimmed down sub plan then gets passed from the planner to the table 
provider's native query compiler, which resolves it into the final query. The 
result doesn't need to be SQL; it can be any native query supported by the 
corresponding database type.
   
   Lastly, the planner replaces the trimmed down sub plan with a single remote 
query execution node that carries the compiled native query.
   
   All of the above can be handled during the planning stage.
   
   >> but the biggest change would be allowing datasources to declare if they 
support certain plans.
   
   > That's why I convert sub plan back to sql (logical plans are distributed 
rather than physical plans).
   > At planner node, the general planner do not need to know the capability of 
underlying datasource.
   
   Do we really need to detect data source capability at runtime? Similar to 
how we define filter pushdown capability using 
`TableProvider::supports_filter_pushdown`, we could define this compute 
capability statically in the source code per database type, right?



