backkem commented on issue #970:
URL: 
https://github.com/apache/arrow-datafusion/issues/970#issuecomment-1793381505

   I took a look at how Presto handles this. Presto uses the concept of a 
`Connector` to represent remote data sources. DF's `TableProviderFactory` is 
similar. Their `TableHandle` (Silimar to DF's `TableReference`) has a 
[connectorID](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-spi/src/main/java/com/facebook/presto/spi/TableHandle.java#L29).
 This allows reasoning about which connector can handle parts of a query.
   
   Aside from allowing table scans, the `Connector`can provide a [logical and 
physical](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-spi/src/main/java/com/facebook/presto/spi/connector/ConnectorPlanOptimizerProvider.java#L20)
 
[ConnectorPlanOptimizer](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-spi/src/main/java/com/facebook/presto/spi/ConnectorPlanOptimizer.java#L30).
 During plan optimization, Presto looks for the largest chunks of the plan that 
are handled by a single connector and passes them to the respective 
`ConnectorPlanOptimizer`. Some connectors take these plan chunks, turn them 
into a query for their data source, and provide back an opaque `TableScanNode`. 
Examples: 
[JdbcComputePushdown](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-base-jdbc/src/main/java/com/facebook/presto/plugin/jdbc/optimization/JdbcComputePushdown.java#L86)
 or [ClickHouseComp
 
utePushdown](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-clickhouse/src/main/java/com/facebook/presto/plugin/clickhouse/optimization/ClickHouseComputePushdown.java#L161).
 This effectively results in query federation.
   
   I expect it will likely be hard to find a good abstraction to express remote 
compute capabilities, due to their high verity. Presto's approach seems 
reasonable to allow granular federation without having to explicitly express 
federation capabilities. It also limits the number of new concepts that need to 
be added. DF can choose to ship basic remote providers (E.g. for 
[ADBC](https://github.com/apache/arrow-datafusion/issues/7731)) and others can 
be provided by 3rd parties.
   
   Would this be a sensible approach to take?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to