backkem commented on issue #970: URL: https://github.com/apache/arrow-datafusion/issues/970#issuecomment-1793381505
I took a look at how Presto handles this. Presto uses the concept of a `Connector` to represent remote data sources. DF's `TableProviderFactory` is similar. Their `TableHandle` (Silimar to DF's `TableReference`) has a [connectorID](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-spi/src/main/java/com/facebook/presto/spi/TableHandle.java#L29). This allows reasoning about which connector can handle parts of a query. Aside from allowing table scans, the `Connector`can provide a [logical and physical](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-spi/src/main/java/com/facebook/presto/spi/connector/ConnectorPlanOptimizerProvider.java#L20) [ConnectorPlanOptimizer](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-spi/src/main/java/com/facebook/presto/spi/ConnectorPlanOptimizer.java#L30). During plan optimization, Presto looks for the largest chunks of the plan that are handled by a single connector and passes them to the respective `ConnectorPlanOptimizer`. Some connectors take these plan chunks, turn them into a query for their data source, and provide back an opaque `TableScanNode`. Examples: [JdbcComputePushdown](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-base-jdbc/src/main/java/com/facebook/presto/plugin/jdbc/optimization/JdbcComputePushdown.java#L86) or [ClickHouseComp utePushdown](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-clickhouse/src/main/java/com/facebook/presto/plugin/clickhouse/optimization/ClickHouseComputePushdown.java#L161). This effectively results in query federation. I expect it will likely be hard to find a good abstraction to express remote compute capabilities, due to their high verity. Presto's approach seems reasonable to allow granular federation without having to explicitly express federation capabilities. It also limits the number of new concepts that need to be added. DF can choose to ship basic remote providers (E.g. for [ADBC](https://github.com/apache/arrow-datafusion/issues/7731)) and others can be provided by 3rd parties. Would this be a sensible approach to take? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
