Re: [I] Execute LogicalPlan on DBMS directly [arrow-datafusion]

via GitHub Sat, 04 Nov 2023 01:30:16 -0700


backkem commented on issue #970:
URL: 
https://github.com/apache/arrow-datafusion/issues/970#issuecomment-1793381505

I took a look at how Presto handles this. Presto uses the concept of a
`Connector` to represent remote data sources. DF's `TableProviderFactory` is
similar. Their `TableHandle` (similar to DF's `TableReference`) has a
[connectorID](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-spi/src/main/java/com/facebook/presto/spi/TableHandle.java#L29).
This allows reasoning about which connector can handle parts of a query.

Aside from allowing table scans, the `Connector` can provide a [logical and
physical](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-spi/src/main/java/com/facebook/presto/spi/connector/ConnectorPlanOptimizerProvider.java#L20)
[ConnectorPlanOptimizer](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-spi/src/main/java/com/facebook/presto/spi/ConnectorPlanOptimizer.java#L30).
During plan optimization, Presto looks for the largest chunks of the plan that
are handled by a single connector and passes them to the respective
`ConnectorPlanOptimizer`. Some connectors take these plan chunks, turn them
into a query for their data source, and provide back an opaque `TableScanNode`.
Examples:
[JdbcComputePushdown](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-base-jdbc/src/main/java/com/facebook/presto/plugin/jdbc/optimization/JdbcComputePushdown.java#L86)
or [ClickHouseComputePushdown](https://github.com/prestodb/presto/blob/e9b4ed42c2ab5ec77acd380f3697ecd680737df7/presto-clickhouse/src/main/java/com/facebook/presto/plugin/clickhouse/optimization/ClickHouseComputePushdown.java#L161).
This effectively results in query federation.
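
To make the idea concrete, here is a toy sketch in Python of the "largest chunk owned by a single connector" step. It is not Presto or DataFusion code: `Node`, `single_connector`, `federate` and the `remote_scan` operator name are all hypothetical, and a real optimizer would also check that the connector can actually execute each operator in the chunk before pushing it down.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical plan node; not a Presto or DataFusion type."""
    op: str                           # e.g. "scan", "filter", "join"
    connector: Optional[str] = None   # set on scans and on pushed-down chunks
    children: List["Node"] = field(default_factory=list)


def single_connector(node: Node) -> Optional[str]:
    """Return the connector id if every scan below `node` uses the same
    connector, otherwise None."""
    if node.op == "scan":
        return node.connector
    ids = {single_connector(child) for child in node.children}
    # Exactly one distinct answer from all children means one owner
    # (that answer may itself be None if a child is already mixed).
    return ids.pop() if len(ids) == 1 else None


def federate(node: Node) -> Node:
    """Replace each maximal single-connector subtree with an opaque
    'remote_scan' node, working top-down so the largest chunk wins."""
    if node.op == "scan":
        return node  # a bare scan is already handled by its connector
    owner = single_connector(node)
    if owner is not None:
        # Assumption: the connector can run the whole chunk remotely.
        return Node("remote_scan", connector=owner)
    return Node(node.op, node.connector, [federate(c) for c in node.children])


# Usage: a join across two sources; only the filter over "pg" is pushed down.
plan = Node("join", children=[
    Node("filter", children=[Node("scan", "pg")]),
    Node("scan", "ch"),
])
rewritten = federate(plan)
print([(c.op, c.connector) for c in rewritten.children])
```

In this sketch the join stays local because its inputs come from two connectors, while the filter-over-scan chunk collapses into a single opaque node owned by `pg`. For simplicity `single_connector` is recomputed at every level; a real pass would compute ownership once, bottom-up.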

I expect it will be hard to find a good abstraction for expressing remote
compute capabilities, because they vary so widely. Presto's approach seems
reasonable: it allows granular federation without having to explicitly express
federation capabilities. It also limits the number of new concepts that need to
be added. DF can choose to ship basic remote providers (e.g. for
[ADBC](https://github.com/apache/arrow-datafusion/issues/7731)) and others can
be provided by 3rd parties.

Would this be a sensible approach to take?
