kaori-seasons commented on PR #1923:
URL: https://github.com/apache/fluss/pull/1923#issuecomment-3509311051
> Hi @kaori-seasons, we are very interested in this feature so thank you for
driving this effort!
>
> I am curious about the way you are going to support the union reads in
Trino: I see that in this draft you were going to implement a custom page
source that will delegate reads to the corresponding lakehouse. However, I
think that it will require quite a lot of effort, plus this will likely require
re-implementing a lot of other features (like dynamic filters, partition
pruning, etc) that are already present in existing lakehouse connectors. I was
thinking to discuss a different approach. Let's say Trino supported the
union read at the API level, along the lines of the following idea:
>
> 1. The `Metadata` interface would have another method `splitForUnionRead`
which would return a Trino table path + table version (snapshot ID) that should
be used for the union read
> 2. When this result is returned to Trino planner, it would replace the
table scan with a `Union` node, which has the fluss table scan as one leg, and
lakehouse table scan as another leg. A plugin cannot instantiate a TableHandle
of another plugin directly, but this can be implemented inside a Trino planner
rule
> 3. Trino will use all existing optimizations for the corresponding
connector: filters pushdown, partition pruning, etc and it will handle the
interaction with data lakes - no other steps are required from the Fluss
connector
>
> I think this change may benefit other connectors that try to offload
storage to different sources as well (I can imagine building e.g. a postgres
CDC to a lakehouse)
>
> Do you think this is a viable approach, and would it be worth discussing
with the Trino community?

Hi @agoncharuk, I'm glad you're interested in this feature. I apologize for
the late reply; I was away on a business trip over the weekend. In general,
I agree with your suggestion, but the reason I implemented the dynamic filter
and partition pruning features is to support an adaptive strategy, which lets
the connector judge data freshness based on the actual business scenario.
Here's a compromise; I'm not sure whether you'd be okay with it:
1. **Retain the advantages of the existing implementation:** Keep the
intelligent strategy selection and adaptive optimization features.
2. **Enhance integration with the Trino planner:** Add a suggestion method
to the Metadata interface, but still retain our optimization logic.
```java
// A similar method could be added to FlussMetadata
public interface FlussMetadata {
    // Existing methods...

    // New method to support union reads at the Trino planner level
    default Optional<UnionReadInfo> splitForUnionRead(String tableName) {
        // Returns the Trino table path + table version (snapshot ID)
        // so that the Trino planner can create Union nodes
        return Optional.empty();
    }
}
```
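To make the idea concrete, here is a minimal sketch of what the `UnionReadInfo` value could look like. The record name, its fields, and the `FlussMetadataSketch` stub are all assumptions for illustration, not an existing Fluss or Trino API:

```java
import java.util.Optional;

// Hypothetical shape for the value returned by splitForUnionRead:
// the lakehouse table path plus the snapshot ID to pin for the union read.
record UnionReadInfo(String tablePath, long snapshotId) {}

// Illustrative stub: resolves a Fluss table to its lakehouse counterpart,
// pinned at the last snapshot already tiered to the lake.
interface FlussMetadataSketch {
    default Optional<UnionReadInfo> splitForUnionRead(String tableName) {
        // "lake.db." prefix and snapshot 42 are placeholder values
        return Optional.of(new UnionReadInfo("lake.db." + tableName, 42L));
    }
}
```

With this shape, the planner rule could read the snapshot ID off the returned value and build the lakehouse leg of the `Union` node against that pinned version, so both legs see a consistent cut of the data.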
What are your thoughts on this?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]