agoncharuk commented on PR #1923:
URL: https://github.com/apache/fluss/pull/1923#issuecomment-3510835398
Sure, I definitely agree that using an adaptive approach is good and makes
sense. I also agree on the `TimeBoundaryAnalysis` you mentioned: the read
should be split into two non-overlapping datasets based on the Fluss tiering
metadata.
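A minimal sketch of that split, assuming a single tiering-boundary timestamp obtained from Fluss metadata (all names here are illustrative placeholders, not the actual Fluss API):

```java
import java.time.Instant;
import java.util.Optional;

// Illustrative only: split a requested scan range at the tiering boundary
// so the lakehouse read and the Fluss log read never overlap.
record TimeRange(Instant from, Instant to) {}

record SplitScan(Optional<TimeRange> lakehouseRange, Optional<TimeRange> logRange) {}

final class TimeBoundarySplitter {
    // Assumption: everything strictly before the boundary is already tiered
    // to the lake; everything at or after it is still in the Fluss log.
    static SplitScan split(TimeRange requested, Instant tieringBoundary) {
        if (!requested.to().isAfter(tieringBoundary)) {
            // Entire range is historical.
            return new SplitScan(Optional.of(requested), Optional.empty());
        }
        if (!requested.from().isBefore(tieringBoundary)) {
            // Entire range is still in the log.
            return new SplitScan(Optional.empty(), Optional.of(requested));
        }
        // Range straddles the boundary: cut it into two disjoint pieces.
        return new SplitScan(
                Optional.of(new TimeRange(requested.from(), tieringBoundary)),
                Optional.of(new TimeRange(tieringBoundary, requested.to())));
    }
}
```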
The only thing I contest is whether `FlussLakehouseReader` is needed,
because once we determine that historical data is required:
* In your `FlussLakehouseReader` you mention that for e.g. Iceberg you
would need to instantiate an Iceberg catalog, list splits, do some filtering
based on partition columns, etc. This would need to be implemented for each
data lake format inside `FlussLakehouseReader`, whereas it can instead be
delegated to the existing Trino lakehouse connectors.
* If I understand correctly, the basic unit of parallelism (as per
`FlussSplit`) is a table bucket, which is needed to preserve ordering
guarantees. Once we get to the lakehouse, we likely read more data and can
increase parallelism based on e.g. Parquet row groups, which is also already
implemented in the Trino lakehouse connectors.
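The delegation idea in the bullets above could be sketched roughly as follows. All interfaces here are hypothetical placeholders for illustration, not the real Trino SPI or Fluss classes:

```java
import java.util.List;

// Hypothetical placeholder: whatever an existing lakehouse connector
// (Iceberg, Paimon, ...) exposes for planning splits of a table.
interface LakeSplitSource {
    // e.g. one split per Parquet row group, after partition pruning.
    List<String> listSplits(String table);
}

// The Fluss connector only routes the historical portion of the scan;
// catalog access, split listing, and pruning stay in the delegate,
// so nothing is reimplemented per lake format.
final class DelegatingHistoricalReader {
    private final LakeSplitSource delegate;

    DelegatingHistoricalReader(LakeSplitSource delegate) {
        this.delegate = delegate;
    }

    List<String> planHistoricalSplits(String table) {
        return delegate.listSplits(table);
    }
}
```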
Or do you envision scenarios where the Fluss connector cannot know at
planning time whether historical data is needed?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]