agoncharuk commented on PR #1923:
URL: https://github.com/apache/fluss/pull/1923#issuecomment-3510835398

   Sure, I definitely agree that an adaptive approach is good and makes sense. I also agree with the `TimeBoundaryAnalysis` you mentioned: the read should be split into two non-overlapping datasets based on Fluss tiering metadata.
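   To make that split concrete, here is a minimal sketch of how such a time-boundary planner could look. This is not actual Fluss or Trino code; `tieredUpToTimestamp`, `ScanRange`, and `ScanPlan` are hypothetical names used only for illustration:

```java
import java.util.Optional;

final class TimeBoundaryPlanner {

    /** Half-open time range [startInclusive, endExclusive). */
    record ScanRange(long startInclusive, long endExclusive) {}

    /** Two non-overlapping ranges: historical data from the lake, fresh data from Fluss. */
    record ScanPlan(Optional<ScanRange> lakehouseRange, Optional<ScanRange> flussRange) {}

    /**
     * Splits the requested time range at the tiering boundary reported by Fluss
     * metadata: everything below the boundary is read from the lakehouse,
     * everything at or above it from Fluss log segments.
     */
    static ScanPlan plan(long requestedStart, long requestedEnd, long tieredUpToTimestamp) {
        Optional<ScanRange> lake = requestedStart < tieredUpToTimestamp
                ? Optional.of(new ScanRange(requestedStart, Math.min(requestedEnd, tieredUpToTimestamp)))
                : Optional.empty();
        Optional<ScanRange> fluss = requestedEnd > tieredUpToTimestamp
                ? Optional.of(new ScanRange(Math.max(requestedStart, tieredUpToTimestamp), requestedEnd))
                : Optional.empty();
        return new ScanPlan(lake, fluss);
    }
}
```

   If the requested range lies entirely above the tiering boundary, `lakehouseRange` stays empty and no lakehouse access is needed at all, which is related to the planning-time question at the end.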
   The only thing I contest is whether `FlussLakehouseReader` is needed at all, because once we know that historical data is required:
    * In your `FlussLakehouseReader` you mention that, for Iceberg for example, you would need to instantiate an Iceberg catalog, list splits, filter on partition columns, etc. This would have to be reimplemented for every supported data lake inside `FlussLakehouseReader`, whereas it could instead be delegated to the existing Trino lakehouse connectors (a rough sketch of this delegation follows this list).
    * If I understand correctly, the basic unit of parallelism (as per `FlussSplit`) is a table bucket, which is needed to preserve ordering guarantees. Once we reach the lakehouse, we will likely read much more data and can increase parallelism based on, for example, Parquet row groups, which is also already implemented in the Trino lakehouse connectors.
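   To illustrate the delegation from both points above, here is a rough sketch; `SplitSource`, `HybridSplitSource`, and the split types are hypothetical stand-ins, not the actual Fluss or Trino SPI:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an engine-facing split enumerator. */
interface SplitSource {
    List<Object> listSplits();
}

/**
 * Combines Fluss bucket splits (fresh data, ordering preserved) with splits
 * produced by the existing lakehouse connector (historical data).
 */
final class HybridSplitSource implements SplitSource {

    private final SplitSource flussBucketSplits;  // one split per table bucket
    private final SplitSource lakehouseSplits;    // delegated to the Iceberg/other lake connector

    HybridSplitSource(SplitSource flussBucketSplits, SplitSource lakehouseSplits) {
        this.flussBucketSplits = flussBucketSplits;
        this.lakehouseSplits = lakehouseSplits;
    }

    @Override
    public List<Object> listSplits() {
        // No FlussLakehouseReader needed: the lake connector already handles
        // catalog access, partition pruning, and finer-grained parallelism
        // such as one split per Parquet row group.
        List<Object> splits = new ArrayList<>(lakehouseSplits.listSplits());
        splits.addAll(flussBucketSplits.listSplits());
        return splits;
    }
}
```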
   
   Or do you envision scenarios where the Fluss connector cannot know at planning time whether historical data is needed?

