[
https://issues.apache.org/jira/browse/HUDI-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764436#comment-17764436
]
Lin Liu commented on HUDI-2750:
-------------------------------
[~danny0405], [~vinoth], it has been a while since this task was filed.
Could you please check the task description and see whether anything needs to
be updated?
> Improve the incremental data files metadata more efficiently for streaming
> source
> ---------------------------------------------------------------------------------
>
> Key: HUDI-2750
> URL: https://issues.apache.org/jira/browse/HUDI-2750
> Project: Apache Hudi
> Issue Type: Task
> Components: Common Core
> Reporter: Danny Chen
> Assignee: Lin Liu
> Priority: Major
> Fix For: 1.0.0
>
>
> There are currently 3 ways to fetch the incremental data files for streaming
> read:
> 1. Read the incremental commit metadata and resolve the data files to
> construct the incremental filesystem view
> 2. Scan the filesystem directly and filter the data files by start commit
> time if consumption starts from the 'earliest' offset
> 3. For 2, there is a more efficient way: look up the metadata table if it
> is enabled
> However, these 3 ways are far from enough for production:
> for 1: there is a bottleneck when the start commit time is far in the past
> and the instants may have been archived. Loading those metadata files takes
> too much time: in our production, more than 30 minutes, which is
> unacceptable.
> for 2&3: they are only suitable for cases that read the full history and
> incremental data set.
> We should propose a way to look up the incremental data files for an
> arbitrary time interval of instants, to construct the filesystem view
> efficiently.
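
For reference, way 2 in the description above (scan the files and filter them by commit time) can be sketched roughly as follows. This is only an illustration, not Hudi code: the helper names `instant_of` and `files_in_range` are hypothetical, and the file name layout `<fileId>_<writeToken>_<instant>.<ext>` with fixed-width timestamp instants is an assumption made for the example.

```python
from typing import Iterable, List, Optional


def instant_of(file_name: str) -> str:
    """Return the commit instant embedded in a data file name.

    Assumes (for illustration) names shaped like
    <fileId>_<writeToken>_<instant>.<ext>.
    """
    stem = file_name.rsplit(".", 1)[0]   # drop the extension
    return stem.rsplit("_", 1)[1]        # last '_'-separated field


def files_in_range(file_names: Iterable[str],
                   start: str,
                   end: Optional[str] = None) -> List[str]:
    """Keep files whose instant lies in [start, end]; end=None means open-ended.

    Instants are assumed to be fixed-width timestamp strings, so plain
    string comparison orders them chronologically.
    """
    selected = []
    for name in file_names:              # linear scan over every listed file
        instant = instant_of(name)
        if instant >= start and (end is None or instant <= end):
            selected.append(name)
    return selected
```

The filter itself is cheap, but it still has to visit every listed file, which is consistent with the point above that this approach fits full-history reads rather than lookups over an arbitrary interval.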
--
This message was sent by Atlassian Jira
(v8.20.10#820010)