[jira] [Commented] (HUDI-2750) Improve the incremental data files metadata more efficiently for streaming source

Lin Liu (Jira) Tue, 12 Sep 2023 16:55:07 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764436#comment-17764436
 ]


Lin Liu commented on HUDI-2750:
-------------------------------

[~danny0405], [~vinoth], it has been a while since this task was filed. 
Could you please check the task description and see if anything needs to be 
updated?

> Improve the incremental data files metadata more efficiently for streaming 
> source
> ---------------------------------------------------------------------------------
>
>                 Key: HUDI-2750
>                 URL: https://issues.apache.org/jira/browse/HUDI-2750
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: Common Core
>            Reporter: Danny Chen
>            Assignee: Lin Liu
>            Priority: Major
>             Fix For: 1.0.0
>
>
> There are currently 3 ways to fetch the incremental data files for streaming 
> read:
> 1. Read the incremental commit metadata and resolve the data files to 
> construct the incremental filesystem view.
> 2. Scan the filesystem directly and filter the data files by start commit 
> time, if consumption starts from the 'earliest' offset.
> 3. For 2, there is a more efficient way: look up the metadata table if it 
> is enabled.
> However, these 3 ways are far from sufficient for production:
> for 1: there is a bottleneck when the start commit time is far in the past 
> and the instants may have been archived; loading those metadata files takes 
> too long (more than 30 minutes in our production), which is unacceptable.
> for 2&3: they are only suitable for cases that read the full history and 
> incremental data set.
> We should propose a way to look up the incremental data files for an 
> arbitrary interval of instants, so that the filesystem view can be 
> constructed efficiently.
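
To make the proposal in the description concrete, here is a minimal sketch of
an index that maps each commit instant to the data files it wrote and answers
range lookups without loading every commit metadata file. It is not Hudi
code: the class InstantFileIndex, its methods, and the file paths are all
hypothetical, and it assumes instant times are fixed-width timestamp strings
so that string order matches chronological order.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class InstantFileIndex:
    """Hypothetical in-memory index: commit instant time -> data files written."""

    def __init__(self):
        self._instants = []              # instant times, kept sorted
        self._files = defaultdict(list)  # instant time -> data file paths

    def add_commit(self, instant, files):
        # Assumption: instants are fixed-width timestamp strings, so
        # lexicographic order equals chronological order.
        pos = bisect_left(self._instants, instant)
        if pos == len(self._instants) or self._instants[pos] != instant:
            self._instants.insert(pos, instant)
        self._files[instant].extend(files)

    def files_between(self, start, end):
        """Return files written by commits in the interval (start, end]."""
        lo = bisect_right(self._instants, start)  # first instant > start
        hi = bisect_right(self._instants, end)    # first instant > end
        return [f for t in self._instants[lo:hi] for f in self._files[t]]


index = InstantFileIndex()
index.add_commit("20230901120000", ["p1/f1.parquet"])
index.add_commit("20230905120000", ["p1/f2.parquet", "p2/f3.parquet"])
index.add_commit("20230910120000", ["p2/f4.parquet"])

# Files written after the first commit, up to and including the second.
print(index.files_between("20230901120000", "20230905120000"))
```

Finding the interval takes two binary searches, so the cost depends on the
number of instants inside the interval rather than on how far back the start
commit lies. Where such an index would be stored, and how it would cover
archived instants, is left open here.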



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
