[ 
https://issues.apache.org/jira/browse/HUDI-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445162#comment-17445162
 ] 

Vinoth Chandar edited comment on HUDI-2750 at 11/17/21, 1:32 PM:
-----------------------------------------------------------------

+1 on this. Dumping my thoughts here. When the start commit is far in the 
past, 2/3 can be more performant, since they already filter out the files 
that have been cleaned, etc. Reading the entire archived timeline log can be 
time consuming. 

I think we can index the timeline as well and support efficient range 
retrievals. But I am wondering why you think 2/3 are only suitable for 
full-history reads. Is it because the log files don't have the delta commit 
instant in their names today? With these (at least on object storage), we can 
figure out which files changed between any given interval, right?

Is this the gap?
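
To make the "index the timeline and support efficient range retrievals" idea concrete, here is a minimal sketch. It is not Hudi code: `TimelineIndex`, `files_between`, and the sample paths are hypothetical, and it assumes instant times are fixed-width timestamp strings that sort lexicographically.

```python
from bisect import bisect_right


class TimelineIndex:
    """Hypothetical in-memory index: sorted instants -> files each one wrote."""

    def __init__(self, commits):
        # commits: dict mapping an instant time string to a list of file paths
        self._instants = sorted(commits)
        self._files = commits

    def files_between(self, start, end):
        """Return files written by instants in the interval (start, end]."""
        lo = bisect_right(self._instants, start)  # first instant after start
        hi = bisect_right(self._instants, end)    # one past the last instant <= end
        changed = []
        for instant in self._instants[lo:hi]:
            changed.extend(self._files[instant])
        return changed


# Assumed sample data, for illustration only.
index = TimelineIndex({
    "20211115090000": ["p1/f1.parquet"],
    "20211116090000": ["p1/f2.parquet", "p2/f3.parquet"],
    "20211117090000": ["p2/f4.parquet"],
})
changed = index.files_between("20211115090000", "20211116090000")
```

With a sorted index like this, locating the interval takes two binary searches, so only the instants inside the interval are visited rather than the whole timeline.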



> Improve the incremental data files metadata more efficiently for streaming 
> source
> ---------------------------------------------------------------------------------
>
>                 Key: HUDI-2750
>                 URL: https://issues.apache.org/jira/browse/HUDI-2750
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Common Core
>            Reporter: Danny Chen
>            Priority: Major
>             Fix For: 0.11.0
>
>
> There are currently 3 ways to fetch the incremental data files for 
> streaming reads:
> 1. Read the incremental commit metadata and resolve the data files to 
> construct the incremental filesystem view
> 2. Scan the filesystem directly and filter the data files by start commit 
> time, if consumption starts from the 'earliest' offset
> 3. For 2, there is a more efficient way: look up the metadata table if it 
> is enabled
> These 3 ways are far from enough for production:
> for 1: there is a bottleneck when the start commit time is far in the past 
> and the instants may have been archived; loading those metadata files takes 
> too much time (more than 30 minutes in our production), which is 
> unacceptable.
> for 2&3: they are only suitable for cases that read the full history and 
> incremental data set.
> We should propose a way to look up the incremental data files for an 
> arbitrary interval of instants, to construct the filesystem view efficiently.
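
As a rough illustration of the arbitrary-interval lookup asked for above, the sketch below filters a file listing by the instant embedded in each file name. The `<fileId>_<instantTime>.parquet` layout, the helper names, and the sample listing are assumptions made for this example and are not Hudi's actual naming scheme or API; instant times are assumed to be fixed-width timestamp strings that compare lexicographically.

```python
def instant_of(file_name):
    """Extract the instant from a hypothetical "<fileId>_<instantTime>.parquet" name."""
    stem = file_name.rsplit(".", 1)[0]   # drop the extension
    return stem.rsplit("_", 1)[1]        # text after the last underscore


def files_in_interval(file_names, start, end):
    """Keep files whose instant falls in the interval (start, end]."""
    return [name for name in file_names if start < instant_of(name) <= end]


# Assumed sample listing, for illustration only.
listing = [
    "f1_20211115090000.parquet",
    "f2_20211116090000.parquet",
    "f3_20211117090000.parquet",
]
selected = files_in_interval(listing, "20211115090000", "20211117090000")
```

This still scans the full listing, so it only shows the filtering step; avoiding the scan is what an index over the timeline or the metadata table would be for.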



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
