Danny Chen created HUDI-2750:
--------------------------------

             Summary: Improve the incremental data files metadata more 
efficiently for streaming source
                 Key: HUDI-2750
                 URL: https://issues.apache.org/jira/browse/HUDI-2750
             Project: Apache Hudi
          Issue Type: Sub-task
          Components: Common Core
            Reporter: Danny Chen
             Fix For: 0.11.0


There are currently 3 ways to fetch the incremental data files for streaming read:

1. Read the incremental commit metadata and resolve the data files to construct 
the incremental filesystem view
2. Scan the filesystem directly and filter the data files by start commit 
time, if consumption starts from the 'earliest' offset
3. As a more efficient alternative to 2, look up the metadata table if it is 
enabled
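
As a rough illustration of way 2, the filtering step can be sketched as below. 
This is not Hudi code: the DataFile type, its fields and the files_since helper 
are hypothetical names, and the sketch assumes that commit times are 
fixed-width timestamp strings (so string order matches time order) and that 
the start commit is inclusive.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file found by scanning the filesystem."""
    path: str
    commit_time: str  # assumed fixed-width timestamp, e.g. "20211110120000"


def files_since(files: Iterable[DataFile], start_commit: str) -> List[DataFile]:
    """Keep only the files written at or after start_commit.

    Fixed-width timestamp strings compare correctly as plain strings,
    so no date parsing is needed here.
    """
    return [f for f in files if f.commit_time >= start_commit]
```

The cost of this approach is the scan itself: every data file has to be listed 
before the filter can be applied.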

However, these 3 ways are far from sufficient for production:

For 1: there is a bottleneck when the start commit time is far in the past and 
the instants may already have been archived. Loading those metadata files takes 
too much time, more than 30 minutes in our production, which is unacceptable.

For 2 and 3: they are only suitable for cases that read the full history plus 
the incremental data set.

We should propose a way to look up the incremental data files for an arbitrary 
time interval of instants, so that the filesystem view can be constructed 
efficiently.
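
The issue does not specify a design. Purely to make the requirement concrete, 
one possible shape is an index from instant time to the data files written at 
that instant, queried by binary search. Everything below is an assumption for 
illustration: InstantFileIndex, add_commit and files_between are hypothetical 
names, instants are assumed to be fixed-width timestamp strings, and the 
interval is treated as closed on both ends.

```python
from bisect import bisect_left, bisect_right, insort
from typing import Dict, Iterable, List


class InstantFileIndex:
    """Hypothetical index: instant time -> data files written at that instant."""

    def __init__(self) -> None:
        self._instants: List[str] = []        # kept sorted
        self._files: Dict[str, List[str]] = {}

    def add_commit(self, instant: str, paths: Iterable[str]) -> None:
        """Record the data files written by one commit instant."""
        if instant not in self._files:
            insort(self._instants, instant)   # preserve sorted order
            self._files[instant] = []
        self._files[instant].extend(paths)

    def files_between(self, start: str, end: str) -> List[str]:
        """Return files of all instants in the closed interval [start, end]."""
        lo = bisect_left(self._instants, start)
        hi = bisect_right(self._instants, end)
        result: List[str] = []
        for instant in self._instants[lo:hi]:
            result.extend(self._files[instant])
        return result
```

With such an index, finding the interval bounds takes two binary searches and 
the rest of the work is proportional to the number of files in the interval, 
rather than to the length of the whole timeline.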




--
This message was sent by Atlassian Jira
(v8.20.1#820001)
