Danny Chen created HUDI-3657:
--------------------------------

             Summary: Unbound the restriction that clean retain commits must be 
smaller than archive minimum commits
                 Key: HUDI-3657
                 URL: https://issues.apache.org/jira/browse/HUDI-3657
             Project: Apache Hudi
          Issue Type: Improvement
          Components: core
            Reporter: Danny Chen
             Fix For: 0.11.0


End-to-end streaming processing is increasingly popular among Flink users, 
and in the most typical streaming ingestion scenario the checkpoint interval 
is within minutes (1 min, 5 mins, ...). Say the user sets the interval to 
1 minute; there are then about 60 write commits on the timeline for each 
hour.

{t1, t2, t3, t4 ...t60}

Now consider the very popular streaming read scenario: people want to keep 
the history data for a medium time-to-live (usually 1 day or even 1 week). 
Say the user configures the number of commits retained by cleaning as:

_1 (day) * 24 (hours) * 60 (commits per hour)_ = *1440 commits*

Now consider the current restriction on the cleaning retain commits:

_num_retain_commits < min_archive_commits_num_
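
The restriction can be read as a simple validation over the two settings. The 
following is only a minimal sketch: the class and method names are hypothetical 
and are not Hudi's actual code; the parameter names follow the formula above.

```java
// Hypothetical helper, not part of Hudi: mirrors the restriction quoted above.
final class RetentionCheck {

    /** Returns true when num_retain_commits < min_archive_commits_num holds. */
    static boolean satisfiesRestriction(int numRetainCommits, int minArchiveCommitsNum) {
        return numRetainCommits < minArchiveCommitsNum;
    }

    public static void main(String[] args) {
        // Retaining 1440 commits forces the archive minimum above 1440.
        System.out.println(satisfiesRestriction(1440, 1441)); // prints "true"
        // A small archive minimum is rejected under the current restriction.
        System.out.println(satisfiesRestriction(1440, 30));   // prints "false"
    }
}
```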

We must then keep at least 1440 commits on the active timeline, which means 
we have at least:

_1440 * 3 = 4320_

files on the timeline (counting 3 files per commit). This puts pressure on 
file I/O and on metadata scanning (the metadata client). If the retained 
commits do not cover a long enough time span, the writer may remove old files 
and the reader may encounter {{FileNotFoundException}}.
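
The arithmetic above can be reproduced for other retention windows and 
checkpoint intervals. This is a rough sketch only: the class and method names 
are hypothetical, and the factor of 3 files per commit is taken directly from 
the calculation in this issue rather than from any Hudi API.

```java
// Hypothetical helper for the arithmetic above; names are illustrative, not Hudi APIs.
final class TimelineEstimate {

    // Factor taken from the issue's own arithmetic (1440 * 3).
    static final int FILES_PER_COMMIT = 3;

    /** Commits produced over the retention window, assuming one commit per checkpoint. */
    static long retainCommits(int days, int checkpointIntervalMinutes) {
        return (long) days * 24 * 60 / checkpointIntervalMinutes;
    }

    /** Lower bound on timeline files when that many commits stay on the active timeline. */
    static long minTimelineFiles(long commitsOnActiveTimeline) {
        return commitsOnActiveTimeline * FILES_PER_COMMIT;
    }

    public static void main(String[] args) {
        long commits = retainCommits(1, 1); // 1 day, 1-minute checkpoints
        System.out.println(commits + " commits, at least "
                + minTimelineFiles(commits) + " timeline files");
    }
}
```

For the 1-day, 1-minute case this gives 1440 commits and 4320 files; a 1-week 
window with 5-minute checkpoints gives 2016 commits and 6048 files.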

So we need to find a way to lift the restriction that the number of commits 
on the active timeline must be greater than the cleaning retain commits.

One way I can think of is to remember the last committed cleaning instant 
and only check that when cleaning (suitable for the hours-based cleaning 
strategy). With the num_commits cleaning strategy we may need to scan the 
archived timeline (or the metadata table, if it is enabled?).

Either way, a solution is urgently needed.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
