[ 
https://issues.apache.org/jira/browse/HUDI-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3657:
-----------------------------
    Fix Version/s: 0.12.0
                       (was: 0.11.0)

> Unbound the restriction that clean retain commits must be smaller than 
> archive minimum commits
> ----------------------------------------------------------------------------------------------
>
>                 Key: HUDI-3657
>                 URL: https://issues.apache.org/jira/browse/HUDI-3657
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: core
>            Reporter: Danny Chen
>            Priority: Major
>             Fix For: 0.12.0
>
>
> End-to-end streaming processing is increasingly popular among Flink users, 
> and in the most typical streaming ingestion scenario the checkpoint interval 
> is a matter of minutes (1 min, 5 mins, ...). Say the user sets the interval 
> to 1 minute; there are then about 60 write commits on the timeline per hour:
> {t1, t2, t3, t4 ...t60}
> Now consider the very popular streaming read scenario: people want to keep 
> the history data for a medium retention time (usually 1 day or even 1 week). 
> Say the user configures the number of commits retained by cleaning as:
> _1 (day) * 24 (hours) * 60 (commits per hour)_ = *1440 commits*
> The current restriction on the cleaning retain commits is:
> _num_retain_commits < min_archive_commits_num_
> So we must keep at least 1440 commits on the active timeline, which means we 
> have at least:
> _1440 * 3 = 4320_
> files on the timeline (3 files per commit). This puts pressure on file IO and 
> on metadata scanning (the metadata client). If we do not configure a long 
> enough retention, the writer may remove old files and the reader then 
> encounters {{FileNotFoundException}}.
> So we should find a way to lift the restriction that the number of commits 
> on the active timeline must be greater than the cleaning retain commits.
> One way I can think of is to remember the last committed cleaning instant 
> and only check that when cleaning (suitable for the hours-based cleaning 
> strategy). With the num_commits cleaning strategy we may need to scan the 
> archive timeline (or the metadata table, if it is enabled?).
> Either way, a solution is urgently needed!
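
The arithmetic in the description can be reproduced with a small sketch. The function names below are hypothetical and are not Hudi APIs; the factor of 3 files per commit and the strict inequality are taken directly from the issue text, and one commit per checkpoint is assumed.

```python
def commits_to_retain(retention_hours: int, checkpoint_interval_minutes: int) -> int:
    """Commits written during the retention window, assuming one commit per checkpoint."""
    return retention_hours * 60 // checkpoint_interval_minutes


def timeline_files(active_commits: int, files_per_commit: int = 3) -> int:
    """Files on the active timeline; the default of 3 per commit comes from the issue text."""
    return active_commits * files_per_commit


def satisfies_current_restriction(num_retain_commits: int, min_archive_commits_num: int) -> bool:
    """The restriction quoted in the issue: num_retain_commits < min_archive_commits_num."""
    return num_retain_commits < min_archive_commits_num


# 1 day of history with a 1 minute checkpoint interval
retain = commits_to_retain(retention_hours=24, checkpoint_interval_minutes=1)
print(retain)                  # 1440
print(timeline_files(retain))  # 4320

# The archive minimum has to exceed the retained commits, so the active
# timeline cannot be kept shorter than the cleaning retention window.
print(satisfies_current_restriction(retain, min_archive_commits_num=retain))      # False
print(satisfies_current_restriction(retain, min_archive_commits_num=retain + 1))  # True
```

Under the same assumptions, one week of history at a 5 minute interval gives `commits_to_retain(24 * 7, 5)`, i.e. 2016 commits and 6048 timeline files.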



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
