Danny Chen created HUDI-3657:
--------------------------------
Summary: Unbound the restriction that clean retain commits must be
smaller than archive minimum commits
Key: HUDI-3657
URL: https://issues.apache.org/jira/browse/HUDI-3657
Project: Apache Hudi
Issue Type: Improvement
Components: core
Reporter: Danny Chen
Fix For: 0.11.0
End-to-end streaming processing is increasingly popular among Flink users,
and the typical checkpoint interval for streaming ingestion is a few minutes
(1 min, 5 mins ...). Say the user sets the interval to 1 minute; the timeline
then gets about 60 write commits per hour:
{t1, t2, t3, t4, ..., t60}
Now consider the very popular streaming read scenario: people want to keep
the history data for a medium time-to-live (usually 1 day or even 1 week).
Say the user configures the number of commits retained by the cleaner as:
_1 (day) * 24 (hours) * 60 (commits per hour)_ = *1440 commits*
Given the current restriction on the cleaning retain commits:
_num_retain_commits < min_archive_commits_num_
we must keep at least 1440 commits on the active timeline, which means at
least:
_1440 * 3 = 4320_
files on the timeline (three instant files per commit: requested, inflight and
completed). This puts pressure on file IO and on metadata scanning (the
metadata client). If the configured retain commits do not cover a long enough
time, the writer may remove old files and the reader then encounters a
{{FileNotFoundException}}.
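
The arithmetic above can be sketched as follows. This is only an illustration:
the function names are hypothetical and not part of Hudi, and the factor of 3
files per commit is taken from the calculation in this issue.

```python
def retained_commits(checkpoint_interval_minutes: int, retention_hours: int) -> int:
    """Commits produced during the retention window, one per checkpoint."""
    return (retention_hours * 60) // checkpoint_interval_minutes


def timeline_files(commits: int, files_per_commit: int = 3) -> int:
    """Files on the active timeline, assuming a fixed file count per commit."""
    return commits * files_per_commit


def violates_restriction(num_retain_commits: int, min_archive_commits_num: int) -> bool:
    """True when the rule num_retain_commits < min_archive_commits_num is broken."""
    return num_retain_commits >= min_archive_commits_num


commits_one_day = retained_commits(checkpoint_interval_minutes=1, retention_hours=24)
print(commits_one_day)                  # 1440
print(timeline_files(commits_one_day))  # 4320
```

With a 1 week retention the same calculation gives 10080 commits, so the
active timeline grows linearly with the retention window as long as the
restriction holds.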
So we should find a way to lift the restriction that the number of commits on
the active timeline must be greater than the cleaning retain commits.
One way I can think of is to remember the last committed cleaning instant and
only check against it when cleaning (suitable for the hours-based cleaning
strategy). With the num_commits cleaning strategy we may need to scan the
archived timeline (or the metadata table, if it is enabled?).
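
A rough sketch of the two cases, not an actual Hudi implementation. The
function names are hypothetical, instants are modeled as plain timestamp
strings in an assumed yyyyMMddHHmmss format, and the archived timeline is
assumed to be available as a list:

```python
from datetime import datetime, timedelta
from typing import List

INSTANT_FORMAT = "%Y%m%d%H%M%S"  # assumed instant time format, for illustration


def cutoff_by_hours(latest_instant: str, retain_hours: int) -> str:
    """Hours-based strategy: the cutoff is a timestamp, so no commit counting
    on the active timeline is needed."""
    latest = datetime.strptime(latest_instant, INSTANT_FORMAT)
    return (latest - timedelta(hours=retain_hours)).strftime(INSTANT_FORMAT)


def cutoff_by_commits(active: List[str], archived: List[str], retain_commits: int) -> str:
    """num_commits strategy: count back over active AND archived instants."""
    if retain_commits < 1:
        raise ValueError("retain_commits must be at least 1")
    all_instants = sorted(archived + active)
    if not all_instants:
        raise ValueError("no instants on either timeline")
    if len(all_instants) <= retain_commits:
        return all_instants[0]  # fewer commits than requested: retain everything
    return all_instants[-retain_commits]


print(cutoff_by_hours("20220317120000", 24))  # 20220316120000
print(cutoff_by_commits(
    active=["20220317115900", "20220317120000"],
    archived=["20220317115700", "20220317115800"],
    retain_commits=3,
))  # 20220317115800
```

In the hours-based case only the latest instant is consulted, which is why the
active timeline could stay short; in the num_commits case the cutoff falls in
the archived part whenever retain_commits exceeds the number of active
instants.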
Either way, a solution is urgently needed!