Krishen Bhan created HUDI-7687:
----------------------------------
Summary: Instant should not be archived until replaced file groups
or older file versions are deleted
Key: HUDI-7687
URL: https://issues.apache.org/jira/browse/HUDI-7687
Project: Apache Hudi
Issue Type: Improvement
Reporter: Krishen Bhan
When archival runs, it may consider an instant a candidate for archival even
if the file groups that instant replaced/updated still need to undergo a
`clean`. For example, consider the following scenario, with clean and archive
scheduled/executed independently in different jobs:
# Insert at C1 creates file group f1 in partition
# Replacecommit at RC2 creates file group f2 in partition, and replaces f1
# Any reader of partition that calls HUDI API (with or without using MDT) will
recognize that f1 should be ignored, as it has been replaced. This is because
the RC2 instant file is in the active timeline
# Some more instants are added to timeline. RC2 is now eligible to be cleaned
(as per the table writers' clean policy). Assume though that file groups
replaced by RC2 haven't been deleted yet, such as due to clean repeatedly
failing, async clean not being scheduled yet, or the clean failing to delete
said file groups.
# An archive job is eventually triggered, and archives C1 and RC2. Note that
f1 is still in partition
Now the table has the same consistency issue as seen in
https://issues.apache.org/jira/browse/HUDI-7655 , where replaced file groups
are still in partition and readers may see inconsistent data: with RC2 no
longer in the active timeline, a reader can no longer tell that f1 was replaced.
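
The scenario above can be reproduced with a toy model. This is only a sketch and does not use the Hudi API: `ToyTable`, `commit`, `archive`, and `visible_file_groups` are hypothetical names, and the reader logic is simplified to "hide a file group only if an instant in the active timeline says it was replaced".

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model of a single partition; not the Hudi API."""

    # File groups physically present in the partition.
    files_on_storage: set = field(default_factory=set)
    # Active timeline: instant time -> file groups that instant replaced.
    active_timeline: dict = field(default_factory=dict)

    def commit(self, instant, creates, replaces=()):
        # Writing an instant adds its new file groups and records what it replaced.
        self.files_on_storage |= set(creates)
        self.active_timeline[instant] = set(replaces)

    def archive(self, instant):
        # Archival only drops the instant from the active timeline;
        # it does not delete any data files.
        del self.active_timeline[instant]

    def visible_file_groups(self):
        # A reader ignores a file group only if an active instant replaced it.
        replaced = set().union(*self.active_timeline.values())
        return self.files_on_storage - replaced


table = ToyTable()
table.commit("C1", creates={"f1"})                     # step 1
table.commit("RC2", creates={"f2"}, replaces={"f1"})   # step 2
before = table.visible_file_groups()                   # step 3: only f2 is visible

# Steps 4-5: clean never deleted f1, yet archival removes C1 and RC2.
table.archive("C1")
table.archive("RC2")
after = table.visible_file_groups()                    # f1 is visible again
```

In this model `before` is `{"f2"}` and `after` is `{"f1", "f2"}`, which is the inconsistency described above.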
This situation can be avoided by ensuring that archival will "block" and not go
past an older instant time if it sees that said instant didn't undergo a clean
yet.
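
A minimal sketch of this "block" behavior, under stated assumptions: `archivable_instants` and `pending_cleanup` are hypothetical names, not part of Hudi, and how the set of instants still awaiting a clean would be determined is left open here.

```python
def archivable_instants(instants, pending_cleanup):
    """Return the oldest-first prefix of `instants` that is safe to archive.

    `instants` is a list of instant times ordered oldest first.
    `pending_cleanup` is a hypothetical set of instant times whose replaced
    file groups or older file versions have not been deleted by a clean yet.
    """
    safe = []
    for instant in instants:
        if instant in pending_cleanup:
            # Block here: neither this instant nor any later one is archived.
            break
        safe.append(instant)
    return safe


# RC2 replaced f1, but f1 has not been cleaned yet, so archival stops before RC2.
blocked = archivable_instants(["C1", "RC2", "C3"], pending_cleanup={"RC2"})

# Once the clean has deleted f1, nothing blocks archival any more.
unblocked = archivable_instants(["C1", "RC2", "C3"], pending_cleanup=set())
```

Here `blocked` is `["C1"]` and `unblocked` is `["C1", "RC2", "C3"]`. Stopping at the first uncleaned instant, rather than skipping over it, matches the "not go past" wording above.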
--
This message was sent by Atlassian Jira
(v8.20.10#820010)