[
https://issues.apache.org/jira/browse/HUDI-7687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Y Ethan Guo updated HUDI-7687:
------------------------------
Fix Version/s: 1.0.2
> Instant should not be archived until replaced file groups or older file
> versions are deleted
> --------------------------------------------------------------------------------------------
>
> Key: HUDI-7687
> URL: https://issues.apache.org/jira/browse/HUDI-7687
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Krishen Bhan
> Assignee: sivabalan narayanan
> Priority: Minor
> Labels: archive, clean
> Fix For: 1.0.2
>
>
> When archival runs, it may consider an instant a candidate for archival
> even if the file groups that instant replaced/updated still need to undergo a
> `clean`. For example, consider the following scenario, with clean and archive
> scheduled/executed independently in different jobs:
> # Insert at C1 creates file group f1 in partition
> # Replacecommit at RC2 creates file group f2 in partition, and replaces f1
> # Any reader of partition that calls the HUDI API (with or without using MDT)
> will recognize that f1 should be ignored, as it has been replaced. This is
> because the RC2 instant file is in the active timeline
> # Some more instants are added to the timeline. RC2 is now eligible to be
> cleaned (as per the table writers' clean policy). Assume, though, that the file
> groups replaced by RC2 haven't been deleted yet, for example due to clean
> repeatedly failing, async clean not being scheduled yet, or the clean failing
> to delete said file groups.
> # An archive job eventually is triggered, and archives C1 and RC2. Note that
> f1 is still in partition
> Now the table has the same consistency issue as seen in
> https://issues.apache.org/jira/browse/HUDI-7655 , where replaced file groups
> are still in partition and readers may see inconsistent data.
>
> This situation can be avoided by ensuring that archival will "block" and not
> go past an older instant time if it sees that said instant didn't undergo a
> clean yet.
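
The "block" behaviour proposed above can be sketched roughly as follows. This is
a toy model only, not Hudi code: `Instant`, `archivable_prefix`, and the set of
file groups on storage are hypothetical names made up for illustration, and the
sketch only covers replaced file groups (not older file versions). It assumes the
active timeline is ordered oldest first and that archival removes a contiguous
prefix of it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instant:
    # Hypothetical stand-in for a timeline instant; not a Hudi class.
    time: str
    # File groups this instant replaced (empty for a plain commit).
    replaced_file_groups: frozenset = frozenset()


def archivable_prefix(active_timeline, file_groups_on_storage, candidates):
    """Return instant times that may be archived, oldest first.

    Archival stops ("blocks") at the first instant that is either not an
    archival candidate or whose replaced file groups are still on storage,
    i.e. have not been removed by a clean yet.
    """
    to_archive = []
    for instant in active_timeline:  # assumed sorted oldest first
        if instant.time not in candidates:
            break
        if instant.replaced_file_groups & file_groups_on_storage:
            break  # replaced file groups not cleaned yet: do not go past
        to_archive.append(instant.time)
    return to_archive


# Scenario from the description: C1 creates f1, RC2 creates f2 and replaces f1.
timeline = [
    Instant("C1"),
    Instant("RC2", frozenset({"f1"})),
    Instant("C3"),
]

# Clean has not deleted f1 yet, so archival must stop before RC2.
print(archivable_prefix(timeline, {"f1", "f2"}, {"C1", "RC2"}))  # ['C1']

# After a successful clean removes f1, both C1 and RC2 can be archived.
print(archivable_prefix(timeline, {"f2"}, {"C1", "RC2"}))  # ['C1', 'RC2']
```

In this model RC2 stays in the active timeline for as long as f1 is on storage,
so a reader can still tell that f1 was replaced.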
--
This message was sent by Atlassian Jira
(v8.20.10#820010)