[ 
https://issues.apache.org/jira/browse/HUDI-7687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Y Ethan Guo updated HUDI-7687:
------------------------------
    Fix Version/s: 1.0.2

> Instant should not be archived until replaced file groups or older file 
> versions are deleted
> --------------------------------------------------------------------------------------------
>
>                 Key: HUDI-7687
>                 URL: https://issues.apache.org/jira/browse/HUDI-7687
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Krishen Bhan
>            Assignee: sivabalan narayanan
>            Priority: Minor
>              Labels: archive, clean
>             Fix For: 1.0.2
>
>
> When archival runs, it may consider an instant as a candidate for archival 
> even if the file groups that instant replaced/updated still need to undergo a 
> `clean`. For example, consider the following scenario, with clean and archive 
> scheduled/executed independently in different jobs:
>  # Insert at C1 creates file group f1 in partition
>  # Replacecommit at RC2 creates file group f2 in partition, and replaces f1
>  # Any reader of partition that calls HUDI API (with or without using MDT) 
> will recognize that f1 should be ignored, as it has been replaced. This is 
> because the RC2 instant file is in the active timeline
>  # Some more instants are added to timeline. RC2 is now eligible to be 
> cleaned (as per the table writers' clean policy). Assume, though, that the file 
> groups replaced by RC2 haven't been deleted yet, for example because clean 
> keeps failing, async clean has not been scheduled yet, or the clean failed 
> to delete said file groups.
>  # An archive job eventually is triggered, and archives C1 and RC2. Note that 
> f1 is still in partition
> Now the table has the same consistency issue as seen in 
> https://issues.apache.org/jira/browse/HUDI-7655 , where replaced file groups 
> are still in partition and readers may see inconsistent data. 
>  
> This situation can be avoided by ensuring that archival will "block" and not 
> go past an older instant time if it sees that said instant didn't undergo a 
> clean yet. 
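
The scenario and the proposed "block" can be illustrated with a toy model.
This is only a sketch under stated assumptions, not Hudi's API or the actual
fix: Instant, visible_file_groups and safe_to_archive are hypothetical names,
the active timeline is a plain ordered list, and only replaced file groups
(not older file versions) are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instant:
    # Hypothetical stand-in for a timeline instant (not a Hudi class).
    time: str
    replaced: frozenset = frozenset()  # file groups this instant replaced


def visible_file_groups(active_timeline, on_storage):
    """File groups a reader would treat as live: everything on storage
    that no instant in the active timeline marks as replaced."""
    replaced = set()
    for instant in active_timeline:
        replaced |= instant.replaced
    return set(on_storage) - replaced


def safe_to_archive(active_timeline, on_storage, candidates):
    """Longest prefix of the oldest `candidates` instants that can be
    archived. Stops ("blocks") at the first instant whose replaced file
    groups are still on storage, i.e. have not been cleaned yet."""
    safe = []
    for instant in active_timeline[:candidates]:
        if instant.replaced & set(on_storage):
            break  # do not go past an instant that still needs a clean
        safe.append(instant)
    return safe


# Steps 1-4: C1 creates f1, RC2 replaces f1 with f2, more commits follow,
# and clean has not yet deleted f1.
timeline = [
    Instant("C1"),
    Instant("RC2", frozenset({"f1"})),
    Instant("C3"),
    Instant("C4"),
]
storage = {"f1", "f2"}

# Step 3: with RC2 in the active timeline, readers ignore f1.
print(sorted(visible_file_groups(timeline, storage)))

# Step 5 without blocking: C1 and RC2 are archived, so f1 becomes visible.
print(sorted(visible_file_groups(timeline[2:], storage)))

# With blocking: archival stops before RC2 until clean removes f1.
print([i.time for i in safe_to_archive(timeline, storage, 2)])
print([i.time for i in safe_to_archive(timeline, {"f2"}, 2)])
```

In this model, archival may still archive C1 while f1 is on storage, because
C1 itself replaced nothing; it only stops at RC2. Once f1 is removed from
storage, both C1 and RC2 become archivable.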



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
