Krishen Bhan created HUDI-7687:
----------------------------------

             Summary: Instant should not be archived until replaced file groups 
or older file versions are deleted
                 Key: HUDI-7687
                 URL: https://issues.apache.org/jira/browse/HUDI-7687
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Krishen Bhan


When archival runs, it may consider an instant as a candidate for archival even 
if the file groups that instant replaced/updated still need to undergo a 
`clean`. For example, consider the following scenario with clean and archival 
scheduled/executed independently in different jobs:
 # Insert at C1 creates file group f1 in partition
 # Replacecommit at RC2 creates file group f2 in partition, and replaces f1
 # Any reader of the partition that calls the HUDI API (with or without using 
MDT) will recognize that f1 should be ignored, as it has been replaced. This is 
because the RC2 instant file is in the active timeline
 # Some more instants are added to the timeline. RC2 is now eligible to be 
cleaned (as per the table writers' clean policy). Assume, though, that the file 
groups replaced by RC2 haven't been deleted yet, for example because clean 
keeps failing, async clean has not been scheduled yet, or the clean failed to 
delete said file groups.
 # An archive job eventually is triggered, and archives C1 and RC2. Note that 
f1 is still in partition
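
The steps above can be modelled with a toy sketch. This is not Hudi code: the 
`Instant` class and `visible_file_groups` function are hypothetical names, and 
the real reader logic is more involved. It only illustrates, under the 
assumption that a reader hides replaced file groups based on the instants it 
can see in the active timeline, why f1 reappears once RC2 is archived.

```python
from dataclasses import dataclass, field


@dataclass
class Instant:
    """Hypothetical, simplified stand-in for a timeline instant."""
    time: str
    created: set = field(default_factory=set)   # file groups this instant created
    replaced: set = field(default_factory=set)  # file groups this instant replaced


def visible_file_groups(files_on_storage, active_timeline):
    """Toy reader: hide file groups that an *active* instant marks as replaced."""
    replaced = set()
    for instant in active_timeline:
        replaced |= instant.replaced
    return set(files_on_storage) - replaced


c1 = Instant("C1", created={"f1"})
rc2 = Instant("RC2", created={"f2"}, replaced={"f1"})

# Clean has not deleted f1 yet, so both file groups are still in the partition.
storage = {"f1", "f2"}

# Step 3: RC2 is in the active timeline, so f1 is ignored.
before_archival = visible_file_groups(storage, [c1, rc2])

# Step 5: C1 and RC2 were archived; nothing tells the reader that f1 is stale.
after_archival = visible_file_groups(storage, [])
```

In this model `before_archival` contains only f2, while `after_archival` 
contains both f1 and f2, which is the inconsistency described in this issue.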

Now the table has the same consistency issue as seen in 
https://issues.apache.org/jira/browse/HUDI-7655 , where replaced file groups 
are still in the partition and readers may see inconsistent data.

 

This situation can be avoided by ensuring that archival will "block" and not go 
past an older instant time if it sees that said instant hasn't undergone a 
clean yet.
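
A minimal sketch of the proposed "blocking" behaviour, assuming archival walks 
the timeline from oldest to newest and that some bookkeeping (here the 
hypothetical `pending_clean` set) records which instants still have replaced 
file groups or older file versions on storage. The function name and inputs 
are illustrative only and do not correspond to an existing Hudi API; other 
archival retention settings are ignored.

```python
def archivable_instants(timeline, pending_clean):
    """Return the oldest-first prefix of instants that is safe to archive.

    timeline:      instant times ordered oldest first (hypothetical input).
    pending_clean: instant times whose replaced file groups or older file
                   versions have not been deleted by a clean yet.
    """
    safe = []
    for instant_time in timeline:
        if instant_time in pending_clean:
            # Block here: archiving this instant (or anything newer) could
            # expose stale file groups to readers.
            break
        safe.append(instant_time)
    return safe


timeline = ["C1", "RC2", "C3", "C4"]

# f1 (replaced by RC2) is still on storage, so archival stops before RC2.
result_blocked = archivable_instants(timeline, pending_clean={"RC2"})

# Once clean has removed f1, nothing blocks archival in this toy model.
result_clean = archivable_instants(timeline, pending_clean=set())
```

With RC2 still pending a clean, only C1 is returned as archivable; after the 
clean succeeds, the whole prefix becomes eligible.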



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
