sivabalan narayanan created HUDI-7490:
-----------------------------------------
Summary: Fix archival guarding data files not yet cleaned up by
cleaner when savepoint is removed
Key: HUDI-7490
URL: https://issues.apache.org/jira/browse/HUDI-7490
Project: Apache Hudi
Issue Type: Bug
Components: archiving, cleaning, clustering
Reporter: sivabalan narayanan
We recently added a fix so that the cleaner will take care of cleaning up savepointed files too, without failing:
[https://github.com/apache/hudi/pull/10651]
But we might have a gap with respect to archival.
If we ensure archival runs only just after cleaning, and never independently, we
should be good.
But if archival can run independently, there is a chance we could expose
duplicate data to readers in the scenario below.
Let's say we have a savepoint at t5.commit. So the cleaner skipped deleting the
files created at t5 and went past it. And say we have a replace commit at t10
which replaced all data files that were created at t5.
With this state, say we remove the savepoint.
We will still have the data files created by t5.commit in the data directory.
As long as t10 is in the active timeline, readers will only see files written by
t10 and will ignore files written by t5.
At this juncture, if we run archival (without running the cleaner first),
archival might archive t5 to t10, in which case the data files written by both
t5 and t10 will be exposed to readers.
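The visibility argument above can be illustrated with a small sketch. This is not Hudi's actual file-system view API; the instants, file names, and the `visibleFiles` helper are all hypothetical, modeling only the rule that replaced files are hidden from readers solely while the replacing commit remains in the active timeline:

```java
import java.util.*;

// Hypothetical sketch (not Hudi's actual API): models why archiving a
// replacecommit before the cleaner runs re-exposes replaced data files.
public class ReplaceVisibilitySketch {
    // Files on storage, keyed by the (illustrative) instant that created them.
    static final Map<String, List<String>> filesByInstant = Map.of(
        "t5",  List.of("fg1_t5.parquet", "fg2_t5.parquet"),
        "t10", List.of("fg1_t10.parquet", "fg2_t10.parquet"));

    // Readers ignore files replaced by a replacecommit ONLY while that
    // replacecommit is still in the active timeline.
    static Set<String> visibleFiles(Set<String> activeTimeline) {
        Set<String> visible = new TreeSet<>();
        filesByInstant.values().forEach(visible::addAll);
        if (activeTimeline.contains("t10")) {
            // t10 replaced everything written at t5, so hide t5's files.
            visible.removeAll(filesByInstant.get("t5"));
        }
        return visible;
    }

    public static void main(String[] args) {
        // t10 still active: readers see only t10's files.
        System.out.println(visibleFiles(Set.of("t5", "t10")));
        // t5..t10 archived before the cleaner deleted t5's files:
        // both generations become visible, i.e. duplicate data.
        System.out.println(visibleFiles(Set.of()));
    }
}
```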
So we have to guard archival in this case.
Essentially, we need to ensure that, before archiving a replace commit, the
fileIds it replaced have been cleaned up by the cleaner.
Probable fix:
We can follow an approach similar to the one in
[https://github.com/apache/hudi/pull/10651].
Essentially, check the list of savepoints in the current timeline and compare it
with the savepointed instants in the latest clean commit metadata. If they
match, we do not need to block archival. But if there is a difference (which
means a savepoint was deleted from the timeline and the cleaner has not yet had
a chance to clean up), we should punt on archiving anything and come back next
time.
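The proposed check can be sketched as a simple set comparison. This is an illustrative sketch only; `shouldBlockArchival` and its inputs are hypothetical names, not Hudi's actual archival API:

```java
import java.util.*;

// Hypothetical sketch of the proposed guard: block archival when the
// savepoints in the active timeline differ from the savepointed instants
// recorded in the latest clean commit metadata, i.e. a savepoint was
// removed but the cleaner has not had a chance to act on it yet.
public class ArchivalGuardSketch {
    static boolean shouldBlockArchival(Set<String> timelineSavepoints,
                                       Set<String> savepointsInLatestClean) {
        // Equal sets mean the cleaner has already run against the current
        // savepoint state, so archival is safe to proceed.
        return !timelineSavepoints.equals(savepointsInLatestClean);
    }

    public static void main(String[] args) {
        // Savepoint at t5 present and recorded by the latest clean: proceed.
        System.out.println(shouldBlockArchival(Set.of("t5"), Set.of("t5")));
        // Savepoint at t5 removed, but the latest clean still records it:
        // punt archival and retry on the next run.
        System.out.println(shouldBlockArchival(Set.of(), Set.of("t5")));
    }
}
```

On a match archival proceeds as usual; on a mismatch the archiver simply does nothing this round, which is safe because archival is best-effort and will be retried after the next clean.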
--
This message was sent by Atlassian Jira
(v8.20.10#820010)