sivabalan narayanan created HUDI-7490:
-----------------------------------------
Summary: Fix archival guarding data files not yet cleaned up by
cleaner when savepoint is removed
Key: HUDI-7490
URL: https://issues.apache.org/jira/browse/HUDI-7490
Project: Apache Hudi
Issue Type: Bug
Components: archiving, cleaning, clustering
Reporter: sivabalan narayanan
We recently added a fix so that the cleaner will take care of cleaning up savepointed files too, without failing:
[https://github.com/apache/hudi/pull/10651]
But we might have a gap with respect to archival.
If we ensure archival runs only just after cleaning, and never independently, we
should be good.
But if archival can run independently, there is a chance we could expose
duplicate data to readers in the scenario below.
Let's say we have a savepoint at t5.commit. So the cleaner skipped deleting the
files created at t5 and went past it. And say we have a replace commit at t10
which replaced all data files that were created at t5.
With this state, say we remove the savepoint.
We will still have the data files created by t5.commit in the data directory.
As long as t10 is in the active timeline, readers will only see files written by
t10 and will ignore files written by t5.
At this juncture, if we run archival (without running the cleaner first),
archival might archive t5 to t10, in which case the data files written by both
t5 and t10 will be exposed to readers.
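The visibility argument above can be illustrated with a small sketch. This is not Hudi's actual file-system view API; the instants, file names, and the `visibleFiles` helper are all hypothetical, modeling only the rule that replaced files are hidden from readers solely while the replacing commit remains in the active timeline:

```java
import java.util.*;

// Hypothetical sketch (not Hudi's actual API): models why archiving a
// replacecommit before the cleaner runs re-exposes replaced data files.
public class ReplaceVisibilitySketch {
    // Files on storage, keyed by the (illustrative) instant that created them.
    static final Map<String, List<String>> filesByInstant = Map.of(
        "t5",  List.of("fg1_t5.parquet", "fg2_t5.parquet"),
        "t10", List.of("fg1_t10.parquet", "fg2_t10.parquet"));

    // Readers ignore files replaced by a replacecommit ONLY while that
    // replacecommit is still in the active timeline.
    static Set<String> visibleFiles(Set<String> activeTimeline) {
        Set<String> visible = new TreeSet<>();
        filesByInstant.values().forEach(visible::addAll);
        if (activeTimeline.contains("t10")) {
            // t10 replaced everything written at t5, so hide t5's files.
            visible.removeAll(filesByInstant.get("t5"));
        }
        return visible;
    }

    public static void main(String[] args) {
        // t10 still active: readers see only t10's files.
        System.out.println(visibleFiles(Set.of("t5", "t10")));
        // t5..t10 archived before the cleaner deleted t5's files:
        // both generations become visible, i.e. duplicate data.
        System.out.println(visibleFiles(Set.of()));
    }
}
```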
So we have to guard archival in this case.
Essentially, we need to ensure that, before archiving a replace commit, the
fileIds it replaced have been cleaned up by the cleaner.
Probable fix:
We can follow an approach similar to the one in
[https://github.com/apache/hudi/pull/10651].
Essentially, check the list of savepoints in the current timeline and compare it
with the savepointed instants in the latest clean commit metadata. If they
match, we do not need to block archival. But if there is a difference (which
means a savepoint was deleted from the timeline and the cleaner has not yet had
a chance to clean up), we should punt on archiving anything and come back next
time.
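The proposed check can be sketched as a simple set comparison. This is an illustrative sketch only; `shouldBlockArchival` and its inputs are hypothetical names, not Hudi's actual archival API:

```java
import java.util.*;

// Hypothetical sketch of the proposed guard: block archival when the
// savepoints in the active timeline differ from the savepointed instants
// recorded in the latest clean commit metadata, i.e. a savepoint was
// removed but the cleaner has not had a chance to act on it yet.
public class ArchivalGuardSketch {
    static boolean shouldBlockArchival(Set<String> timelineSavepoints,
                                       Set<String> savepointsInLatestClean) {
        // Equal sets mean the cleaner has already run against the current
        // savepoint state, so archival is safe to proceed.
        return !timelineSavepoints.equals(savepointsInLatestClean);
    }

    public static void main(String[] args) {
        // Savepoint at t5 present and recorded by the latest clean: proceed.
        System.out.println(shouldBlockArchival(Set.of("t5"), Set.of("t5")));
        // Savepoint at t5 removed, but the latest clean still records it:
        // punt archival and retry on the next run.
        System.out.println(shouldBlockArchival(Set.of(), Set.of("t5")));
    }
}
```

On a match archival proceeds as usual; on a mismatch the archiver simply does nothing this round, which is safe because archival is best-effort and will be retried after the next clean.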
--
This message was sent by Atlassian Jira
(v8.20.10#820010)