[ 
https://issues.apache.org/jira/browse/HUDI-7490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7490:
--------------------------------------
    Description: 
We recently added a fix so that the cleaner also cleans up savepointed files
without failing:

[https://github.com/apache/hudi/pull/10651] 

Scenario the above patch fixes:

Incremental cleaning is enabled by default. During planning, the cleaner only
accounts for partitions touched by recent commits (those after the earliest
commit to retain from the last completed clean).

So, if a savepoint is added and later removed, the cleaner might miss cleaning
the files it had previously skipped. The above patch fixed that gap.

Fix: the clean commit metadata now tracks savepointed commits. The next time
the clean planner runs, it compares the tracked savepointed commits against the
current savepoints on the timeline; if there is a difference, the cleaner
accounts for the partitions touched by the removed savepointed commits.
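The detection step above boils down to a set difference. A minimal sketch, with
hypothetical class and method names (not actual Hudi APIs):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: names here are hypothetical, not Hudi's real APIs.
// The clean planner compares savepoints tracked in the last clean commit
// metadata against the savepoints currently on the timeline; any instant in
// the first set but not the second is a savepoint removed since the last
// clean, so its partitions must be re-included in clean planning.
public class SavepointDiffSketch {
    public static Set<String> removedSavepoints(Set<String> trackedInLastClean,
                                                Set<String> currentOnTimeline) {
        Set<String> removed = new HashSet<>(trackedInLastClean);
        removed.removeAll(currentOnTimeline);
        return removed; // partitions touched by these instants need cleaning
    }
}
```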

 

 

But we might have a gap with respect to archival.

If we ensure archival always runs immediately after cleaning, and never
independently, we should be fine.

But if archival can run independently, we could expose duplicate data to
readers in the scenario below.

 

Let's say we have a savepoint at t5.commit. The cleaner therefore skipped
deleting the files created at t5 and moved past it. Now say we have a replace
commit at t10 that replaced all the data files created at t5.

In this state, say we remove the savepoint.

The data files created by t5.commit are still in the data directory.

As long as t10 is in the active timeline, readers will only see the files
written by t10 and will ignore the files written by t5.

At this juncture, if archival runs (without the cleaner running first), it
might archive t5 through t10, in which case the data files written by both t5
and t10 will be exposed to readers.

In the most common deployment models, where we recommend stopping the pipeline
while doing a savepoint-and-restore or deleting a savepoint, this may be
uncommon, but there is still a chance it could happen.

 

So, we have to guard archival in this case.

Essentially, before archiving a replace commit, we need to ensure that the
file IDs it replaced have already been cleaned by the cleaner.
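That invariant can be sketched as follows; the class, method, and parameter
names are hypothetical, not Hudi's actual archival code:

```java
import java.util.Collections;
import java.util.Set;

// Illustrative sketch only; names are hypothetical, not real Hudi APIs.
// Before archiving a replace commit, verify the cleaner has already removed
// every file group that the replace commit superseded.
public class ReplaceCommitArchivalCheck {
    public static boolean safeToArchive(Set<String> replacedFileIds,
                                        Set<String> fileIdsStillOnStorage) {
        // Safe only when no replaced file id is still present on storage.
        return Collections.disjoint(replacedFileIds, fileIdsStillOnStorage);
    }
}
```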

 

Probable fix:

We can follow an approach similar to the one in
[https://github.com/apache/hudi/pull/10651].

Essentially, check the list of savepoints in the current timeline and compare
it with the savepointed instants recorded in the latest clean commit metadata.
If they match, we do not need to block archival. But if there is a difference
(meaning a savepoint was deleted from the timeline and the cleaner has not yet
had a chance to clean up), we should skip archiving anything and retry next
time.
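The proposed guard could look like the sketch below; the names are
hypothetical, and the real change would live in Hudi's archival planning:

```java
import java.util.Set;

// Illustrative sketch only: a mismatch between the savepoints recorded in the
// latest clean commit metadata and the savepoints currently on the timeline
// means a savepoint was deleted but the cleaner has not yet acted on the
// removal, so archival should be skipped this round and retried after the
// next completed clean.
public class ArchivalGuardSketch {
    public static boolean shouldBlockArchival(Set<String> savepointsInLatestCleanMetadata,
                                              Set<String> savepointsOnTimeline) {
        return !savepointsInLatestCleanMetadata.equals(savepointsOnTimeline);
    }
}
```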

 

 

 

 


> Fix archival guarding data files not yet cleaned up by cleaner when savepoint 
> is removed
> ----------------------------------------------------------------------------------------
>
>                 Key: HUDI-7490
>                 URL: https://issues.apache.org/jira/browse/HUDI-7490
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: archiving, cleaning, clustering
>            Reporter: sivabalan narayanan
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
