[
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849672#comment-17849672
]
Danny Chen edited comment on HUDI-7779 at 7/19/24 10:41 AM:
------------------------------------------------------------
-I kind of feel the deletion of a savepoint should also trigger file clean
immedinately, right after the removing of the commit metadata. That would
simplify the mess a lot. And these 2 steps should belong to one trasanction.-
was (Author: danny0405):
I kind of feel the deletion of a savepoint should also trigger file clean
immedinately, right after the removing of the commit metadata. That would
simplify the mess a lot. And these 2 steps should belong to one trasanction.
> Guarding archival to not archive unintended commits
> ---------------------------------------------------
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
> Issue Type: Bug
> Components: archiving
> Reporter: sivabalan narayanan
> Assignee: Lokesh Jain
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> Archiving commits from active timeline could lead to data consistency issues
> on some rarest of occasions. We should come up with proper guards to ensure
> we do not make such unintended archival.
>
> Major gap which we wanted to guard is:
> if someone disabled cleaner, archival should account for data consistency
> issues and ensure it bails out.
> We have a base guarding condition, where archival will stop at the earliest
> commit to retain based on latest clean commit metadata. But there are few
> other scenarios that needs to be accounted for.
>
> a. Keeping aside replace commits, lets dive into specifics for regular
> commits and delta commits.
> Say user configured clean commits to 4 and archival configs to 5 and 6. after
> t10, cleaner is supposed to clean up all file versions created at or before
> t6. Say cleaner did not run(for whatever reason for next 5 commits).
> Archival will certainly be guarded until earliest commit to retain based
> on latest clean commits.
> Corner case to consider:
> A savepoint was added to say t3 and later removed. and still the cleaner was
> never re-enabled. Even though archival would have been stopped at t3 (until
> savepoint is present),but once savepoint is removed, if archival is executed,
> it could archive commit t3. Which means, file versions tracked at t3 is still
> not yet cleaned by the cleaner.
> Reasoning:
> We are good here wrt data consistency. Up until cleaner runs next time, this
> older file versions might be exposed to the end-user. But time travel query
> is not intended for already cleaned up commits and hence this is not an
> issue. None of snapshot, time travel query or incremental query will run into
> issues as they are not supposed to poll for t3.
> At any later point, if cleaner is re-enabled, it will take care of cleaning
> up file versions tracked at t3 commit. Just that for interim period, some
> older file versions might still be exposed to readers.
>
> b. The more tricky part is when replace commits are involved. Since replace
> commit metadata in active timeline is what ensures the replaced file groups
> are ignored for reads, before archiving the same, cleaner is expected to
> clean them up fully. But are there chances that this could go wrong?
> Corner case to consider. Lets add onto above scenario, where t3 has a
> savepoint, and t4 is a replace commit which replaced file groups tracked in
> t3.
> Cleaner will skip cleaning up files tracked by t3(due to the presence of
> savepoint), but will clean up t4, t5 and t6. So, earliest commit to retain
> will be pointing to t6. And say savepoint for t3 is removed, but cleaner was
> disabled. In this state of the timeline, if archival is executed, (since
> t3.savepoint is removed), archival might archive t3 and t4.rc. This could
> lead to data duplicates as both replaced file groups and new file groups from
> t4.rc would be exposed as valid file groups.
>
> In other words, if we were to summarize the different scenarios:
> i. replaced file group is never cleaned up.
> - ECTR(Earliest commit to retain) is less than this.rc and we are good.
> ii. replaced file group is cleaned up.
> - ECTR is > this.rc and is good to archive.
> iii. tricky: ECTR moved ahead compared to this.rc, but due to savepoint, full
> clean up did not happen. After savepoint is removed, and when archival is
> executed, we should avoid archiving the rc of interest. This is the gap we
> don't account for as of now.
>
> We have 3 options to go about to solve this.
> Option A:
> Let Savepoint deletion flow take care of cleaning up the files its tracking.
> cons:
> Savepoint's responsibility is not removing any data files. So, from a single
> user responsibility rule, this may not be right. Also, this clean up might
> need to do what a clean planner might actually be doing. ie. build file
> system view, understand if its supposed to be cleaned up already, and then
> only clean up the files which are supposed to be cleaned up. For eg, if a
> file group has only one file slice, it should not be cleaned up and scenarios
> like this.
>
> Option B:
> Since archival is the one which might cause data consistency issues, why not
> archival do the clean up.
> We need to account for concurrent cleans, failure and retry scenarios etc.
> Also, we might need to build the file system view and then take a call
> whether something needs to be cleaned up before archiving something.
> Cons:
> Again, the single user responsibility rule might be broken. Would be neat if
> cleaner takes care of deleting data files and archival only takes care of
> deleting/archiving timeline files.
>
> Option C:
> Similar to how cleaner maintain EarliestCommitToRetain, let cleaner track
> another metadata named "EarliestCommitToArchive". Strictly speaking, earliest
> commit to retain is meant for incremental cleaner. Archival polls earliest
> commit to retain and additionally savepointed instants to add guard rails.
> So, why not cleaner do that work and track "EarliestCommitToArchive".
> Essentially the value will be either "EarliestCommitToRetain" or earliest
> savepointed commit earlier than EarliestCommitToRetain.
> By this way, archival does not need to do any additional polling (savepoint
> timeline for instance). It can just guard itself against
> "EarliestCommitToArchive".
>
> Impl nuances:
> Cleaner has to track savepointed instants as well to assist in deducing the
> "EarliestCommitToArchive".
>
> Pros:
> Archival does not need to build FSV and instead rely on clean commit
> metadata.
>
> Irrespective of any approach we take, also thinking if we should unify the
> enablement of archival and cleaner together.
> For eg, one cannot run just archival after disabling cleaner. Either both
> cleaner and archival are enabled or both are disabled. By this way, we do not
> let archival run w/o a clean. This atleast reduces the chances of such data
> consistencies, but still one of the above approach has to be fixed. Logically
> speaking, cleaner and archival should go hand in hand. There is no work for
> archival if cleaner has not run. Only if cleaner moved the "earliest commit
> to retain" (and earliest commit to archival), archival will have something to
> do. If not, very likely its a no-op. so, why not have one config to enable to
> disable both.
>
> More details on how cleaner can track or deduce "EarliestCommitToArchive"
> (Option C):
> Clean planner will be tracking the current savepointed timestamps as part of
> every plan. Also, it will find the difference b/w previously tracked
> savepointed timestamp list and ensure it triggers clean and updates
> "EarliestCommitToArchive" accordingly.
>
> Example illustration:
> Say this is the timeline.
> t20(starting of active timeline) ----- t30 (earliest commit to retain)
> ---- t35(latest)
>
> User added savepoint to t32. and say there are 5 more commits.
> next time when cleaner runs:
> t20(starting of active timeline) –- t32 (sp) – t35 (earliest commit to
> retain) ---- t40(latest)
> But cleaner also tracks that "t32" as savepointed timestamp list.
> And sets "EarliestCommitToArchive" to "t32".
> After archival
> t32 (sp) – t35 (earliest commit to retain) ---- t40(latest)
>
> And if the user deletes the savepoint of interest:
> t32(first entry in active timeline) – t35 (earliest commit to retain) ----
> t40(latest)
> and 4 new commits are added.
> t32(first entry in active timeline) – t35 (earliest commit to retain) ----
> t44(latest)
> Cleaner runs:
> Clean planner will poll current timeline for savepointed commits and compare
> it w/ previously tracked list.
> And so it deduces that t32 savepoint has been removed. If archival has not be
> run, t32.deltacommit will be activetimeline for sure. And so clean planner
> can deduce the partitions for incremental cleaner.
> Computes the "EarliestCommittoRetain" to say "t39".
> Updates "EarliestCommitToArchive" to also "t39" since there are no savepoints
> and archival is good to archive every commit up until t39.
>
> *Consensus:*
> We will go w/ Option C.
>
> Re-iterating the proposed fix:
> *Cleaner:*
> a. Cleaner will track savepointed timestamps during every plan. Compared it
> with previous clean plan and deduce if any savepoint is removed. If any
> removal is deduced, clean planning will happen for all partitions.
> b. Cleaner will also track additional metadata called
> "EarliestCommitToNotArchive". Whose value will either refer to first
> savepoint if first savepoint < earliest commit to retain. Or it will refer to
> "earliestCommitToRetain".
> *Archival:*
> Archival will guard itself against "EarliestCommitToNotArchive" from latest
> completed clean.
>
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)