[
https://issues.apache.org/jira/browse/HUDI-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450764#comment-17450764
]
Prashant Wason commented on HUDI-2458:
--------------------------------------
>> Archival in data table does not have any dependency on metadata table
>> compaction.
With the sync design, we validate the deltacommits to read
(https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java#L293)
for the metadata table based on which completed commits are present in the
dataset. So the dataset should never archive instants before they are
compacted on the metadata table.
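
To illustrate the concern, here is a minimal sketch of that kind of validation. It is not the code at the link above: the class name, the method name, and the use of plain instant-time strings are assumptions made for the example. It only shows why a metadata deltacommit stops being readable once its data table instant is no longer in the completed set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Hudi's HoodieBackedTableMetadata implementation.
class DeltaCommitValidationSketch {

    // Keeps only the metadata deltacommits whose instant time is also a
    // completed instant on the data table timeline.
    static List<String> validInstantsToRead(List<String> metadataDeltaCommits,
                                            Set<String> completedDataInstants) {
        List<String> valid = new ArrayList<>();
        for (String instant : metadataDeltaCommits) {
            if (completedDataInstants.contains(instant)) {
                valid.add(instant);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        // Assumed scenario: instant "001" was archived on the data table before
        // the metadata table compacted it, so it is missing from the completed set.
        List<String> deltaCommits = List.of("001", "002", "003");
        Set<String> completed = Set.of("002", "003");
        System.out.println(validInstantsToRead(deltaCommits, completed));
    }
}
```

In this sketch the deltacommit for "001" is dropped from the read even though it was never compacted, which is the situation the archival fence is meant to prevent.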
> Relax compaction in metadata being fenced based on inflight requests in data
> table
> ----------------------------------------------------------------------------------
>
> Key: HUDI-2458
> URL: https://issues.apache.org/jira/browse/HUDI-2458
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
> Fix For: 0.11.0
>
>
> Relax compaction in metadata being fenced based on inflight requests in data
> table.
> Compaction in metadata is triggered only if there are no inflight requests in
> data table. This might cause a liveness problem since, for very large
> deployments, we could have either compaction or clustering always in
> progress. So, we should try to see how we can relax this constraint.
>
> Proposal to remove this dependency:
> With the recent addition of the spurious deletes config, we can actually get
> away with this.
> As of now, we have 3 interlinked nuances.
> - Compaction in metadata may not kick in if there are any inflight
> operations in data table.
> - Rollback when being applied to metadata table has a dependency on last
> compaction instant in metadata table. We might even throw an exception if
> the instant being rolled back is < latest metadata compaction instant time.
> - Archival in data table is fenced by latest compaction in metadata table.
>
> So, in case the data timeline has any dangling inflight operation (let's say
> someone tried clustering, killed it midway, and never attempted it again),
> metadata compaction will never kick in at all. I need to check what
> archival does for such inflight operations in data table when it
> tries to archive nearby commits.
>
> So, with spurious deletes support which we added recently, all these can be
> much simplified.
> Whenever we want to apply a rollback commit, we don't need to take different
> actions based on whether the commit being rolled back is already committed to
> metadata table or not. Just go ahead and apply the rollback. Merging of
> metadata payload records will take care of this. If the commit was already
> synced, the final merged payload may not have spurious deletes. If the commit
> being rolled back was never committed to metadata, the final merged payload
> may have some spurious deletes, which we can ignore.
> With this, compaction in metadata does not need to have any dependency on
> inflight operations in data table.
> And we can loosen up the dependency of archival in data table on metadata
> table compaction as well.
> So, in summary, all the 3 dependencies quoted above will be moot if we go
> with this approach. Archival in data table does not have any dependency on
> metadata table compaction. Rollback when being applied to metadata table does
> not care about last metadata table compaction. Compaction in metadata table
> can proceed even if there are inflight operations in data table.
>
> In particular, our logic to apply rollback metadata to metadata table will
> become a lot simpler and easier to reason about.
>
>
>
>
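
The rollback argument in the description can be sketched roughly as follows. This is a simplified model, not Hudi's metadata payload merging: the class `SpuriousDeleteMergeSketch`, the nested `Update` type, and the file names are all hypothetical. It assumes a file listing is built by folding updates in order, and that a delete for a file that was never added is simply ignored.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of merging file-listing updates while tolerating spurious deletes.
class SpuriousDeleteMergeSketch {

    // One update to a partition's file listing (a commit or a rollback).
    static final class Update {
        final List<String> addedFiles;
        final List<String> deletedFiles;

        Update(List<String> addedFiles, List<String> deletedFiles) {
            this.addedFiles = addedFiles;
            this.deletedFiles = deletedFiles;
        }
    }

    // Folds the updates in order and returns the resulting set of live files.
    static Set<String> merge(List<Update> updates) {
        Set<String> live = new TreeSet<>();
        for (Update u : updates) {
            live.addAll(u.addedFiles);
            for (String f : u.deletedFiles) {
                // remove() is a no-op for a file that was never added,
                // which is how a spurious delete gets ignored here.
                live.remove(f);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        Update c1 = new Update(List.of("f1"), List.of());
        Update c2 = new Update(List.of("f2"), List.of());
        Update rollbackC2 = new Update(List.of(), List.of("f2"));

        // Case A: c2 reached the metadata table before it was rolled back.
        System.out.println(merge(List.of(c1, c2, rollbackC2)));
        // Case B: c2 never reached the metadata table; the delete is spurious.
        System.out.println(merge(List.of(c1, rollbackC2)));
    }
}
```

Under these assumptions both cases end with the same listing, so the rollback can be applied without first checking whether the rolled-back commit was ever synced.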
--
This message was sent by Atlassian Jira
(v8.20.1#820001)