[
https://issues.apache.org/jira/browse/HUDI-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Mahindra updated HUDI-2458:
----------------------------------
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31 (was: Hudi-Sprint-Jan-24)
> Relax compaction in metadata being fenced based on inflight requests in data
> table
> ----------------------------------------------------------------------------------
>
> Key: HUDI-2458
> URL: https://issues.apache.org/jira/browse/HUDI-2458
> Project: Apache Hudi
> Issue Type: Task
> Reporter: sivabalan narayanan
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.11.0
>
>
> Relax the fencing of metadata table compaction on inflight requests in the
> data table.
> Compaction in the metadata table is triggered only if there are no inflight
> requests in the data table. This can cause a liveness problem: in very large
> deployments, either compaction or clustering could always be in progress.
> So, we should see how we can relax this constraint.
>
> Proposal to remove this dependency:
> With the recent addition of the spurious deletes config, we can actually get
> away with removing it.
> As of now, we have 3 interlinked nuances.
> - Compaction in the metadata table may not kick in if there are any inflight
> operations in the data table.
> - Applying a rollback to the metadata table depends on the last compaction
> instant in the metadata table. We might even throw an exception if the
> instant being rolled back is < the latest metadata compaction instant time.
> - Archival in the data table is fenced by the latest compaction in the
> metadata table.
>
> So, if the data timeline has any dangling inflight operation (let's say
> someone tried clustering, killed it midway and never attempted it again),
> metadata compaction will never kick in again. I need to check what archival
> does with such inflight operations in the data table when it tries to
> archive nearby commits.
>
> So, with the spurious deletes support which we added recently, all of this
> can be much simplified.
> Whenever we want to apply a rollback commit, we don't need to take different
> actions based on whether the commit being rolled back is already committed to
> the metadata table or not. Just go ahead and apply the rollback. Merging of
> metadata payload records will take care of this. If the commit was already
> synced, the final merged payload may not have spurious deletes. If the commit
> being rolled back was never committed to metadata, the final merged payload
> may have some spurious deletes, which we can ignore.
> With this, compaction in metadata does not need to have any dependency on
> inflight operations in the data table.
> And we can loosen up the dependency of archival in the data table on
> metadata table compaction as well.
> So, in summary, all 3 dependencies quoted above will be moot if we go with
> this approach. Archival in the data table will not depend on metadata table
> compaction. Rollback, when applied to the metadata table, will not care
> about the last metadata table compaction. Compaction in the metadata table
> can proceed even if there are inflight operations in the data table.
>
> In particular, our logic for applying rollback metadata to the metadata table
> will become a lot simpler and easier to reason about.
>
>
>
>
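The rollback handling proposed in the description above can be illustrated with a small model. This is only a sketch under stated assumptions: `PartitionFiles`, `merge` and `list_files` are hypothetical names invented for illustration and are not Hudi classes or APIs, and the merge rule is a simplified guess at how delete markers cancel against earlier adds. It shows a rollback being applied unconditionally, whether or not the rolled-back commit ever reached the metadata table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionFiles:
    """Toy metadata payload for one partition (hypothetical, not a Hudi class)."""
    added: frozenset = frozenset()    # files currently listed
    deleted: frozenset = frozenset()  # delete markers with nothing to cancel


def merge(older: PartitionFiles, newer: PartitionFiles) -> PartitionFiles:
    """Merge two payloads; a delete cancels a matching earlier add."""
    added = (older.added - newer.deleted) | newer.added
    # A delete for a file the older payload never listed stays as a marker.
    deleted = (older.deleted - newer.added) | (newer.deleted - older.added)
    return PartitionFiles(frozenset(added), frozenset(deleted))


def list_files(payload: PartitionFiles) -> list:
    """Return the file listing, ignoring any spurious delete markers."""
    return sorted(payload.added)


base = PartitionFiles(added=frozenset({"f1"}))
commit = PartitionFiles(added=frozenset({"f2"}))      # commit that wrote f2
rollback = PartitionFiles(deleted=frozenset({"f2"}))  # rollback of that commit

# Case 1: the commit was synced to metadata before the rollback was applied.
synced = merge(merge(base, commit), rollback)

# Case 2: the commit never reached metadata; the rollback is applied anyway.
unsynced = merge(base, rollback)
```

In this model both cases produce the same file listing. Only the second leaves a delete marker for `f2` behind, which is the spurious delete that the reader ignores.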
--
This message was sent by Atlassian Jira
(v8.20.1#820001)