[
https://issues.apache.org/jira/browse/HUDI-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Raymond Xu updated HUDI-2432:
-----------------------------
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18,
Hudi-Sprint-Jan-25 (was: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10,
Hudi-Sprint-Jan-18)
> Fix restore by adding a requested instant and restore plan
> ----------------------------------------------------------
>
> Key: HUDI-2432
> URL: https://issues.apache.org/jira/browse/HUDI-2432
> Project: Apache Hudi
> Issue Type: Task
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Fix restore by adding a requested instant and restore plan
>
> Trying to see if we really need a plan. Dumping my thoughts here.
> Restore internally converts to N rollbacks. We fetch active instants in
> reverse order from the timeline and trigger rollbacks one by one. We already
> have a patch fixing rollback to add a rollback plan in the rollback.requested
> meta file. So, walking through failure scenarios.
>
> With restore, individual rollbacks are not published to the timeline. So, if
> restore fails midway, then in the 2nd attempt only a subset of the rollbacks
> (those executed during the 2nd attempt) will be applied to the metadata
> table. So, we need a plan for restore as well.
> But with our enhancement to rollback to publish a plan, rollback.requested
> can't be skipped and we have to publish it to the timeline. So, here is what
> will happen w/o a restore plan.
>
> start restore
>   rollback commit N
>     rollback.requested for commit N // plan
>     execute rollback, but do not publish to timeline. So this will not
>     get applied to the metadata table.
>   rollback commit N-1
>     rollback.requested for commit N-1 // plan
>     execute rollback, but do not publish to timeline. Again, this will not
>     get applied to the metadata table.
>   .
> commit restore and publish. This will get applied to the metadata table.
> Once we are done committing restore, we can remove all rollback.requested
> files if needed.
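
The plan-less flow above can be sketched as a toy simulation to show what a retry sees after a mid-restore crash. Everything here (`Timeline`, `restore_without_plan`, the `fail_after` parameter, the string instant times) is a hypothetical in-memory model chosen for illustration, not Hudi's API.

```python
class Timeline:
    """Toy in-memory stand-in for a table timeline (hypothetical, not Hudi's API)."""

    def __init__(self, commits):
        self.active_commits = list(commits)  # oldest first
        self.requested_rollbacks = []        # rollback.requested files written so far
        self.metadata_table_updates = []     # rollbacks the metadata table has seen


def restore_without_plan(timeline, restore_to, fail_after=None):
    """Roll back every commit newer than `restore_to`, newest first.

    Individual rollbacks are never published; only the final restore commit
    is applied to the metadata table. `fail_after` simulates a crash after
    that many rollbacks have been executed.
    """
    to_roll_back = [c for c in reversed(timeline.active_commits) if c > restore_to]
    done = []
    for commit in to_roll_back:
        if fail_after is not None and len(done) == fail_after:
            raise RuntimeError("simulated crash mid-restore")
        timeline.requested_rollbacks.append(commit)  # rollback.requested (the plan)
        timeline.active_commits.remove(commit)       # execute, but do not publish
        done.append(commit)
    # Commit the restore: only now does the metadata table learn of the rollbacks.
    timeline.metadata_table_updates.extend(done)
    return done


timeline = Timeline(["001", "002", "003", "004", "005"])
try:
    restore_without_plan(timeline, restore_to="001", fail_after=2)
except RuntimeError:
    pass  # the first attempt crashed after rolling back 005 and 004

# The retry only finds 003 and 002 on the active timeline, so the rollbacks of
# 005 and 004 never reach the metadata table.
second_attempt = restore_without_plan(timeline, restore_to="001")
```

In this model the metadata table ends up missing the two rollbacks from the first attempt, which is the gap a restore plan is meant to close.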
>
> Failure scenarios:
> Suppose we fail after 2 rollbacks.
> On re-attempt, we will process only the remaining commits, since the active
> timeline may not report commit N and commit N-1 as active. So, we can do
> something like below w/ a restore plan.
>
> 1. start restore
> 2. schedule rollback for all of them.
> serialize all commit instants that need to be rolled back along with
> the rollback plan. // by now, we would have created a rollback.requested meta
> file for all commits that need to be rolled back.
> 3. now execute the rollbacks one by one. // do not publish to timeline once
> done. Also, changes should not be applied to the metadata table.
> 4. collect rollback commit metadata from all individual rollbacks and create
> the restore commit metadata. There could be some commits which were already
> rolled back, and for those, we need to manually create rollback metadata based
> on the rollback plan. More details in the next para. Commit the restore and
> publish. Only this will get applied to the metadata table (which in turn will
> unwrap the individual rollback metadata and apply it to the metadata table).
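
The four steps above can be sketched roughly as follows. This is a minimal in-memory model under stated assumptions: `Table`, `schedule_restore`, `execute_restore`, and the shape of the plan and metadata dictionaries are all hypothetical and do not correspond to Hudi classes or file formats.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy in-memory table state (hypothetical names, not Hudi's API)."""
    active_commits: list
    restore_plan: list = field(default_factory=list)        # instants to roll back
    rollback_requested: dict = field(default_factory=dict)  # instant -> rollback plan
    metadata_table_updates: list = field(default_factory=list)


def schedule_restore(table, restore_to):
    """Step 2: write a rollback.requested plan per instant, then the restore plan."""
    instants = [c for c in reversed(table.active_commits) if c > restore_to]
    for instant in instants:
        # Assumed plan shape: the files this rollback would delete.
        table.rollback_requested[instant] = {"files": [f"{instant}.parquet"]}
    table.restore_plan = instants


def execute_restore(table):
    """Steps 3 and 4: run each rollback unpublished, then commit one restore."""
    collected = []
    for instant in table.restore_plan:
        plan = table.rollback_requested[instant]
        table.active_commits.remove(instant)  # execute; nothing is published
        collected.append({"instant": instant, "deleted": plan["files"]})
    # Only the restore commit reaches the metadata table, which unwraps the
    # individual rollback metadata it carries.
    table.metadata_table_updates.extend(collected)
    return collected


table = Table(active_commits=["001", "002", "003"])
schedule_restore(table, restore_to="001")
restore_metadata = execute_restore(table)
```

Scheduling every rollback before executing any of them is what lets a later attempt know the full set of instants, even after some have disappeared from the active timeline.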
>
> Failures:
> If we fail after the 2nd rollback:
> on the 2nd attempt, we will look at the restore plan for all commits that need
> to be rolled back. We can't really roll back the first 2 since they are
> already rolled back. And so, we will manually create rollback metadata from
> the rollback.requested meta file. For the rest, we will follow the regular
> flow of executing the actual rollback and collecting rollback metadata. Once
> complete, we will serialize all this info in the restore metadata, which gets
> applied to the metadata table.
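
The retry handling described above might look like the sketch below. It assumes the rollback.requested plan carries enough information to rebuild rollback metadata without re-executing the rollback; `resume_restore`, the `source` field, and the dictionary shapes are hypothetical names introduced only for illustration.

```python
def resume_restore(restore_plan, rollback_requested, active_commits):
    """Re-attempt a restore from its plan (hypothetical sketch, not Hudi's API).

    Instants still on the active timeline are rolled back normally. Instants
    that a previous attempt already rolled back get their rollback metadata
    rebuilt from the rollback.requested plan instead.
    """
    remaining = list(active_commits)
    restore_metadata = []
    for instant in restore_plan:
        plan = rollback_requested[instant]
        if instant in remaining:
            remaining.remove(instant)     # regular flow: execute the rollback
            source = "executed"
        else:
            source = "rebuilt-from-plan"  # already rolled back in an earlier attempt
        restore_metadata.append(
            {"instant": instant, "deleted": plan["files"], "source": source}
        )
    return restore_metadata, remaining


plan = ["005", "004", "003", "002"]
requested = {i: {"files": [f"{i}.parquet"]} for i in plan}
# A first attempt already rolled back 005 and 004 before failing.
metadata, remaining = resume_restore(plan, requested, ["001", "002", "003"])
```

Because the restore metadata covers all four instants regardless of which attempt executed them, the metadata table receives the complete set when the restore is committed.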
>
> Alternatives: Since restore is anyway a destructive operation and it is
> advised to stop all processes, we do have the option to clean up the metadata
> table and rebootstrap it completely once restore is complete.
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)