[
https://issues.apache.org/jira/browse/HUDI-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan updated HUDI-2432:
--------------------------------------
Description:
Fix restore by adding a requested instant and restore plan
Trying to see if we really need a plan. Dumping my thoughts here.
Restore internally converts to N rollbacks. We fetch active instants in
reverse order from the timeline and trigger rollbacks one by one. We already
have a patch fixing rollback to add a rollback plan in the rollback.requested
meta file. So, walking through failure scenarios.
With restore, individual rollbacks are not published to the timeline. So, if
restore fails midway, then on the 2nd attempt only the subset of rollbacks
executed during that attempt will be applied to the metadata table. So, we
need a plan for restore as well.
But with our enhancement to rollback to publish a plan, rollback.requested
can't be skipped and has to be published to the timeline. So, here is what
will happen w/o a restore plan.
start restore
  rollback commit N
    rollback.requested for commit N // plan
    execute rollback, but do not publish to timeline, so this will not
    get applied to metadata table.
  rollback commit N-1
    rollback.requested for commit N-1 // plan
    execute rollback, but do not publish to timeline. Again, this will not
    get applied to metadata table.
  .
  commit restore and publish. This will get applied to metadata table.
Once we are done committing restore, we can remove all rollback.requested files
if needed.
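The flow above, and why a crash midway is a problem, can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the function, the `Crash` exception, and the list standing in for the active timeline are all hypothetical and are not Hudi APIs.

```python
# Hypothetical in-memory model; none of these names are Hudi APIs.
class Crash(Exception):
    pass


def restore_without_plan(active_commits, savepoint, crash_after=None):
    """Roll back every commit newer than `savepoint`, newest first.

    Individual rollbacks are not published; only the returned list, which
    stands in for the restore commit metadata, would reach the metadata table.
    """
    to_rollback = sorted((c for c in active_commits if c > savepoint), reverse=True)
    rolled_back = []
    for done, commit in enumerate(to_rollback):
        if crash_after is not None and done == crash_after:
            raise Crash(f"failed after {done} rollbacks")
        active_commits.remove(commit)  # rollback executed, but not published
        rolled_back.append(commit)
    return rolled_back


timeline = ["c1", "c2", "c3", "c4", "c5"]
try:
    restore_without_plan(timeline, "c1", crash_after=2)  # c5 and c4 get rolled back
except Crash:
    pass
# The retry only sees what the timeline still reports as active.
second_attempt = restore_without_plan(timeline, "c1")
```

In this model the second attempt only knows about c3 and c2, so the restore metadata never mentions c5 and c4.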
Failure scenarios:
If we fail after 2 rollbacks:
on re-attempt, we will process only the remaining commits, since the active
timeline may no longer report commit N and commit N-1 as active. So, we can do
something like below w/ a restore plan.
start restore
  schedule rollback for all of them.
  serialize all commit instants that need to be rolled back along with the
  rollback plan. // by now, we would have created a rollback.requested meta
  file for all commits that need to be rolled back.
  now execute the rollbacks one by one. // do not publish to timeline once
  done. also, changes should not be applied to metadata table.
  collect rollback commit metadata from all individual rollbacks and create the
  restore commit metadata. There could be some commits which were already
  rolled back, and for those, we need to manually create rollback metadata
  based on the rollback plan. More details under Failures below.
  commit restore and publish. This will get applied to metadata table.
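A minimal sketch of the plan-based flow, assuming a dict of commit -> files stands in for the table and a list of dicts stands in for the serialized restore plan. The names `schedule_restore` and `execute_restore` and the plan layout are made up for illustration; they are not the actual patch.

```python
# Hypothetical model: dicts stand in for the timeline and the file listing.
def schedule_restore(files_by_commit, savepoint):
    """Build the restore plan: one rollback plan per commit, newest first."""
    commits = sorted((c for c in files_by_commit if c > savepoint), reverse=True)
    # Each entry plays the role of one rollback.requested meta file.
    return [{"commit": c, "files_to_delete": sorted(files_by_commit[c])}
            for c in commits]


def execute_restore(files_by_commit, plan):
    """Run each planned rollback without publishing it, then build restore metadata."""
    rollback_metadata = []
    for step in plan:
        files_by_commit.pop(step["commit"], None)  # delete the commit's files
        rollback_metadata.append(
            {"commit": step["commit"], "deleted": step["files_to_delete"]})
    # Only this combined record is committed and applied to the metadata table.
    return {"action": "restore", "rollbacks": rollback_metadata}


table = {"c1": {"f1"}, "c2": {"f2"}, "c3": {"f3a", "f3b"}}
plan = schedule_restore(table, "c1")
restore_metadata = execute_restore(table, plan)
```

The point of the split is that the plan is fixed before any rollback runs, so the list of commits no longer depends on what the timeline reports at retry time.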
Failures:
if we fail after the 2nd rollback:
on the 2nd attempt, we will look at the restore plan for all commits that need
to be rolled back. We can't really roll back the first 2 since they are already
rolled back. And so, we will manually create rollback metadata for them from
the rollback.requested meta file. For the rest, we will follow the regular flow
of executing the actual rollback and collecting rollback metadata. Once
complete, we will serialize all this info in the restore metadata, which gets
applied to the metadata table.
Alternatives: since restore is a destructive operation anyway and users are
advised to stop all other processes while it runs, we do have the option to
clean up the metadata table and re-bootstrap it completely once restore is
complete.
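The retry path described above could look roughly like this. It is a sketch under assumptions: the plan is modeled as (commit, files) pairs, "already rolled back" is approximated as "commit no longer present on storage", and `resume_restore` is a hypothetical name.

```python
# Hypothetical model; the plan format and helper name are illustrative only.
def resume_restore(plan, files_on_storage):
    """Re-run a restore from its saved plan after a crash.

    `plan` is a list of (commit, files_to_delete) pairs, newest first.
    `files_on_storage` maps each still-present commit to its files.
    """
    rollback_metadata = []
    for commit, planned_files in plan:
        if commit in files_on_storage:
            # Regular flow: execute the rollback and record what it removed.
            del files_on_storage[commit]
            source = "executed"
        else:
            # Already rolled back in an earlier attempt: rebuild the
            # metadata from the plan instead of rolling back again.
            source = "from_plan"
        rollback_metadata.append(
            {"commit": commit, "deleted": list(planned_files), "source": source})
    return rollback_metadata


plan = [("c5", ["f5"]), ("c4", ["f4"]), ("c3", ["f3"]), ("c2", ["f2"])]
storage = {"c1": ["f1"], "c2": ["f2"], "c3": ["f3"]}  # c5 and c4 gone after the crash
metadata = resume_restore(plan, storage)
```

Either way, every commit in the plan ends up in the restore metadata, whichever attempt actually rolled it back.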
was:
Fix restore by adding a requested instant and restore plan
Trying to see if we really need a plan. Dumping my thoughts here.
Restore internally converts to N rollbacks. We fetch active instants in
reverse order from the timeline and trigger rollbacks one by one. We already
have a patch fixing rollback to add a rollback plan in the rollback.requested
meta file. So, walking through failure scenarios.
If 5 instants need to be rolled back, but the process crashed after 3 rollbacks.
* When we retry restore a 2nd time, only the pending 2 will be returned from
the timeline as instants that need to be rolled back. And so we will roll back
the remaining 2 commits/instants. The only missing piece is that the list of
rollback metadata serialized as part of the restore commit metadata might miss
the first 3 commits. Anyway, restore is a destructive operation, so not sure
if leaving the already rolled-back commits out of the restore commit metadata
will cause any issues.
** Metadata table: the first 3 would have been rolled back in the metadata
table as well (applied as upsert), and so should be fine when we retrigger the
restore. The remaining 2 will get applied.
* If there was a crash while a rollback was inflight.
** Let's say rollback of C3 failed while in progress. When we re-attempt
restore, we will try to roll back C3 again. With the fix for rollback plan
in place, we should be good, as we will continue the rollback and get it to
completion, and then go on to roll back C2 and C1.
** Metadata table: the first time, since the rollback of C3 failed while
inflight, there won't be any trace of it in the metadata table. But when we
retry a 2nd time, it should get applied to the metadata table. The rollback
plan fix should ensure the rollback commit metadata has all file info from the
original plan and not just the successfully deleted ones, because in this case
only the pending files will be deleted during the 2nd attempt.
** If by chance one of the rollbacks gets committed to the metadata table and
fails before getting committed to the data table: the 2nd rollback of the same
instant is yet another delta commit to the metadata table, and so we should be
good there too. We might end up instructing the metadata table to delete the
same files repeatedly.
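To illustrate the point about reporting file info from the original plan rather than only the files deleted in the current attempt, here is a small sketch. `run_rollback` and its return shape are hypothetical, and a Python set stands in for the file system.

```python
# Hypothetical model: a set of file names stands in for storage.
def run_rollback(plan_files, storage):
    """Delete whichever planned files still exist; report the full plan."""
    deleted_now = [f for f in plan_files if f in storage]
    for f in deleted_now:
        storage.remove(f)
    # Reporting every planned file means the metadata table learns about
    # files that were already deleted by an earlier, crashed attempt.
    return {"deleted_this_attempt": deleted_now, "reported": list(plan_files)}


plan_files = ["f1", "f2", "f3"]
storage = {"f2", "f3", "keep"}  # f1 was deleted before the first attempt crashed
result = run_rollback(plan_files, storage)
```

If only `deleted_this_attempt` were recorded, f1 would never be reported as removed.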
Update:
I didn't realize that individual rollbacks are not published to the timeline
as part of restore. So, if restore fails midway, then on the 2nd attempt only
the subset of rollbacks executed during that attempt will be applied to the
metadata table. So, we need a plan for restore as well. But some details on
how to go about this still need working out, because on a re-attempt, for the
commits which were rolled back during the 1st attempt, we should skip rolling
back again, but fetch the commit metadata from the rollback.completed file and
add it to the restore metadata.
Alternatives: since restore is a destructive operation anyway and users are
advised to stop all other processes while it runs, we do have the option to
clean up the metadata table and re-bootstrap it completely once restore is
complete.
> Fix restore by adding a requested instant and restore plan
> ----------------------------------------------------------
>
> Key: HUDI-2432
> URL: https://issues.apache.org/jira/browse/HUDI-2432
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
> Fix For: 0.10.0
>
>
>