[jira] [Updated] (HUDI-2432) Fix restore by adding a requested instant and restore plan

sivabalan narayanan (Jira) Thu, 16 Sep 2021 11:48:05 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


sivabalan narayanan updated HUDI-2432:
--------------------------------------
    Description: 
Fix restore by adding a requested instant and restore plan

 

Trying to see if we really need a plan. Dumping my thoughts here. 

Restore internally translates into N rollbacks. We fetch the active instants in 
reverse order from the timeline and trigger rollbacks one by one. We already 
have a patch that fixes rollback by adding the rollback plan to the 
rollback.requested meta file. So, walking through the failure scenarios. 
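
To make the "reverse order, one by one" part concrete, here is a minimal 
sketch in Python. It is only a toy model and not Hudi's Java API: the Instant 
class, the instants_to_roll_back helper and the assumption that instant times 
sort lexicographically are all made up for illustration. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instant:
    timestamp: str  # instant time, assumed here to sort lexicographically
    action: str     # e.g. "commit"


def instants_to_roll_back(active_instants, restore_to):
    """Return the instants newer than `restore_to`, newest first."""
    newer = [i for i in active_instants if i.timestamp > restore_to]
    return sorted(newer, key=lambda i: i.timestamp, reverse=True)


# Hypothetical timeline with four completed commits.
timeline = [Instant(t, "commit") for t in ("001", "002", "003", "004")]

# Restoring to "002" means rolling back "004" first, then "003".
rollback_order = [i.timestamp for i in instants_to_roll_back(timeline, "002")]
print(rollback_order)
```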

 

With restore, individual rollbacks are not published to the timeline. So, if 
restore fails midway, then on the 2nd attempt only a subset of the rollbacks 
(those executed during the 2nd attempt) will be applied to the metadata table. 
So, we need a plan for restore as well.

But with our enhancement that makes rollback publish a plan, rollback.requested 
can't be skipped and has to be published to the timeline. So, here is what 
will happen w/o a restore plan.

 

start restore

    rollback commit N

          rollback.requested for commit N // plan

          execute rollback, but do not publish to timeline, so this will not 
get applied to the metadata table. 

    rollback commit N-1

          rollback.requested for commit N-1 // plan

          execute rollback, but do not publish to timeline; again, this will 
not get applied to the metadata table. 

    ...

commit restore and publish. This will get applied to the metadata table. 

Once we are done committing the restore, we can remove all rollback.requested 
files if needed. 
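
A toy Python model of the flow above, under the assumption that the rollback 
results are held only in memory until the restore commit is published. All 
names (Crash, restore_without_plan, the lists standing in for the timeline and 
the metadata table) are hypothetical. It also shows what a 2nd attempt would 
see if the process died after two rollbacks. 

```python
class Crash(Exception):
    """Simulated process failure in the middle of a restore."""


def restore_without_plan(active_commits, metadata_table, fail_after=None):
    """Roll back every commit in `active_commits`, newest first.

    Individual rollbacks are not published; only the final restore commit
    reaches `metadata_table`.
    """
    collected = []  # in-memory only, so it is lost on a crash
    for done, commit in enumerate(sorted(active_commits, reverse=True)):
        if fail_after is not None and done == fail_after:
            raise Crash(f"failed after {done} rollbacks")
        active_commits.remove(commit)  # commit is no longer reported as active
        collected.append(commit)
    metadata_table.extend(collected)   # restore commit is published
    return collected


active = ["c1", "c2", "c3", "c4", "c5"]  # commits newer than the restore point
metadata_table = []

try:
    restore_without_plan(active, metadata_table, fail_after=2)
except Crash:
    pass  # c5 and c4 are rolled back, but nothing was published

# The retry only sees c3, c2 and c1, so c5 and c4 never reach the metadata table.
second_attempt = restore_without_plan(active, metadata_table)
print(second_attempt, metadata_table)
```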

 

Failure scenarios: 

If we fail after 2 rollbacks: 

on re-attempt, we will process only the remaining commits, since the active 
timeline may no longer report commit N and commit N-1 as active. So, we can do 
something like below w/ a restore plan.

 

start restore

    schedule rollback for all of them. 

    serialize all commit instants that need to be rolled back along with the 
rollback plan. // by now, we would have created the rollback.requested meta 
file for all commits that need to be rolled back. 

    now execute the rollbacks one by one. // do not publish to timeline once 
done. Also, the changes should not be applied to the metadata table. 

collect the rollback commit metadata from all individual rollbacks and create 
the restore commit metadata. There could be some commits which were already 
rolled back, and for those, we need to manually create the rollback metadata 
based on the rollback plan. More details in the next para. 

commit restore and publish. This will get applied to the metadata table. 
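
The scheduling half of this flow could look roughly like the Python sketch 
below. It is an illustration only: the dict standing in for the timeline, the 
key names such as "c5.rollback.requested" and "restore.requested", and the 
JSON layout of the plans are assumptions, not Hudi's actual file names or 
formats. 

```python
import json


def schedule_restore(commits_to_roll_back, files_by_commit, timeline):
    """Write a rollback.requested entry per commit, then the restore plan."""
    ordered = sorted(commits_to_roll_back, reverse=True)  # newest first
    for commit in ordered:
        # The rollback plan lists every file the rollback intends to delete.
        timeline[f"{commit}.rollback.requested"] = json.dumps(
            {"commit": commit, "files_to_delete": files_by_commit[commit]}
        )
    # The restore plan records all instants that need to be rolled back.
    timeline["restore.requested"] = json.dumps({"instants_to_roll_back": ordered})
    return ordered


timeline = {}  # hypothetical stand-in for the meta files on the timeline
files = {"c4": ["p1/f4.parquet"], "c5": ["p1/f5.parquet", "p2/f6.parquet"]}
plan = schedule_restore(["c4", "c5"], files, timeline)
print(plan, sorted(timeline))
```

Because every rollback plan and the restore plan exist before the first file 
is deleted, a later attempt does not have to rely on what the active timeline 
still reports. 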

 

Failures:

if we fail after the 2nd rollback:

on the 2nd attempt, we will look at the restore plan for all commits that need 
to be rolled back. We can't really roll back the first 2 since they are already 
rolled back. So, for those, we will manually create the rollback metadata from 
the rollback.requested meta file, and for the rest, we will follow the regular 
flow of executing the actual rollback and collecting the rollback metadata. Once 
complete, we will serialize all this info in the restore metadata, which gets 
applied to the metadata table. 
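
A rough Python sketch of this retry logic, assuming the restore plan and the 
per-commit rollback plans survived the first attempt. The function and 
parameter names are hypothetical, and the "source" field is added here only 
to make the two code paths visible. 

```python
def execute_restore(plan, rollback_plans, active_commits, run_rollback):
    """Return the rollback metadata for every commit in the restore plan.

    plan:            commits to roll back, newest first (from the restore plan)
    rollback_plans:  commit -> files listed in its rollback.requested plan
    active_commits:  commits the active timeline still reports
    run_rollback:    callable that performs a real rollback and returns
                     the list of files it deleted
    """
    restore_metadata = {}
    for commit in plan:
        if commit in active_commits:
            # Regular flow: execute the rollback and collect its metadata.
            deleted = run_rollback(commit)
            active_commits.remove(commit)
            source = "executed"
        else:
            # Already rolled back by an earlier attempt: rebuild the metadata
            # from the plan instead of rolling back again.
            deleted = list(rollback_plans[commit])
            source = "rebuilt-from-plan"
        restore_metadata[commit] = {"deleted_files": deleted, "source": source}
    return restore_metadata


rollback_plans = {"c5": ["f5"], "c4": ["f4"], "c3": ["f3"]}
active = {"c3"}  # c5 and c4 were rolled back before the first attempt died
metadata = execute_restore(
    ["c5", "c4", "c3"], rollback_plans, active, lambda c: rollback_plans[c]
)
print(metadata)
```

The resulting restore metadata covers all three commits, so the metadata 
table would see the complete set of deleted files even though only c3 was 
actually rolled back in this attempt. 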

 

Alternatives: since restore is anyway a destructive operation for which users 
are advised to stop all processes, we do have the option of cleaning up the 
metadata table and re-bootstrapping it completely once restore is complete. 

 

 

 

  was:
Fix restore by adding a requested instant and restore plan

 

Trying to see if we really need a plan. Dumping my thoughts here. 

Restore internally translates into N rollbacks. We fetch the active instants in 
reverse order from the timeline and trigger rollbacks one by one. We already 
have a patch that fixes rollback by adding the rollback plan to the 
rollback.requested meta file. So, walking through the failure scenarios. 

 

If 5 instants need to be rolled back, but the process crashed after 3 rollbacks: 
 * When we retry restore a 2nd time, only the pending 2 will be returned from 
the timeline as instants that need to be rolled back, and so we will roll back 
the remaining 2 commits/instants. The only missing piece is that the list of 
rollback metadata serialized as part of the restore commit metadata might miss 
the first 3 commits. Anyway, restore is a destructive operation, and I am not 
sure whether omitting the already rolled back commits from the restore commit 
metadata will cause any issues. 
 ** Metadata table: the first 3 would have been rolled back in the metadata 
table as well (applied as upserts), and so we should be fine when we retrigger 
the restore. The remaining 2 will get applied. 
 * If there was a crash while a rollback was inflight: 
 ** Let's say the rollback of C3 failed while in progress. When we re-attempt 
restore, we will try to roll back C3 again. With the fix for the rollback plan 
in place, we should be good, as we will continue the rollback and get it to 
completion, and then go on to roll back C2 and C1. 
 ** Metadata table: the first time, since the rollback of C3 failed while 
inflight, there won't be any trace of it in the metadata table. But when we 
retry a 2nd time, it should get applied to the metadata table. The rollback 
plan fix should ensure that the rollback commit metadata has all the file info 
from the original plan and not just the successfully deleted files, because in 
this case, only the pending files will be deleted the 2nd time. 
 ** If by chance one of the rollbacks gets committed to the metadata table and 
fails before getting committed to the data table: the 2nd rollback of the same 
instant is yet another delta commit to the metadata table, and so we should be 
good there too. We might end up instructing the metadata table to delete the 
same files repeatedly. 

 

Update:

I didn't realize that individual rollbacks are not published to the timeline 
as part of restore. So, if restore fails midway, then on the 2nd attempt only a 
subset of the rollbacks (those executed during the 2nd attempt) will be applied 
to the metadata table. So, we need a plan for restore as well. But some details 
on how to go about this need working out, because on a re-attempt, for the 
commits that were rolled back during the 1st attempt, we should skip rolling 
back again, but fetch the commit metadata from the rollback.completed file and 
add it to the restore metadata. 

Alternatives: since restore is anyway a destructive operation for which users 
are advised to stop all processes, we do have the option of cleaning up the 
metadata table and re-bootstrapping it completely once restore is complete. 

 

 

 


> Fix restore by adding a requested instant and restore plan
> ----------------------------------------------------------
>
>                 Key: HUDI-2432
>                 URL: https://issues.apache.org/jira/browse/HUDI-2432
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>             Fix For: 0.10.0
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
