Re: [PATCH v8 23/27] migration: Propagate last encountered error in vmstate_save_state_v() function

Arun Menon Mon, 04 Aug 2025 00:50:38 -0700

Hi Akihiko,

On Sun, Aug 03, 2025 at 04:15:16PM +0900, Akihiko Odaki wrote:
> On 2025/08/02 2:27, Arun Menon wrote:
> > Hi Akihiko,
> > Thanks for the review.
> > 
> > On Fri, Aug 01, 2025 at 04:35:34PM +0900, Akihiko Odaki wrote:
> > > On 2025/07/31 22:21, Arun Menon wrote:
> > > > Currently post_save hook is called without checking its return. If 
> > > > post_save
> > > > fails, we need to set an error and propagate it to the caller.
> > > > 
> > > > Since post_save hook is called regardless of whether there is a 
> > > > preceeding error,
> > > > it is possible that we have 2 distict errors, one from the preceeding 
> > > > function
> > > > call, and the other from the post_save call.
> > > > 
> > > > Return the latest error to the caller.
> > > 
> > > This needs to be explained better. This patch makes two changes on the
> > > behavior when there are two errors:
> > > 
> > > 1) Proceeding errors were propagated before, but they are now
> > >     dismissed.
> > > 2) post_error() errors were dismissed before, but they are now
> > >     propagated.
> > > 
> > > This message doesn't mention 1) at all. It does say 2) is necessary, but
> > > does not explain why.
> > 
> > Please correct me if I am wrong, does the following look ok?
> > 
> > Currently post_save() hook is called without checking its return.
> > post_save() hooks, generally does the cleanup, and that is the reason why
> > we have been dismissing its failure.
> > 
> > We want to consider
> >    - the saving of the device state (save or subsection save) and
> >    - running the cleanup after
> > as an atomic operation. And that is why, post_save is called regardless
> > of whether there is a preceeding error.
> > This means that, it is possible that we have 2 distict errors, one from
> > the preceeding function and the other from the post_save() function.
> > 
> > This patch makes two changes on the behavior when there are two errors:
> > 
> > 1) Preceeding errors were propagated before, but they are now
> >     dismissed if there is a new post_save() error.
> > 2) post_save() errors were dismissed before, but they are now
> >     propagated.
> > 
> > We intend to extend the error propagation feature (saving erros in
> > Error object and return to the caller) to the post/pre
> > save/load hooks. Therefore, it is important for the user to know
> > the exact reason of failure in case any of these hooks fail.
> > 
> > ====
> > I do not know much about the operations we do, or intend to do
> > using the post_save() call. But my guess is that we are going to
> > revert things or cleanup stuff, and if reverting/cleanup fails,
> > then the user should be notified about that.
> 
> All what you stated sounds correct to me and explains 2). But there is still
> no reasoning for 1).
> 
> Previously I listed three possible design options:
> > - propagate whatever error easier to do so
> > - somehow combine errors
> >   (languages that support exceptions often do this)
> > - make sure that no error will happen when there is a proceeding error
> >   by having two callbacks:
> >   - One callback that is called only when there is no proceeding error
> >     and can raise an error
> >   - Another that is always called but cannot raise an error
> 
> https://lore.kernel.org/qemu-devel/1ff92294-5c8c-4b33-89c1-91d37d6ac...@rsg.ci.i.u-tokyo.ac.jp/
> 
> It will be a good reasoning if we can show propagating post_save() errors is
> easier (or more important) than propagating proceeding errors.
> 
> There are alternative options. "Making sure that no error will happen when
> there is a proceeding error by having two callbacks" is one of them. You
> wrote:
> > We want to consider
> >    - the saving of the device state (save or subsection save) and
> >    - running the cleanup after
> > as an atomic operation. And that is why, post_save is called
> > regardless of whether there is a preceeding error.
> > This means that, it is possible that we have 2 distict errors, one
> > from the preceeding function and the other from the post_save()
> > function.
> 
> But it may not be mandatory. In fact, I searched "post_save" in the code
> base and found all implementations are for cleanup and always succeeds,
> which makes sense; most (if not all) cleanup operations like g_free() always
> succeeds in general.


post_save calls do succeed in general.
Now, we are introducing the post_save_errp() function which is an alternative 
to the
post_save() call. We are adding the feature to catch errors in Error object,
thereby clearly considering the case where post_save_erp can go wrong.
So we have already deviated from the old design that all post_save operations 
are
infalliable cleaup routines. The only catch is, these alternate functions are 
introduced
a couple of commits later.

If that is the case, then we need to handle both the errors, and propagate only
the most significant (which in my opinion is the post_save) one.

In other words, the reasoning for 1, 
        1) Proceeding errors were propagated before, but they are now dismissed.

is that,
a failure during the preceeding operation means that we failed in saving the 
state
but a failure during the post-save call, means that we failed in cleanup or 
restoring
back the system to a stable state, which is a more serious one, because it may 
lead to
the system being in an inconsistent state.
Can I write something on these lines, to justify the change?

> 
> So perhaps "making sure that no error will happen when there is a proceeding
> error" may be feasible, and it may not even require two callbacks because
> post_save() always succeeds.

I am sorry, I could not completely understand the part, where you mentioned that
we need to have 2 callbacks; I am assuming you are mentioning something like 
try..catch,
finally blocks from the other languages. 
post_save (that can not raise an error) being called in the goto block.

So if the main operation fails, we say goto cleanup, and then execute post_save 
which always
succeeds. If the main operation succeeds then we call post_save that can raise 
an error. Please correct
me if I am wrong.


> 
> Regards,
> Akihiko Odaki
> 

Regards,
Arun

Re: [PATCH v8 23/27] migration: Propagate last encountered error in vmstate_save_state_v() function

Reply via email to