Re: [PATCH v8 23/27] migration: Propagate last encountered error in vmstate_save_state_v() function

Akihiko Odaki Mon, 04 Aug 2025 12:25:09 -0700

On 2025/08/04 16:48, Arun Menon wrote:

Hi Akihiko,


On Sun, Aug 03, 2025 at 04:15:16PM +0900, Akihiko Odaki wrote:

On 2025/08/02 2:27, Arun Menon wrote:

Hi Akihiko,
Thanks for the review.

On Fri, Aug 01, 2025 at 04:35:34PM +0900, Akihiko Odaki wrote:

On 2025/07/31 22:21, Arun Menon wrote:

Currently post_save hook is called without checking its return. If post_save
fails, we need to set an error and propagate it to the caller.

Since post_save hook is called regardless of whether there is a preceeding 
error,
it is possible that we have 2 distict errors, one from the preceeding function
call, and the other from the post_save call.

Return the latest error to the caller.


This needs to be explained better. This patch makes two changes on the
behavior when there are two errors:

1) Proceeding errors were propagated before, but they are now
     dismissed.
2) post_error() errors were dismissed before, but they are now
     propagated.

This message doesn't mention 1) at all. It does say 2) is necessary, but
does not explain why.


Please correct me if I am wrong, does the following look ok?

Currently post_save() hook is called without checking its return.
post_save() hooks, generally does the cleanup, and that is the reason why
we have been dismissing its failure.

We want to consider
    - the saving of the device state (save or subsection save) and
    - running the cleanup after
as an atomic operation. And that is why, post_save is called regardless
of whether there is a preceeding error.
This means that, it is possible that we have 2 distict errors, one from
the preceeding function and the other from the post_save() function.

This patch makes two changes on the behavior when there are two errors:

1) Preceeding errors were propagated before, but they are now
     dismissed if there is a new post_save() error.
2) post_save() errors were dismissed before, but they are now
     propagated.

We intend to extend the error propagation feature (saving erros in
Error object and return to the caller) to the post/pre
save/load hooks. Therefore, it is important for the user to know
the exact reason of failure in case any of these hooks fail.

====
I do not know much about the operations we do, or intend to do
using the post_save() call. But my guess is that we are going to
revert things or cleanup stuff, and if reverting/cleanup fails,
then the user should be notified about that.


All what you stated sounds correct to me and explains 2). But there is still
no reasoning for 1).

Previously I listed three possible design options:

- propagate whatever error easier to do so
- somehow combine errors
   (languages that support exceptions often do this)
- make sure that no error will happen when there is a proceeding error
   by having two callbacks:
   - One callback that is called only when there is no proceeding error
     and can raise an error
   - Another that is always called but cannot raise an error


https://lore.kernel.org/qemu-devel/1ff92294-5c8c-4b33-89c1-91d37d6ac...@rsg.ci.i.u-tokyo.ac.jp/

It will be a good reasoning if we can show propagating post_save() errors is
easier (or more important) than propagating proceeding errors.

There are alternative options. "Making sure that no error will happen when
there is a proceeding error by having two callbacks" is one of them. You
wrote:

We want to consider
    - the saving of the device state (save or subsection save) and
    - running the cleanup after
as an atomic operation. And that is why, post_save is called
regardless of whether there is a preceeding error.
This means that, it is possible that we have 2 distict errors, one
from the preceeding function and the other from the post_save()
function.


But it may not be mandatory. In fact, I searched "post_save" in the code
base and found all implementations are for cleanup and always succeeds,
which makes sense; most (if not all) cleanup operations like g_free() always
succeeds in general.


post_save calls do succeed in general.
Now, we are introducing the post_save_errp() function which is an alternative 
to the
post_save() call. We are adding the feature to catch errors in Error object,
thereby clearly considering the case where post_save_erp can go wrong.
So we have already deviated from the old design that all post_save operations 
are
infalliable cleaup routines. The only catch is, these alternate functions are 
introduced
a couple of commits later.

If that is the case, then we need to handle both the errors, and propagate only
the most significant (which in my opinion is the post_save) one.

In other words, the reasoning for 1,
        1) Proceeding errors were propagated before, but they are now dismissed.

is that,
a failure during the preceeding operation means that we failed in saving the 
state
but a failure during the post-save call, means that we failed in cleanup or 
restoring
back the system to a stable state, which is a more serious one, because it may 
lead to
the system being in an inconsistent state.
Can I write something on these lines, to justify the change?

That is better, but also gives another question: when will "cleanup orrestoring back the system to a stable state" fail? We already know thatthe all post_save() always succeeds, so there should be some anticipateduse case that requires us to "deviate from the old design that allpost_save operations are infalliable cleanup routines".


So perhaps "making sure that no error will happen when there is a proceeding
error" may be feasible, and it may not even require two callbacks because
post_save() always succeeds.


I am sorry, I could not completely understand the part, where you mentioned that
we need to have 2 callbacks; I am assuming you are mentioning something like 
try..catch,
finally blocks from the other languages.
post_save (that can not raise an error) being called in the goto block.

Such two callbacks would handle two scenarios you previously wrote,respectively:

> We want to consider
>     - the saving of the device state (save or subsection save) and
>     - running the cleanup after
> as an atomic operation.

An analogy with try/catch/finally will indeed clarify the idea, though"catch" is not relevant here. It will look like as follows:


try {
  /* A pseudo function that represents the save operation. */
  save();

  /*
   * A new callback that can raise an error.
   * It will not be called if save() fails.
   */
  vmsd->commit_save();
} finally {
  /*
   * The old callback. Returning non-zero value is deprecated.
   * Any implementations that return a non-zero value should be
   * gradually converted to the new commit_save() callback.
   */
  vmsd->post_save();
}


So if the main operation fails, we say goto cleanup, and then execute post_save 
which always
succeeds. If the main operation succeeds then we call post_save that can raise 
an error. Please correct
me if I am wrong.


Yes, that's what I meant.

Regards,
Akihiko Odaki

Re: [PATCH v8 23/27] migration: Propagate last encountered error in vmstate_save_state_v() function

Reply via email to