Hi Akihiko,

On Wed, Sep 17, 2025 at 04:54:11PM +0900, Akihiko Odaki wrote:
> On 2025/09/17 9:34, Arun Menon wrote:
> > Hi Akihiko,
> > 
> > On Sat, Sep 06, 2025 at 05:22:31AM +0200, Akihiko Odaki wrote:
> > > On 2025/09/03 8:47, Arun Menon wrote:
> > > > Hi Akihiko,
> > > > 
> > > > It took some time to set up the machines; apologies for the delay in 
> > > > response.
> > > > 
> > > > On Mon, Sep 01, 2025 at 02:12:54AM +0900, Akihiko Odaki wrote:
> > > > > On 2025/09/01 1:38, Arun Menon wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > On Mon, Sep 01, 2025 at 01:04:40AM +0900, Akihiko Odaki wrote:
> > > > > > > On 2025/09/01 0:45, Arun Menon wrote:
> > > > > > > > Hi Akihiko,
> > > > > > > > Thanks for the review.
> > > > > > > > 
> > > > > > > > On Sat, Aug 30, 2025 at 02:58:05PM +0900, Akihiko Odaki wrote:
> > > > > > > > > On 2025/08/30 5:01, Arun Menon wrote:
> > > > > > > > > > This is an incremental step in converting vmstate loading
> > > > > > > > > > code to report error via Error objects instead of directly
> > > > > > > > > > printing it to console/monitor.
> > > > > > > > > > It is ensured that qemu_loadvm_state() must report an error
> > > > > > > > > > in errp, in case of failure.
> > > > > > > > > > 
> > > > > > > > > > When postcopy live migration runs, the device states are 
> > > > > > > > > > loaded by
> > > > > > > > > > both the qemu coroutine process_incoming_migration_co() and 
> > > > > > > > > > the
> > > > > > > > > > postcopy_ram_listen_thread(). Therefore, it is important 
> > > > > > > > > > that the
> > > > > > > > > > coroutine also reports the error in case of failure, with
> > > > > > > > > > error_report_err(). Otherwise, the source qemu will not 
> > > > > > > > > > display
> > > > > > > > > > any errors before going into the postcopy pause state.
> > > > > > > > > > 
> > > > > > > > > > Reviewed-by: Marc-André Lureau <marcandre.lur...@redhat.com>
> > > > > > > > > > Reviewed-by: Fabiano Rosas <faro...@suse.de>
> > > > > > > > > > Signed-off-by: Arun Menon <arme...@redhat.com>
> > > > > > > > > > ---
> > > > > > > > > >       migration/migration.c |  9 +++++----
> > > > > > > > > >       migration/savevm.c    | 30 
> > > > > > > > > > ++++++++++++++++++------------
> > > > > > > > > >       migration/savevm.h    |  2 +-
> > > > > > > > > >       3 files changed, 24 insertions(+), 17 deletions(-)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > > > > > > index 
> > > > > > > > > > 10c216d25dec01f206eacad2edd24d21f00e614c..c6768d88f45c870c7fad9b9957300766ff69effc
> > > > > > > > > >  100644
> > > > > > > > > > --- a/migration/migration.c
> > > > > > > > > > +++ b/migration/migration.c
> > > > > > > > > > @@ -881,7 +881,7 @@ process_incoming_migration_co(void 
> > > > > > > > > > *opaque)
> > > > > > > > > >                             MIGRATION_STATUS_ACTIVE);
> > > > > > > > > >           mis->loadvm_co = qemu_coroutine_self();
> > > > > > > > > > -    ret = qemu_loadvm_state(mis->from_src_file);
> > > > > > > > > > +    ret = qemu_loadvm_state(mis->from_src_file, 
> > > > > > > > > > &local_err);
> > > > > > > > > >           mis->loadvm_co = NULL;
> > > > > > > > > >           
> > > > > > > > > > trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
> > > > > > > > > > @@ -908,7 +908,8 @@ process_incoming_migration_co(void 
> > > > > > > > > > *opaque)
> > > > > > > > > >           }
> > > > > > > > > >           if (ret < 0) {
> > > > > > > > > > -        error_setg(&local_err, "load of migration failed: 
> > > > > > > > > > %s", strerror(-ret));
> > > > > > > > > > +        error_prepend(&local_err, "load of migration 
> > > > > > > > > > failed: %s: ",
> > > > > > > > > > +                      strerror(-ret));
> > > > > > > > > >               goto fail;
> > > > > > > > > >           }
> > > > > > > > > > @@ -924,13 +925,13 @@ fail:
> > > > > > > > > >           migrate_set_state(&mis->state, 
> > > > > > > > > > MIGRATION_STATUS_ACTIVE,
> > > > > > > > > >                             MIGRATION_STATUS_FAILED);
> > > > > > > > > >           migrate_set_error(s, local_err);
> > > > > > > > > > -    error_free(local_err);
> > > > > > > > > > +    error_report_err(local_err);
> > > > > > > > > 
> > > > > > > > > This is problematic because it results in duplicate error 
> > > > > > > > > reports when
> > > > > > > > > !mis->exit_on_error; in that case the query-migrate QMP 
> > > > > > > > > command reports the
> > > > > > > > > error and this error reporting is redundant.
> > > > > > > > 
> > > > > > > > If I comment this change, then all of the errors propagated up 
> > > > > > > > to now, using
> > > > > > > > error_setg() will not be reported. This is the place where it 
> > > > > > > > is finally reported,
> > > > > > > > when qemu_loadvm_state() fails. In other words, all the 
> > > > > > > > error_reports() we removed
> > > > > > > > from all the files, replacing them with error_setg(), will 
> > > > > > > > finally be reported here
> > > > > > > > using error_report_err().
> > > > > > > 
> > > > > > > My understanding of the code without these two changes is:
> > > > > > > - If the migrate-incoming QMP command is used with false as
> > > > > > >      exit-on-error, this function will not report the error but
> > > > > > >      the query-migrate QMP command will report the error.
> > > > > > > - Otherwise, this function reports the error.
> > > > > > 
> > > > > > With my limited experience in testing, I have a question,
> > > > > > So there are 2 scenarios,
> > > > > > 1. running the virsh migrate command on the source host. Something 
> > > > > > like the following,
> > > > > >      virsh -c 'qemu:///system' migrate --live --verbose --domain 
> > > > > > guest-vm --desturi qemu+ssh://10.6.120.20/system
> > > > > >      OR for postcopy-ram,
> > > > > >      virsh migrate guest-vm --live qemu+ssh://10.6.120.20/system 
> > > > > > --verbose --postcopy --timeout 10 --timeout-postcopy
> > > > > > 
> > > > > > 2. Using QMP commands, performing a migration from source to 
> > > > > > destination.
> > > > > >      Running something like the following on the destination:
> > > > > >      {
> > > > > >        "execute": "migrate-incoming",
> > > > > >        "arguments": {
> > > > > >          "uri": "tcp:127.0.0.1:7777",
> > > > > >          "exit-on-error": false
> > > > > >        }
> > > > > >      }
> > > > > >      {
> > > > > >        "execute": "migrate-incoming",
> > > > > >        "arguments": {
> > > > > >          "uri": "tcp:127.0.0.1:7777",
> > > > > >          "exit-on-error": false
> > > > > >        }
> > > > > >      }
> > > > > >      and the somthing like the following on source:
> > > > > >      {
> > > > > >        "execute": "migrate",
> > > > > >        "arguments": {
> > > > > >          "uri": "tcp:127.0.0.1:7777"
> > > > > >        }
> > > > > >      }
> > > > > >      {"execute" : "query-migrate"}
> > > > > > 
> > > > > > In 1, previously, the user used to get an error message on 
> > > > > > migration failure.
> > > > > > This was because there were error_report() calls in all of the 
> > > > > > files.
> > > > > > Now that they are replaced with error_setg() and the error is 
> > > > > > stored in errp,
> > > > > > we need to display that using error_report_err(). Hence I 
> > > > > > introduced an error_report_err()
> > > > > > call in the fail section.
> > > > > > 
> > > > > > In 2, we have 2 QMP sessions, one for the source and another for 
> > > > > > the destination.
> > > > > > The QMP command migrate will be issued on the source, and the errp 
> > > > > > will be set.
> > > > > > I did not understand the part where the message will be displayed 
> > > > > > because of the
> > > > > > error_report_err() call. I did not see such a message on failure 
> > > > > > scenario on both
> > > > > > the sessions.
> > > > > > If the user wants to check for errors, then the destination qemu 
> > > > > > will not exit
> > > > > > (exit-on-error = false ) and we can retrieve it using {"execute" : 
> > > > > > "query-migrate"}
> > > > > > 
> > > > > > Aren't the 2 scenarios different by nature?
> > > > > 
> > > > > In 1, doesn't libvirt query the error with query-migrate and print it?
> > > > 
> > > > Ideally it should find the the error, and print the whole thing. It 
> > > > does work
> > > > in the normal scenario. However, the postcopy scenario does not show 
> > > > the same result,
> > > > which is mentioned in the commit message.
> > > > 
> > > > > 
> > > > > In any case, it would be nice if you describe how libvirt interacts 
> > > > > with
> > > > > QEMU in 1.
> > > > 
> > > > Please find below the difference in the command output at source, when 
> > > > we run a live migration
> > > > with postcopy enabled.
> > > > 
> > > > =========
> > > > With the current changes:
> > > > [root@dell-per750-42 qemu-priv]# virsh migrate-setspeed guest-vm 1
> > > > 
> > > > [root@dell-per750-42 build]# virsh migrate guest-vm --live 
> > > > qemu+ssh://10.6.120.9/system --verbose --postcopy --timeout 10 
> > > > --timeout-postcopy
> > > > root@10.6.120.9's password:
> > > > Migration: [ 1.26 %]error: internal error: QEMU unexpectedly closed the 
> > > > monitor (vm='guest-vm'): 2025-09-03T06:19:15.076547Z 
> > > > qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested 
> > > > (2) exceeds the recommended cpus supported by KVM (1)
> > > > 2025-09-03T06:19:15.076586Z qemu-system-x86_64: -accel kvm: warning: 
> > > > Number of hotpluggable cpus requested (2) exceeds the recommended cpus 
> > > > supported by KVM (1)
> > > > 2025-09-03T06:19:27.776715Z qemu-system-x86_64: load of migration 
> > > > failed: Input/output error: error while loading state for instance 0x0 
> > > > of device 'tpm-emulator': post load hook failed for: tpm-emulator, 
> > > > version_id: 0, minimum_version: 0, ret: -5: tpm-emulator: Setting the 
> > > > stateblob (type 1) failed with a TPM error 0x21 decryption error
> > > > 
> > > > [root@dell-per750-42 build]#
> > > > 
> > > > =========
> > > > 
> > > > Without the current changes:
> > > > [root@dell-per750-42 qemu-priv]# virsh migrate-setspeed guest-vm 1
> > > > 
> > > > [root@dell-per750-42 qemu-priv]# virsh migrate guest-vm --live 
> > > > qemu+ssh://10.6.120.9/system --verbose --postcopy --timeout 10 
> > > > --timeout-postcopy
> > > > root@10.6.120.9's password:
> > > > Migration: [ 1.28 %]error: internal error: QEMU unexpectedly closed the 
> > > > monitor (vm='guest-vm'): 2025-09-03T06:26:17.733786Z 
> > > > qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested 
> > > > (2) exceeds the recommended cpus supported by KVM (1)
> > > > 2025-09-03T06:26:17.733830Z qemu-system-x86_64: -accel kvm: warning: 
> > > > Number of hotpluggable cpus requested (2) exceeds the recommended cpus 
> > > > supported by KVM (1)
> > > > 
> > > > [root@dell-per750-42 qemu-priv]#
> > > > 
> > > > =========
> > > > The original behavior was to print the error to the console regardless 
> > > > of whether the migration is normal or postcopy.
> > > 
> > > This was true for messages in qemu_loadvm_state(), but the message "load 
> > > of
> > > migration failed" was printed or queried with query-migrate, not both. We
> > > should think of which behavior is more appropriate, and I think we should
> > > avoid duplicate reports.
> > > 
> > > > The source machine goes in to a paused state after this.
> > > 
> > > The output is informative. It implies the destination machine exited, and 
> > > it
> > > makes sense to print error messages as it is done for
> > > mis->exit_on_error. I wonder if it is possible to detect the condition and
> > > treat it identically to mis->exit_on_error.
> > 
> > I see that we want to catch a specific scenario in postcopy ram migration
> > where the destination abruptly exits without a graceful shutdown,
> > thus failing to inform the source the reason for its failure through a
> > 'query-migrate' even though 'exit-on-error' was set to false on the 
> > destination.
> > 
> > However, I am not sure how to reliably detect the specific error condition 
> > of
> > such a connection close that you have described. Given that this is a large
> > patch series already, could we keep the current change as is for now?
> >  From what I can tell, the additional log message "load of migration failed"
> > is not a breaking change and will not cause a crash. We can develop a more
> > elegant solution to handle the issue of duplication in a separate patch.
> There are two regressions:
> 1) Duplicate error reports when exit-on-error is false and postcopy is
> disabled.

Thank you for your detailed response.
I am trying to fully understand the logic here, so please correct me if I'm 
wrong.
My understanding of the patch is that all errors are chained together
in local_err by propagating them through the call stack and prepending them with
"load of migration failed."
This creates a single, comprehensive error message, which is then passed to 
  migrate_set_error(s, local_err) in the fail section.
This line already existed. So in effect we are setting the error here.

Regarding the second regression you mentioned:

> 2) Errors reported code else qemu_loadvm_state() are ignored when
> exit-on-error is true.

Doesn't the error_prepend() function ensure that errors set higher up the call 
stack
are preserved and only "load of migration failed" is prepended to it?
When the fail section is reached, local_err will have the complete error 
message.

> 
> 1) is trivial yet difficult to fix and I agree that it can be handled later.
> Ideally there should be a comment to note that.
> 
> However, there is also 2), which is more serious and easier to fix so I
> suggest fixing it now. More concretely, the code will look like as follows:
> 
>     migrate_set_error(s, local_err);
>     if (mis->exit_on_error) {
>         error_free(local_err);
>     } else {
>         /*
>          * Report the error here in case that QEMU abruptly exits when
>          * postcopy is enabled.
>          */
>         error_report_err(s->error);
>     }
> 
>     migration_incoming_state_destroy();
> 
>     if (mis->exit_on_error) {
>         WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
>             error_report_err(s->error);
> 
> This ensures errors set by anyone will be reported while duplicate error
> reports are avoided when exit-on-error is true.

yes, this part (set by anyone), I fail to follow. Do you mean that there can be
chances of a race condition between migrate_set_error() and error_report_err()?

My analysis of your proposed code shows that whether mis->exit_on_error is true 
(in the final if block)
or false (in the else block), error_report_err() is called. 
This seems to result in the error being reported regardless, which is similar 
to my original patch.
Yes in both the cases, s->error is being reported. However in my view, in the 
fail section, both
local_err and s->error will be the same.

I appreciate your patience as I try to understand this better.


> 
> Regards,
> Akihiko Odaki
> 

Regards,
Arun Menon


Reply via email to