Hi Peter,

Thanks for the comments.
Updated in v2. 
Note: I have included a null check in the fail statement of
migration coroutine as well.
This is so that there are no dereferencing
issue in case local_err is not set in any of the paths.
(error_report_err call dereferences the err->msg)


On Wed, Jun 25, 2025 at 09:15:53AM -0400, Peter Xu wrote:
> On Wed, Jun 25, 2025 at 05:24:10PM +0530, Arun Menon wrote:
> > Hi Peter,
> 
> Hi, Arun,
> 
> [...]
> 
> > > >  static void vmstate_save_old_style(QEMUFile *f, SaveStateEntry *se,
> > > > @@ -2071,6 +2072,7 @@ static void *postcopy_ram_listen_thread(void 
> > > > *opaque)
> > > >  {
> > > >      MigrationIncomingState *mis = migration_incoming_get_current();
> > > >      QEMUFile *f = mis->from_src_file;
> > > > +    Error *local_err = NULL;
> > > >      int load_res;
> > > >      MigrationState *migr = migrate_get_current();
> > > >  
> > > > @@ -2089,7 +2091,7 @@ static void *postcopy_ram_listen_thread(void 
> > > > *opaque)
> > > >      qemu_file_set_blocking(f, true);
> > > >  
> > > >      /* TODO: sanity check that only postcopiable data will be loaded 
> > > > here */
> > > > -    load_res = qemu_loadvm_state_main(f, mis);
> > > > +    load_res = qemu_loadvm_state_main(f, mis, &local_err);
> > > 
> > > Here we captured the error but ignored it.  AFAIU it'll be the same as
> > > NULL..
> > > 
> > > Not sure if you tried to trigger such vTPM migration failure with postcopy
> > > yet.  AFAIU this path will be for that.  To achieve your goal and make 
> > > sure
> > > the error appears for postcopy too, you may want to make use of this
> > > local_err, probably by converting below (outside the diff context):
> > > 
> > >         qemu_file_set_error(f, load_res);
> > >         ...
> > >         
> > >         } else {
> > >             error_report("%s: loadvm failed: %d", __func__, load_res);
> > >             migrate_set_state(&mis->state, 
> > > MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > >                                            MIGRATION_STATUS_FAILED);
> > >         }
> > > 
> > > Into:
> > > 
> > >         error_prepend(...);
> > >         migrate_set_error(s, local_err);
> > > 
> > > Some test will be needed to make sure it works.
> > > 
> > > Side note: we really should have some migration failure tests on 
> > > mismatched
> > > devices or device states / configs.  We have a bunch of tests under
> > > migration-test.c.  Feel free to have a look if you like to add precopy /
> > > postcopy unit tests for such case.  It's also ok to leave that for later -
> > > I don't want to keep piling up work for you, and I already appreciate your
> > > help. :)
> > > 
> > Yes, the postcopy errors also need to be propagated.
> > I still need to figure out to build a test case around post-copy.
> > Maybe we can do that in another commit/ticket.
> 
> Yes, for now it'll be good enough if the fix can cover postcopy for the
> same issue.  Manual test should be OK to switch postcopy rightaway in the
> wrong vTPM setup, and one should observe failure already, checking the
> query-migrate on dest on errors.

I need to package this version of qemu and try it on beaker machine,
in order to have 2 bare metal instances for migration.
I tried locally, running 2 QEMU instances, but it fails because of 
a file descriptor error, probably because I am using the same 
qcow2 image to spin the vm.
If there is any better way to do it, please let me know. 
Up until now I was saving the state in to a file and restoring from it.
I suppose postcopy requires the migration to be "live", and 2 running
qemu instances are necessary.

> 
> It's possible that postcopy may still have an issue that might hang src
> QEMU if it fails exactly at device transfer stage; I may need to have a
> closer look later and see whether it's true and if it's fixable - I think
> it is.  The ideal way is if device state migration failed when
> postcopy_start(), then we should fail the migration and continue on src
> QEMU (like a precopy failure, as postcopy hasn't yet really started).

Yes the source QEMU hangs while trying with 2 qemu instances on the same 
machine,
attached to 2 separate vTPM devices (encrypted with differen secrets).
I suppose exit-on-error does not apply to the postcopy-ram capability because
the destination QEMU exited.

> 
> For this specific issue, as long as the error will pop up correctly on dest
> after postcopy failed, it should be good.
> 
> The auto test case can be for later.
> 
> [...]
> 
> > > > @@ -3094,27 +3099,24 @@ out:
> > > >      return ret;
> > > >  }
> > > >  
> > > > -int qemu_loadvm_state(QEMUFile *f)
> > > > +int qemu_loadvm_state(QEMUFile *f, Error **errp)
> > > >  {
> > > >      MigrationState *s = migrate_get_current();
> > > >      MigrationIncomingState *mis = migration_incoming_get_current();
> > > > -    Error *local_err = NULL;
> > > >      int ret;
> > > >  
> > > > -    if (qemu_savevm_state_blocked(&local_err)) {
> > > > -        error_report_err(local_err);
> > > > +    if (qemu_savevm_state_blocked(errp)) {
> > > 
> > > Another thing to be careful here: I didn't check whether errp can be NULL
> > > here, likely it can.
> > > 
> > > In that case we'd better keep local_err, because error_setg() (inside of
> > > qemu_savevm_state_blocked) will ignore the error otherwise..
> > > 
> > > static void error_setv(Error **errp,
> > >                        const char *src, int line, const char *func,
> > >                        ErrorClass err_class, const char *fmt, va_list ap,
> > >                        const char *suffix)
> > > {
> > >     Error *err;
> > >     int saved_errno = errno;
> > > 
> > >     if (errp == NULL) {
> > >         return;
> > >     }
> > >     assert(*errp == NULL);
> > >     ...
> > > }
> > > 
> > > So here we can keep local_err but use error_propagate().
> > 
> > Please correct me if I am wrong, as far as I have checked,
> > qemu_loadvm_state() is called at 3 places
> >  - load_snapshot
> >  - qmp_xen_load_devices_state
> >  - process_incoming_migration_co
> > and in all these functions, Error **errp is passed.
> > I did not find a function that passes NULL.
> > Is it still required to declare and pass a local_err object?
> 
> In that case (this should include all the callers of the three call sites
> never passing in NULL) we're almost asserting errp!=NULL, then it should be
> OK.
> 
> Thanks,
> 
> -- 
> Peter Xu
> 


Thanks,
Arun


Reply via email to