On Wed, Jun 25, 2025 at 05:24:10PM +0530, Arun Menon wrote:
> Hi Peter,

Hi, Arun,

[...]

> > >  static void vmstate_save_old_style(QEMUFile *f, SaveStateEntry *se,
> > > @@ -2071,6 +2072,7 @@ static void *postcopy_ram_listen_thread(void 
> > > *opaque)
> > >  {
> > >      MigrationIncomingState *mis = migration_incoming_get_current();
> > >      QEMUFile *f = mis->from_src_file;
> > > +    Error *local_err = NULL;
> > >      int load_res;
> > >      MigrationState *migr = migrate_get_current();
> > >  
> > > @@ -2089,7 +2091,7 @@ static void *postcopy_ram_listen_thread(void 
> > > *opaque)
> > >      qemu_file_set_blocking(f, true);
> > >  
> > >      /* TODO: sanity check that only postcopiable data will be loaded 
> > > here */
> > > -    load_res = qemu_loadvm_state_main(f, mis);
> > > +    load_res = qemu_loadvm_state_main(f, mis, &local_err);
> > 
> > Here we captured the error but ignored it.  AFAIU it'll be the same as
> > NULL..
> > 
> > Not sure if you tried to trigger such vTPM migration failure with postcopy
> > yet.  AFAIU this path will be for that.  To achieve your goal and make sure
> > the error appears for postcopy too, you may want to make use of this
> > local_err, probably by converting below (outside the diff context):
> > 
> >         qemu_file_set_error(f, load_res);
> >         ...
> >         
> >         } else {
> >             error_report("%s: loadvm failed: %d", __func__, load_res);
> >             migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> >                                            MIGRATION_STATUS_FAILED);
> >         }
> > 
> > Into:
> > 
> >         error_prepend(...);
> >         migrate_set_error(s, local_err);
> > 
> > Some test will be needed to make sure it works.
> > 
> > Side note: we really should have some migration failure tests on mismatched
> > devices or device states / configs.  We have a bunch of tests under
> > migration-test.c.  Feel free to have a look if you like to add precopy /
> > postcopy unit tests for such case.  It's also ok to leave that for later -
> > I don't want to keep piling up work for you, and I already appreciate your
> > help. :)
> > 
> Yes, the postcopy errors also need to be propagated.
> I still need to figure out to build a test case around post-copy.
> Maybe we can do that in another commit/ticket.

Yes, for now it'll be good enough if the fix can cover postcopy for the
same issue.  Manual test should be OK to switch postcopy rightaway in the
wrong vTPM setup, and one should observe failure already, checking the
query-migrate on dest on errors.

It's possible that postcopy may still have an issue that might hang src
QEMU if it fails exactly at device transfer stage; I may need to have a
closer look later and see whether it's true and if it's fixable - I think
it is.  The ideal way is if device state migration failed when
postcopy_start(), then we should fail the migration and continue on src
QEMU (like a precopy failure, as postcopy hasn't yet really started).

For this specific issue, as long as the error will pop up correctly on dest
after postcopy failed, it should be good.

The auto test case can be for later.

[...]

> > > @@ -3094,27 +3099,24 @@ out:
> > >      return ret;
> > >  }
> > >  
> > > -int qemu_loadvm_state(QEMUFile *f)
> > > +int qemu_loadvm_state(QEMUFile *f, Error **errp)
> > >  {
> > >      MigrationState *s = migrate_get_current();
> > >      MigrationIncomingState *mis = migration_incoming_get_current();
> > > -    Error *local_err = NULL;
> > >      int ret;
> > >  
> > > -    if (qemu_savevm_state_blocked(&local_err)) {
> > > -        error_report_err(local_err);
> > > +    if (qemu_savevm_state_blocked(errp)) {
> > 
> > Another thing to be careful here: I didn't check whether errp can be NULL
> > here, likely it can.
> > 
> > In that case we'd better keep local_err, because error_setg() (inside of
> > qemu_savevm_state_blocked) will ignore the error otherwise..
> > 
> > static void error_setv(Error **errp,
> >                        const char *src, int line, const char *func,
> >                        ErrorClass err_class, const char *fmt, va_list ap,
> >                        const char *suffix)
> > {
> >     Error *err;
> >     int saved_errno = errno;
> > 
> >     if (errp == NULL) {
> >         return;
> >     }
> >     assert(*errp == NULL);
> >     ...
> > }
> > 
> > So here we can keep local_err but use error_propagate().
> 
> Please correct me if I am wrong, as far as I have checked,
> qemu_loadvm_state() is called at 3 places
>  - load_snapshot
>  - qmp_xen_load_devices_state
>  - process_incoming_migration_co
> and in all these functions, Error **errp is passed.
> I did not find a function that passes NULL.
> Is it still required to declare and pass a local_err object?

In that case (this should include all the callers of the three call sites
never passing in NULL) we're almost asserting errp!=NULL, then it should be
OK.

Thanks,

-- 
Peter Xu


Reply via email to