Hi Peter, Thanks for the comments. Updated in v2. Note: I have included a null check in the fail statement of migration coroutine as well. This is so that there are no dereferencing issue in case local_err is not set in any of the paths. (error_report_err call dereferences the err->msg)
On Wed, Jun 25, 2025 at 09:15:53AM -0400, Peter Xu wrote: > On Wed, Jun 25, 2025 at 05:24:10PM +0530, Arun Menon wrote: > > Hi Peter, > > Hi, Arun, > > [...] > > > > > static void vmstate_save_old_style(QEMUFile *f, SaveStateEntry *se, > > > > @@ -2071,6 +2072,7 @@ static void *postcopy_ram_listen_thread(void > > > > *opaque) > > > > { > > > > MigrationIncomingState *mis = migration_incoming_get_current(); > > > > QEMUFile *f = mis->from_src_file; > > > > + Error *local_err = NULL; > > > > int load_res; > > > > MigrationState *migr = migrate_get_current(); > > > > > > > > @@ -2089,7 +2091,7 @@ static void *postcopy_ram_listen_thread(void > > > > *opaque) > > > > qemu_file_set_blocking(f, true); > > > > > > > > /* TODO: sanity check that only postcopiable data will be loaded > > > > here */ > > > > - load_res = qemu_loadvm_state_main(f, mis); > > > > + load_res = qemu_loadvm_state_main(f, mis, &local_err); > > > > > > Here we captured the error but ignored it. AFAIU it'll be the same as > > > NULL.. > > > > > > Not sure if you tried to trigger such vTPM migration failure with postcopy > > > yet. AFAIU this path will be for that. To achieve your goal and make > > > sure > > > the error appears for postcopy too, you may want to make use of this > > > local_err, probably by converting below (outside the diff context): > > > > > > qemu_file_set_error(f, load_res); > > > ... > > > > > > } else { > > > error_report("%s: loadvm failed: %d", __func__, load_res); > > > migrate_set_state(&mis->state, > > > MIGRATION_STATUS_POSTCOPY_ACTIVE, > > > MIGRATION_STATUS_FAILED); > > > } > > > > > > Into: > > > > > > error_prepend(...); > > > migrate_set_error(s, local_err); > > > > > > Some test will be needed to make sure it works. > > > > > > Side note: we really should have some migration failure tests on > > > mismatched > > > devices or device states / configs. We have a bunch of tests under > > > migration-test.c. Feel free to have a look if you like to add precopy / > > > postcopy unit tests for such case. It's also ok to leave that for later - > > > I don't want to keep piling up work for you, and I already appreciate your > > > help. :) > > > > > Yes, the postcopy errors also need to be propagated. > > I still need to figure out to build a test case around post-copy. > > Maybe we can do that in another commit/ticket. > > Yes, for now it'll be good enough if the fix can cover postcopy for the > same issue. Manual test should be OK to switch postcopy rightaway in the > wrong vTPM setup, and one should observe failure already, checking the > query-migrate on dest on errors. I need to package this version of qemu and try it on beaker machine, in order to have 2 bare metal instances for migration. I tried locally, running 2 QEMU instances, but it fails because of a file descriptor error, probably because I am using the same qcow2 image to spin the vm. If there is any better way to do it, please let me know. Up until now I was saving the state in to a file and restoring from it. I suppose postcopy requires the migration to be "live", and 2 running qemu instances are necessary. > > It's possible that postcopy may still have an issue that might hang src > QEMU if it fails exactly at device transfer stage; I may need to have a > closer look later and see whether it's true and if it's fixable - I think > it is. The ideal way is if device state migration failed when > postcopy_start(), then we should fail the migration and continue on src > QEMU (like a precopy failure, as postcopy hasn't yet really started). Yes the source QEMU hangs while trying with 2 qemu instances on the same machine, attached to 2 separate vTPM devices (encrypted with differen secrets). I suppose exit-on-error does not apply to the postcopy-ram capability because the destination QEMU exited. > > For this specific issue, as long as the error will pop up correctly on dest > after postcopy failed, it should be good. > > The auto test case can be for later. > > [...] > > > > > @@ -3094,27 +3099,24 @@ out: > > > > return ret; > > > > } > > > > > > > > -int qemu_loadvm_state(QEMUFile *f) > > > > +int qemu_loadvm_state(QEMUFile *f, Error **errp) > > > > { > > > > MigrationState *s = migrate_get_current(); > > > > MigrationIncomingState *mis = migration_incoming_get_current(); > > > > - Error *local_err = NULL; > > > > int ret; > > > > > > > > - if (qemu_savevm_state_blocked(&local_err)) { > > > > - error_report_err(local_err); > > > > + if (qemu_savevm_state_blocked(errp)) { > > > > > > Another thing to be careful here: I didn't check whether errp can be NULL > > > here, likely it can. > > > > > > In that case we'd better keep local_err, because error_setg() (inside of > > > qemu_savevm_state_blocked) will ignore the error otherwise.. > > > > > > static void error_setv(Error **errp, > > > const char *src, int line, const char *func, > > > ErrorClass err_class, const char *fmt, va_list ap, > > > const char *suffix) > > > { > > > Error *err; > > > int saved_errno = errno; > > > > > > if (errp == NULL) { > > > return; > > > } > > > assert(*errp == NULL); > > > ... > > > } > > > > > > So here we can keep local_err but use error_propagate(). > > > > Please correct me if I am wrong, as far as I have checked, > > qemu_loadvm_state() is called at 3 places > > - load_snapshot > > - qmp_xen_load_devices_state > > - process_incoming_migration_co > > and in all these functions, Error **errp is passed. > > I did not find a function that passes NULL. > > Is it still required to declare and pass a local_err object? > > In that case (this should include all the callers of the three call sites > never passing in NULL) we're almost asserting errp!=NULL, then it should be > OK. > > Thanks, > > -- > Peter Xu > Thanks, Arun