On 10/22/24 11:38, Avihai Horon wrote:
On 21/10/2024 19:54, Peter Xu wrote:
External email: Use caution opening links or attachments
On Mon, Oct 21, 2024 at 06:43:13PM +0200, Cédric Le Goater wrote:
Hello,
IIUC the migration thread should always see valid migration object, as it
takes one refcount at the entrance of migration_thread():
object_ref(OBJECT(s));
Could the migration have failed before ? in migrate_fd_connect()
I just noticed it's a vm state change notifier..
Yep.
I stumbled upon this bug by accident when running on a buggy machine.
Migration wasn't involved, I just started the VM, shut it down and got the assert
(as my VFIO device was faulty and errored on RUNNING->STOP state change).
You can repro it by forcefully triggering an error on *->STOP transition:
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 17199b73ae..d41cb7c634 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -831,7 +831,9 @@ static void vfio_vmstate_change(void *opaque, bool running,
RunState state)
}
ret = vfio_migration_set_state_or_reset(vbasedev, new_state, &local_err);
- if (ret) {
+ if (ret || new_state == VFIO_DEVICE_STATE_STOP) {
+ ret = -1;
+ error_setg(&local_err, "%s: forced error", vbasedev->name);
/*
* Migration should be aborted in this case, but vm_state_notify()
* currently does not support reporting failures.
If so, maybe VFIO could refer to its internal states showing that it's
during a migration first (so as to guarantee the migration object is valid;
e.g., only after save_setup() but before save_cleanup() hooks are invoked).
It's an option, but I think it's a bit awkward as we'd need to check that VFIOMigration->data_buffer is set
That's what I was looking at too. It works. It feels a bit awkward
indeed. We could hide the details in an helper routine though.
or add a new flag.
yes that's another solution.
Peter,
I wonder if we could grab a ref on current_migration in save_setup(),
store it under VFIOMigration and drop it save_cleanup() ?
Besides that, as Cedric pointed out, VFIO code calls
migration_is_setup_or_active() which can also be unsafe, as it might be invoked
after migration object has been freed.
Maybe a simpler solution would be to extend the the migration object lifetime?
Looking at commit history, you can see that commit 1f8956041ad3 ("migration:
finalize current_migration object") added migration object finalization at the very
end of qemu cleanup.
Then came commit 892ae715b6bc ("migration: Cleanup during exit") and moved the
migration object finalization to the beginning of qemu cleanup (before stopping the VM
etc.).
If so, the fix could be something like the below?
-------------8<-------------
diff --git a/include/migration/misc.h b/include/migration/misc.h
index bfadc5613b..5eb099349a 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -52,6 +52,7 @@ void dump_vmstate_json_to_file(FILE *out_fp);
/* migration/migration.c */
void migration_object_init(void);
+void migration_object_finalize(void);
void migration_shutdown(void);
bool migration_is_idle(void);
bool migration_is_active(void);
diff --git a/migration/migration.c b/migration/migration.c
index 021faee2f3..9eaa7947bc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -265,6 +265,11 @@ void migration_object_init(void)
dirty_bitmap_mig_init();
}
+void migration_object_finalize(void)
+{
+ object_unref(OBJECT(current_migration));
+}
+
typedef struct {
QEMUBH *bh;
QEMUBHFunc *cb;
@@ -330,7 +335,6 @@ void migration_shutdown(void)
* stop the migration using this structure
*/
migration_cancel(NULL);
- object_unref(OBJECT(current_migration));
/*
* Cancel outgoing migration of dirty bitmaps. It should
diff --git a/system/runstate.c b/system/runstate.c
index c2c9afa905..fa823f5e72 100644
--- a/system/runstate.c
+++ b/system/runstate.c
@@ -930,5 +930,6 @@ void qemu_cleanup(int status)
monitor_cleanup();
qemu_chr_cleanup();
user_creatable_cleanup();
+ migration_object_finalize();
/* TODO: unref root container, check all devices are ok */
}
-------------8<-------------
892ae715b6bc was trying to fix potential use-after-free issues.
I think it is safer to introduce an helper routine checking
(in some ways) if a migration is in progress than partially
revert 892ae715b6bc.
Thanks,
C.