On 10/22/24 11:38, Avihai Horon wrote:

On 21/10/2024 19:54, Peter Xu wrote:
External email: Use caution opening links or attachments


On Mon, Oct 21, 2024 at 06:43:13PM +0200, Cédric Le Goater wrote:
Hello,

IIUC the migration thread should always see valid migration object, as it
takes one refcount at the entrance of migration_thread():

      object_ref(OBJECT(s));
Could the migration have failed before ? in migrate_fd_connect()
I just noticed it's a vm state change notifier..

Yep.
I stumbled upon this bug by accident when running on a buggy machine.
Migration wasn't involved, I just started the VM, shut it down and got the assert 
(as my VFIO device was faulty and errored on RUNNING->STOP state change).

You can repro it by forcefully triggering an error on *->STOP transition:

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 17199b73ae..d41cb7c634 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -831,7 +831,9 @@ static void vfio_vmstate_change(void *opaque, bool running, 
RunState state)
      }

      ret = vfio_migration_set_state_or_reset(vbasedev, new_state, &local_err);
-    if (ret) {
+    if (ret || new_state == VFIO_DEVICE_STATE_STOP) {
+        ret = -1;
+        error_setg(&local_err, "%s: forced error", vbasedev->name);
          /*
           * Migration should be aborted in this case, but vm_state_notify()
           * currently does not support reporting failures.


If so, maybe VFIO could refer to its internal states showing that it's
during a migration first (so as to guarantee the migration object is valid;
e.g., only after save_setup() but before save_cleanup() hooks are invoked).

It's an option, but I think it's a bit awkward as we'd need to check that VFIOMigration->data_buffer is set

That's what I was looking at too. It works. It feels a bit awkward
indeed. We could hide the details in an helper routine though.

or add a new flag.

yes that's another solution.

Peter,

I wonder if we could grab a ref on current_migration in save_setup(),
store it under VFIOMigration and drop it save_cleanup() ?


Besides that, as Cedric pointed out, VFIO code calls 
migration_is_setup_or_active() which can also be unsafe, as it might be invoked 
after migration object has been freed.

Maybe a simpler solution would be to extend the the migration object lifetime?
Looking at commit history, you can see that commit 1f8956041ad3 ("migration: 
finalize current_migration object") added migration object finalization at the very 
end of qemu cleanup.
Then came commit 892ae715b6bc ("migration: Cleanup during exit") and moved the 
migration object finalization to the beginning of qemu cleanup (before stopping the VM 
etc.).

If so, the fix could be something like the below?

-------------8<-------------
diff --git a/include/migration/misc.h b/include/migration/misc.h
index bfadc5613b..5eb099349a 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -52,6 +52,7 @@ void dump_vmstate_json_to_file(FILE *out_fp);

  /* migration/migration.c */
  void migration_object_init(void);
+void migration_object_finalize(void);
  void migration_shutdown(void);
  bool migration_is_idle(void);
  bool migration_is_active(void);
diff --git a/migration/migration.c b/migration/migration.c
index 021faee2f3..9eaa7947bc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -265,6 +265,11 @@ void migration_object_init(void)
      dirty_bitmap_mig_init();
  }

+void migration_object_finalize(void)
+{
+    object_unref(OBJECT(current_migration));
+}
+
  typedef struct {
      QEMUBH *bh;
      QEMUBHFunc *cb;
@@ -330,7 +335,6 @@ void migration_shutdown(void)
       * stop the migration using this structure
       */
      migration_cancel(NULL);
-    object_unref(OBJECT(current_migration));

      /*
       * Cancel outgoing migration of dirty bitmaps. It should
diff --git a/system/runstate.c b/system/runstate.c
index c2c9afa905..fa823f5e72 100644
--- a/system/runstate.c
+++ b/system/runstate.c
@@ -930,5 +930,6 @@ void qemu_cleanup(int status)
      monitor_cleanup();
      qemu_chr_cleanup();
      user_creatable_cleanup();
+    migration_object_finalize();
      /* TODO: unref root container, check all devices are ok */
  }
-------------8<-------------

892ae715b6bc was trying to fix potential use-after-free issues.

I think it is safer to introduce an helper routine checking
(in some ways) if a migration is in progress than partially
revert 892ae715b6bc.

Thanks,

C.


Reply via email to