On 22/10/2024 16:24, Cédric Le Goater wrote:
External email: Use caution opening links or attachments


On 10/22/24 11:38, Avihai Horon wrote:

On 21/10/2024 19:54, Peter Xu wrote:
External email: Use caution opening links or attachments


On Mon, Oct 21, 2024 at 06:43:13PM +0200, Cédric Le Goater wrote:
Hello,

IIUC the migration thread should always see valid migration object, as it
takes one refcount at the entrance of migration_thread():

      object_ref(OBJECT(s));
Could the migration have failed before ? in migrate_fd_connect()
I just noticed it's a vm state change notifier..

Yep.
I stumbled upon this bug by accident when running on a buggy machine.
Migration wasn't involved, I just started the VM, shut it down and got the assert (as my VFIO device was faulty and errored on RUNNING->STOP state change).

You can repro it by forcefully triggering an error on *->STOP transition:

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 17199b73ae..d41cb7c634 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -831,7 +831,9 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
      }

      ret = vfio_migration_set_state_or_reset(vbasedev, new_state, &local_err);
-    if (ret) {
+    if (ret || new_state == VFIO_DEVICE_STATE_STOP) {
+        ret = -1;
+        error_setg(&local_err, "%s: forced error", vbasedev->name);
          /*
           * Migration should be aborted in this case, but vm_state_notify()
           * currently does not support reporting failures.


If so, maybe VFIO could refer to its internal states showing that it's
during a migration first (so as to guarantee the migration object is valid; e.g., only after save_setup() but before save_cleanup() hooks are invoked).

It's an option, but I think it's a bit awkward as we'd need to check that VFIOMigration->data_buffer is set

That's what I was looking at too. It works. It feels a bit awkward
indeed. We could hide the details in an helper routine though.

or add a new flag.

yes that's another solution.

Peter,

I wonder if we could grab a ref on current_migration in save_setup(),
store it under VFIOMigration and drop it save_cleanup() ?


Besides that, as Cedric pointed out, VFIO code calls migration_is_setup_or_active() which can also be unsafe, as it might be invoked after migration object has been freed.

Maybe a simpler solution would be to extend the the migration object lifetime? Looking at commit history, you can see that commit 1f8956041ad3 ("migration: finalize current_migration object") added migration object finalization at the very end of qemu cleanup. Then came commit 892ae715b6bc ("migration: Cleanup during exit") and moved the migration object finalization to the beginning of qemu cleanup (before stopping the VM etc.).

If so, the fix could be something like the below?

-------------8<-------------
diff --git a/include/migration/misc.h b/include/migration/misc.h
index bfadc5613b..5eb099349a 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -52,6 +52,7 @@ void dump_vmstate_json_to_file(FILE *out_fp);

  /* migration/migration.c */
  void migration_object_init(void);
+void migration_object_finalize(void);
  void migration_shutdown(void);
  bool migration_is_idle(void);
  bool migration_is_active(void);
diff --git a/migration/migration.c b/migration/migration.c
index 021faee2f3..9eaa7947bc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -265,6 +265,11 @@ void migration_object_init(void)
      dirty_bitmap_mig_init();
  }

+void migration_object_finalize(void)
+{
+    object_unref(OBJECT(current_migration));
+}
+
  typedef struct {
      QEMUBH *bh;
      QEMUBHFunc *cb;
@@ -330,7 +335,6 @@ void migration_shutdown(void)
       * stop the migration using this structure
       */
      migration_cancel(NULL);
-    object_unref(OBJECT(current_migration));

      /*
       * Cancel outgoing migration of dirty bitmaps. It should
diff --git a/system/runstate.c b/system/runstate.c
index c2c9afa905..fa823f5e72 100644
--- a/system/runstate.c
+++ b/system/runstate.c
@@ -930,5 +930,6 @@ void qemu_cleanup(int status)
      monitor_cleanup();
      qemu_chr_cleanup();
      user_creatable_cleanup();
+    migration_object_finalize();
      /* TODO: unref root container, check all devices are ok */
  }
-------------8<-------------

892ae715b6bc was trying to fix potential use-after-free issues.

Yes, and it did, by ref/unref current_migration in migration_thread().
In addition, it added a migrate_fd_cancel() in the beginning of qemu cleanup to encourage migration_thread to quit. IIUC, these are the core changes of the commit, and the only reason to move "object_unref(OBJECT(current_migration))" from the end of qemu cleanup to the beginning of it was to put migration cleanup code in one place.

That's why I think moving "object_unref(OBJECT(current_migration))" back to the end of qemu cleanup is safe and shouldn't make much difference. It will delay freeing of MigrationState right to the end and will allow us to use the already existing helper (migration_is_setup_or_active()) to check migration status.

IMHO this seems like the cleanest solution, unless I am missing something.

Thanks.


I think it is safer to introduce an helper routine checking
(in some ways) if a migration is in progress than partially
revert 892ae715b6bc.

Thanks,

C.


Reply via email to