Stefan Hajnoczi <[email protected]> writes:

> On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
>> From: Thomas Huth <[email protected]>
>> 
>> When shutting down a guest that is currently in the process of being
>> migrated, there is a chance that QEMU might crash during bdrv_delete().
>> The backtrace looks like this:
>> 
>>  Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.
>> 
>>  [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
>>  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>  5560                QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
>>  (gdb) bt
>>  #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>  #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
>>  Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0
>> 

How does the migration thread reach this point? Is this from
migration_block_inactivate()?

>> The problem is apparently that the migration thread is still active
>> (migration_shutdown() only asks it to stop the current migration, but
>> does not wait for it to finish)

"Asks it to stop" is generous; it is more like pulling the plug abruptly.
Note that setting the CANCELLING state technically has nothing to do with
this. The actual cancelling is done by the not-so-gentle:

if (s->to_dst_file) {
    qemu_file_shutdown(s->to_dst_file);
}
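
To illustrate why I call that abrupt, here is a minimal standalone sketch,
not QEMU code. It assumes the channel is a socket and that shutting it down
ends up in something like shutdown(2), which I have not re-checked here. The
names Reader, reader_fn and run_shutdown_demo are made up for the example.
The reader never gets asked anything; its blocked recv() simply returns
once the socket is shut down underneath it.

```c
#include <pthread.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a thread doing blocking I/O on a channel. */
typedef struct Reader {
    int fd;
    ssize_t ret;    /* what recv() returned; -2 means "never ran" */
} Reader;

static void *reader_fn(void *opaque)
{
    Reader *r = opaque;
    char buf[16];

    /* The peer never sends, so this blocks until the socket is shut down. */
    r->ret = recv(r->fd, buf, sizeof(buf), 0);
    return NULL;
}

/* Returns the value recv() produced in the reader thread, or -1 on setup error. */
int run_shutdown_demo(void)
{
    int sv[2];
    Reader r;
    pthread_t tid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return -1;
    }
    r.fd = sv[0];
    r.ret = -2;

    if (pthread_create(&tid, NULL, reader_fn, &r) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* "Pull the plug": no request to the thread, its I/O just stops working. */
    shutdown(sv[0], SHUT_RDWR);

    pthread_join(tid, NULL);
    close(sv[0]);
    close(sv[1]);
    return (int)r.ret;
}
```

The outcome does not depend on whether shutdown() runs before or after the
reader enters recv(); either way the reader sees end of stream rather than
a polite request to stop.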
 
>> , while the main thread continues to
>> bdrv_close_all() that will destroy all block drivers. So the two threads
>> are racing here for the destruction of the migration-related block drivers.
>> 
>> I was able to bisect the problem and the race has apparently been introduced
>> by commit c2a189976e211c9ff782 ("migration/block-active: Remove global active
>> flag"), so reverting it might be an option as well, but waiting for the
>> migration thread to finish before continuing with the further clean-ups
>> during shutdown seems less intrusive.
>> 
>> Note: I used the Claude AI assistant for analyzing the crash, and it
>> came up with the idea of waiting for the migration thread to finish
>> in migration_shutdown() before proceeding with the further clean-up,
>> but the patch itself has been 100% written by myself.
>
> It sounds like the migration thread does not hold block graph refcounts
> and assumes the BlockDriverStates it uses have a long enough lifetime.
>
> I don't know the migration code well enough to say whether joining in
> migration_shutdown() is okay. Another option would be explicitly holding
> the necessary refcounts in the migration thread.
>

I agree, both in principle and because moving the join around feels prone
to introducing other bugs.
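
Roughly what I have in mind with holding references, as a standalone sketch
rather than a patch. Node, node_ref(), node_unref() and run_refcount_demo()
are hypothetical stand-ins, and the atomic counter is only there to keep the
example short; bdrv_ref()/bdrv_unref() have their own locking rules that
this does not model. The point is only that the object lives until the last
user drops it, regardless of which thread that is.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted block node; not QEMU's struct. */
typedef struct Node {
    atomic_int refcnt;
    int payload;
} Node;

static atomic_int nodes_freed;  /* how many nodes were actually freed */

static Node *node_new(int payload)
{
    Node *n = malloc(sizeof(*n));
    if (!n) {
        abort();
    }
    atomic_init(&n->refcnt, 1);  /* the creator owns the first reference */
    n->payload = payload;
    return n;
}

static void node_ref(Node *n)
{
    atomic_fetch_add(&n->refcnt, 1);
}

static void node_unref(Node *n)
{
    /* fetch_sub returns the old value: 1 means we dropped the last ref. */
    if (atomic_fetch_sub(&n->refcnt, 1) == 1) {
        free(n);
        atomic_fetch_add(&nodes_freed, 1);
    }
}

typedef struct Worker {
    Node *node;
    int seen;
} Worker;

static void *worker_fn(void *opaque)
{
    Worker *w = opaque;

    w->seen = w->node->payload;  /* safe: the worker owns a reference */
    node_unref(w->node);         /* drop it when done */
    return NULL;
}

/* Returns the payload the worker observed. */
int run_refcount_demo(void)
{
    Node *n = node_new(42);
    Worker w = { .node = n, .seen = -1 };
    pthread_t tid;

    node_ref(n);  /* taken on the worker's behalf before it starts */
    if (pthread_create(&tid, NULL, worker_fn, &w) != 0) {
        abort();
    }

    node_unref(n);  /* main-thread teardown drops only its own reference */

    /* Joined only so the demo can read w.seen; lifetime does not need it. */
    pthread_join(tid, NULL);
    return w.seen;
}
```

Here the teardown path never needs to know whether the worker is still
running, which is what makes this less fragile than ordering the join.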

>> 
>> Fixes: c2a189976e ("migration/block-active: Remove global active flag")
>> Signed-off-by: Thomas Huth <[email protected]>
>> ---
>>  migration/migration.c | 24 ++++++++++++++++++------
>>  1 file changed, 18 insertions(+), 6 deletions(-)
>> 
>> diff --git a/migration/migration.c b/migration/migration.c
>> index b316ee01ab2..6f4bb6d8438 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -380,6 +380,16 @@ void migration_bh_schedule(QEMUBHFunc *cb, void *opaque)
>>      qemu_bh_schedule(bh);
>>  }
>>  
>> +static void migration_thread_join(MigrationState *s)
>> +{
>> +    if (s && s->migration_thread_running) {
>> +        bql_unlock();
>> +        qemu_thread_join(&s->thread);
>> +        s->migration_thread_running = false;
>> +        bql_lock();
>> +    }
>> +}
>> +
>>  void migration_shutdown(void)
>>  {
>>      /*
>> @@ -393,6 +403,13 @@ void migration_shutdown(void)
>>       * stop the migration using this structure
>>       */
>>      migration_cancel();
>> +    /*
>> +     * Wait for migration thread to finish to prevent a possible race where
>> +     * the migration thread is still running and accessing host block drivers
>> +     * while the main cleanup proceeds to remove them in bdrv_close_all()
>> +     * later.
>> +     */
>> +    migration_thread_join(migrate_get_current());
>>      object_unref(OBJECT(current_migration));
>>  
>>      /*
>> @@ -1499,12 +1516,7 @@ static void migration_cleanup(MigrationState *s)
>>  
>>      close_return_path_on_source(s);
>>  
>> -    if (s->migration_thread_running) {
>> -        bql_unlock();
>> -        qemu_thread_join(&s->thread);
>> -        s->migration_thread_running = false;
>> -        bql_lock();
>> -    }
>> +    migration_thread_join(s);
>>  
>>      WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
>>          /*
>> -- 
>> 2.52.0
>> 
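
For readers less familiar with the pattern in the helper above (drop the
lock, join, retake the lock, clear the flag so a second call is a no-op),
here is a standalone sketch with a plain pthread mutex standing in for the
BQL. Job, job_fn(), job_join() and run_join_demo() are made-up names. I am
assuming the lock is dropped because the thread may need it to finish; the
worker below is written that way on purpose, so joining with the lock held
would deadlock.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a big global lock; not the real BQL. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct Job {
    pthread_t thread;
    bool running;   /* only touched with big_lock held */
    int result;
} Job;

static void *job_fn(void *opaque)
{
    Job *j = opaque;

    /* The worker needs the lock to finish its last step. */
    pthread_mutex_lock(&big_lock);
    j->result = 7;
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

/* Caller must hold big_lock. Safe to call more than once. */
static void job_join(Job *j)
{
    if (j && j->running) {
        pthread_mutex_unlock(&big_lock);  /* let the worker make progress */
        pthread_join(j->thread, NULL);
        pthread_mutex_lock(&big_lock);
        j->running = false;               /* later calls become no-ops */
    }
}

/* Returns the worker's result, or -1 if the thread could not be created. */
int run_join_demo(void)
{
    Job j = { .running = false, .result = 0 };
    int result;

    pthread_mutex_lock(&big_lock);
    if (pthread_create(&j.thread, NULL, job_fn, &j) != 0) {
        pthread_mutex_unlock(&big_lock);
        return -1;
    }
    j.running = true;

    job_join(&j);  /* e.g. the shutdown path */
    job_join(&j);  /* e.g. the cleanup path afterwards: does nothing */

    result = j.result;
    pthread_mutex_unlock(&big_lock);
    return result;
}
```

Unlike the patch, the sketch clears the flag after retaking the lock, which
is just a choice made for the example.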
