On 11/25/24 7:07 PM, Peter Xu wrote:
> On Mon, Nov 25, 2024 at 11:46:11AM -0300, Fabiano Rosas wrote:
>> Currently a VM that has been target of a migration using
>> late-block-activate will crash at the end of a new migration (with it
>> as source) when releasing ownership of the disks due to the VM having
>> never taken ownership of the disks in the first place.
>>
>> The issue is that late-block-activate expects a qmp_continue command
>> to be issued at some point on the destination VM after the migration
>> finishes. If the user decides to never continue the VM, but instead
>> issue a new migration, then bdrv_activate_all() will never be called
>> and the assert will be reached:
>>
>> bdrv_inactivate_recurse: Assertion `!(bs->open_flags &
>> BDRV_O_INACTIVE)' failed.
>>
>> Fix the issue by checking at the start of migration if the VM is
>> paused and call bdrv_activate_all() before migrating. Even if the
>> late-block-activate capability is not in play or if the VM has been
>> paused manually, there is no harm calling that function again.
>>
>> Signed-off-by: Fabiano Rosas <faro...@suse.de>
>> ---
>>  migration/migration.c | 19 +++++++++++++++++++
>>  1 file changed, 19 insertions(+)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index aedf7f0751..26af30137b 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2029,6 +2029,25 @@ static bool migrate_prepare(MigrationState *s, bool
>> resume, Error **errp)
>>          return false;
>>      }
>>
>> +    /*
>> +     * The VM might have been target of a previous migration. If it
>> +     * was in the paused state then nothing will have required the
>> +     * block layer to be activated. Do it now to ensure this QEMU
>> +     * instance owns the disk locks.
>> +     */
>> +    if (!resume && runstate_check(RUN_STATE_PAUSED)) {
>
> I hope this will cover all the cases that QEMU could overlook.. or I'm not
> sure whether we could invoke bdrv_activate_all() unconditionally, as it
> looks like safe to be used if all disks are active already.
>
> I wished we don't need to bother with disk activation status at all,
> because IIUC migration could work all fine even if all disks are inactivate
> when preparing migration.. hence such change always looks like a workaround
> of a separate issue.
>
>> +        Error *local_err = NULL;
>> +
>> +        g_assert(bql_locked());
>> +
>> +        bdrv_activate_all(&local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return false;
>> +        }
>> +        s->block_inactive = false;
>
> This var so far was only used within one migration iteration, and the var
> was only set in migration_completion_precopy() so far. Now we're resetting
> it upfront of a migration. I'm not 100% sure if it's needed, or should be
> put somewhere else.
>
This variable is unconditionally set in migration_completion_precopy() and
used as a flag for whether or not we should be deactivating disks in
qemu_savevm_state_complete_precopy():

>     /*
>      * Inactivate disks except in COLO, and track that we have done so in order
>      * to remember to reactivate them if migration fails or is cancelled.
>      */
>     s->block_inactive = !migrate_colo();
>     migration_rate_set(RATE_LIMIT_DISABLED);
>
>     ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
>                                              s->block_inactive);

Ideally we'd like to take our paused state into account right here and skip
inactivation.  However, at this point during the 2nd migration (in the paused
state) our current_run_state == RUN_STATE_FINISH_MIGRATE.  So if we truly
wanted to skip the unnecessary bdrv_activate_all(), we'd have to remember our
paused state somewhere earlier on the call stack and pass an additional flag
for that (a rough sketch of what I mean is at the bottom of this mail).
Personally I don't think this is any cleaner than just blatantly calling
bdrv_activate_all().

> In general, I saw the mention of other places that may also try to
> invalidate disks that used to be invalidated.  If that's the case, I wish
> we don't need to touch migration code at all, but instead if block layer
> can cope with "invalidate on top of invalidated disks" it'll be perfect.
>

This sounds similar to my initial very naive fix, which we decided not to take:

https://lists.nongnu.org/archive/html/qemu-devel/2024-09/msg04254.html

>> +    }
>> +
>>
>>      return true;
>>  }
>>
>> --
>> 2.35.3
>>
>
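For illustration, here is the rough, untested sketch I had in mind above.  The
start_paused field is made up for this example; everything else mirrors the
migrate_prepare() / migration_completion_precopy() code quoted earlier in this
thread:

    /*
     * Sketch only -- 'start_paused' would be a hypothetical new field in
     * MigrationState; it does not exist today.
     */

    /* In migrate_prepare(), while the run state still tells us something: */
    s->start_paused = !resume && runstate_check(RUN_STATE_PAUSED);

    /*
     * In migration_completion_precopy(), skip inactivation for a VM that
     * started out paused and thus never activated its disks:
     */
    s->block_inactive = !migrate_colo() && !s->start_paused;
    migration_rate_set(RATE_LIMIT_DISABLED);

    ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
                                             s->block_inactive);

It would avoid the extra bdrv_activate_all() call, but it spreads knowledge of
the destination's paused state across two otherwise unrelated places, which is
why the unconditional activation in migrate_prepare() still looks preferable
to me.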