On 25/06/2024 15:53, Peter Xu wrote:
> On Tue, Jun 25, 2024 at 12:38:50PM +0100, Joao Martins wrote:
>> On 24/06/2024 20:41, Peter Xu wrote:
>>> On Fri, Jun 21, 2024 at 07:32:21AM -0700, Elena Ufimtseva wrote:
>>>> @@ -2659,6 +2698,18 @@ qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis,
>>>>      if (!check_section_footer(f, se)) {
>>>>          return -EINVAL;
>>>> @@ -2714,6 +2765,19 @@ qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis,
>>>>                                      se->instance_id, end_ts - start_ts);
>>>>      }
>>>>  
>>>> +    if (migrate_switchover_abort() && type == QEMU_VM_SECTION_END &&
>>>> +        mis->downtime_start) {
>>>> +        mis->downtime_now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>>> +        uint64_t dst_downtime = mis->downtime_now - mis->downtime_start;
>>>> +        if (mis->src_downtime + dst_downtime >= mis->abort_limit) {
>>>> +            error_report("Shutdown destination migration, migration abort_limit (%lu ms)"
>>>> +                          "was reached.", mis->abort_limit);
>>>> +            trace_qemu_loadvm_downtime_abort(mis->abort_limit, dst_downtime,
>>>> +                                             mis->src_downtime);
>>>> +            return -EINVAL;
>>>> +        }
>>>> +    }
>>>
>>> So this traps both iteration and non-iteration phase.  What if the downtime
>>> was caused by after these, or unluckily at the last non-iterator device
>>> state?
>>>
>>> After trying to think about it, I figured this is not easy to do right.
>>> Also, I start to doubt whether it's definitely a good idea on having this
>>> in the first place.
>>>
>>> Firstly, I'm wondering how we should treat this new feature
>>> v.s. downtime_limit.  It's about how the user should set both.
>>>
>>> If this is about "cancel migration if downtime more than xxx",
>>> then.. AFAICT that's exactly what downtime_limit was "designed" to be..
>>> It's just that downtime_limit says the other way round, as: "don't
>>> switchover if the downtime will be more than xxx".
>>>
>>> Then, if the user has option to set both these parameters, what would be
>>> the right thing to do?  Shouldn't they simply always set both parameters to
>>> be the same value already?  But if so, what's the point on having two?
>>>
>>> This caused me to think whether the 2nd parameter is needed at all, instead
>>> whether we should simply make downtime_limit more accurate, so that it will
>>> start to account more things than before.  It won't be accurate, but the
>>> same case here: making this new feature "accurate" can also be very hard.
>>>
>>
>> The way I think about it is that the current downtime-limit captures the data
>> part nicely, as the only calculation it cares about is how much outstanding
>> data it sends to the destination (be it VF device state or RAM). This second
>> parameter captures what is *not* related to data, i.e. the cost of the
>> hypervisor quiescing the VM or added latencies in the hypervisor *in addition*
>> to sending outstanding data to the destination.
>>
>> If we were to merge this all into a single parameter (aka downtime-limit) we
>> are possibly artificially increasing the downtime by relaxing the outstanding
>> data part, as opposed to trying to capture the switchover cost (hence the
>> parameter name) which can't be reflected in the former equation.
>>
>> So I feel like this needs two parameters, whereby the second new parameter
>> covers the 'switchover cost' (hence the name) with the current migration
>> algorithm.
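
(To put the split in rough terms, the blackout we are trying to bound
decomposes into something like:

    expected downtime ~= outstanding_data / bandwidth  +  switchover_cost

downtime-limit only ever constrains the first term, via the reported
outstanding data versus bandwidth; the second term is what the new parameter
is meant to capture. Just an illustrative decomposition, not an exact formula.)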
> 
> Then the question is how should we suggest the user to specify these two
> parameters.
> 
> The cover letter used:
> 
>   migrate_set_parameter downtime-limit 300
>   migrate_set_parameter switchover-limit 10
> 
> But it does look like a testing sample only and not valid in real world
> setup, as logically the new limit should be larger than the old one,
> afaict.  If the new limit is larger, how larger should it be?
> 
> So far I can understand how it works internally, but even with that I don't
> know how I should set this parameter, e.g., when downtime_limit used to be
> 300ms, I'd say 500ms could be a suitable value, but is it?  In that case,
> perhaps the 500ms is the "real" limit that I don't want a VM to be halted
> for more than 500ms, but in that case the user should have already setup
> downtime_limit to 500ms.
> 

Yeap -- I think you're right that it presents a UX confusion about what a user
should set.

> I also don't know how should an user understand all these details and
> figure out how to set these.  Note that currently we definitely allow the
> user to specify downtime_limit and it's so far understandable, even though
> the user may assume that contains the whole blackout phase (that's also
> what we wish it can achieve..), rather than the fact that it only takes
> ram/vfio/... reported remaining data into account v.s. the bandwidth.
> 
> Maybe we could consider introducing the parameters/caps in some other form,
> so that we can keep the downtime_limit to be "the total downtime", instead
> of a new definition of downtime.  E.g., it's at least not fair indeed to
> take the whole "downtime_limit" for data transfers, so maybe some ratio
> parameter can be introduced to say how much portion of that downtime can be
> used for data transfer 

I like this idea -- it fixes another problem, namely that downtime-limit alone
makes the algorithm too optimistic, just utilizing the whole bandwidth (e.g.
migrating right after the first dirty sync, depending on downtime-limit).

e.g. "iterable-downtime" (or "precopy-downtime", for lack of a better idea) for
such a property.

Then when we set downtime-limit it can represent the whole thing, including
blackout, as is the case today, and we tune iterable-downtime downwards to
leave enough headroom for switchover.
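
Very roughly, something like the below (just a sketch; "iterable_downtime" is a
made-up parameter name, and units/field names may not match what we'd actually
merge):

    /*
     * Sketch only: let the iterable (data) budget drive the switchover
     * decision, while downtime-limit keeps bounding the whole blackout.
     * bandwidth is assumed to be in bytes per millisecond here.
     */
    uint64_t iterable_limit = MIN(s->parameters.iterable_downtime,
                                  s->parameters.downtime_limit);

    /* only the data-transfer budget decides when we attempt switchover */
    s->threshold_size = bandwidth * iterable_limit;

    if (pending_size <= s->threshold_size) {
        /*
         * Whatever is left of downtime-limit beyond iterable_limit is the
         * headroom reserved for the switchover cost itself.
         */
        migration_completion(s);
    }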

>, and then it might be ok to work with the new cap
> introduced so that when total downtime is over the limit it will abort the
> migration (but without a new "max downtime" definition which might be
> confusing).
> 

Yes, improving 'downtime-limit' with the sub-parameter above for RAM/state
downtime; this migration capability then becomes a boolean instead of a value,
more like a 'downtime-limit-strict', where going above downtime-limit is
enforced/aborted just like these patches do (rough sketch after the quote
below). Essentially to prevent this:

> IOW, I wonder whether there can
> be case where user specifies these parameters, migration simply keeps
> switching over and keep aborting, requiring a retry.  That's pretty
> unwanted.  We may simply doesn't allow that switchover to happen at all.
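
Rough sketch of what I mean, reusing the check these patches add but keyed to
downtime-limit itself; migrate_downtime_limit_strict() is a made-up capability
helper, and mis->downtime_limit assumes the source's downtime-limit is carried
over to the destination much like abort_limit is in this series:

    if (migrate_downtime_limit_strict() && type == QEMU_VM_SECTION_END &&
        mis->downtime_start) {
        uint64_t dst_downtime = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) -
                                mis->downtime_start;
        /* abort only when the *total* blackout exceeds the one downtime-limit */
        if (mis->src_downtime + dst_downtime >= mis->downtime_limit) {
            return -EINVAL;
        }
    }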

(...)

> Said that, I also wonder whether you have thought about improving
> downtime_limit itself.  It'll be definitely better if we simply don't
> switchover at all if that won't make it.  

The device-state multifd scaling is one take on improving the switchover phase,
and we will keep improving it whenever we find things... but the switchover
itself can't be 'precomputed' into a downtime equation ahead of time that
encompasses all possible latencies/costs. That's part of the reason we couldn't
think of a way besides this proposal here, which at its core is meant to
bounds-check the switchover. Even without taking VFs/HW[0] into account, how
long the switchover might take is simply not accounted for today, and having
some sort of downtime buffer coupled with enforcement helps avoid violating
migration SLAs.

[0]
https://lore.kernel.org/qemu-devel/20230317081904.24389-1-xuchuangxc...@bytedance.com/
