Re: [PATCH RFC 2/2] migration: abort on destination if switchover limit exceeded

Peter Xu Tue, 25 Jun 2024 17:25:21 -0700

On Tue, Jun 25, 2024 at 05:31:19PM +0100, Joao Martins wrote:
> The device-state multifd scaling is a take on improving switchover phase,
> and we will keep improving it whenever we find things... but the


That'll be helpful, thanks.  Just a quick note that "reducing downtime" is
a separate issue comparing to "make downtime_limit accurate".

> switchover itself can't be 'precomputed' into a downtime number equation
> ahead of time to encompass all possible latencies/costs. Part of the
> reason that at least we couldn't think of a way besides this proposal
> here, which at the core it's meant to bounds check switchover. Even
> without taking into account VFs/HW[0], it is simply not considered how
> long it might take and giving some sort of downtime buffer coupled with
> enforcement that can be enforced helps not violating migration SLAs.

I agree such enforcement alone can be useful in general to be able to
fallback.  Said that, I think it would definitely be nice to attach more
information on the downtime analysis when reposting this series, if there
is any.

For example, irrelevant of whether QEMU can do proper predictions at all,
there can be data / results to show what is the major parts that are
missing besides the current calculations, aka an expectation on when the
fallback can trigger, and some justification on why they can't be
predicted.

IMHO the enforcement won't make much sense if it keeps triggering, in that
case people will simply not use it as it stops migrations from happening.
Ultimately the work will still be needed to make downtime_limit accurate.
The fallback should only be an last fence to guard the promise which should
be the "corner cases".

-- 
Peter Xu

Re: [PATCH RFC 2/2] migration: abort on destination if switchover limit exceeded

Reply via email to