On 25/06/2024 15:53, Peter Xu wrote:
> On Tue, Jun 25, 2024 at 12:38:50PM +0100, Joao Martins wrote:
>> On 24/06/2024 20:41, Peter Xu wrote:
>>> On Fri, Jun 21, 2024 at 07:32:21AM -0700, Elena Ufimtseva wrote:
>>>> @@ -2659,6 +2698,18 @@ qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis,
>>>>      if (!check_section_footer(f, se)) {
>>>>          return -EINVAL;
>>>> @@ -2714,6 +2765,19 @@ qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis,
>>>>                            se->instance_id, end_ts - start_ts);
>>>>      }
>>>>
>>>> +    if (migrate_switchover_abort() && type == QEMU_VM_SECTION_END &&
>>>> +        mis->downtime_start) {
>>>> +        mis->downtime_now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>>> +        uint64_t dst_downtime = mis->downtime_now - mis->downtime_start;
>>>> +        if (mis->src_downtime + dst_downtime >= mis->abort_limit) {
>>>> +            error_report("Shutdown destination migration, migration abort_limit (%lu ms)"
>>>> +                         "was reached.", mis->abort_limit);
>>>> +            trace_qemu_loadvm_downtime_abort(mis->abort_limit, dst_downtime,
>>>> +                                             mis->src_downtime);
>>>> +            return -EINVAL;
>>>> +        }
>>>> +    }
>>>
>>> So this traps both the iteration and non-iteration phase. What if the
>>> downtime was caused after these, or unluckily at the last non-iterator
>>> device state?
>>>
>>> After trying to think about it, I figured this is not easy to do right.
>>> Also, I start to doubt whether it's definitely a good idea to have this
>>> in the first place.
>>>
>>> Firstly, I'm wondering how we should treat this new feature
>>> v.s. downtime_limit.  It's about how the user should set both.
>>>
>>> If this is about "cancel migration if downtime more than xxx",
>>> then.. AFAICT that's exactly what downtime_limit was "designed" to be..
>>> It's just that downtime_limit says it the other way round, as: "don't
>>> switchover if the downtime will be more than xxx".
>>>
>>> Then, if the user has the option to set both these parameters, what
>>> would be the right thing to do?  Shouldn't they simply always set both
>>> parameters to the same value already?  But if so, what's the point of
>>> having two?
>>>
>>> This caused me to think whether the 2nd parameter is needed at all, and
>>> instead whether we should simply make downtime_limit more accurate, so
>>> that it will start to account for more things than before.  It won't be
>>> accurate, but it's the same case here: making this new feature
>>> "accurate" can also be very hard.
>>>
>>
>> The way I think about it is that the current downtime-limit captures
>> nicely the data part, as the only calculation it cares about is how much
>> outstanding data it sends to the destination (be it VF device state or
>> RAM).  This second parameter captures what is *not* related to data,
>> i.e. the costs of the hypervisor quiescing the VM or added latencies in
>> the hypervisor *in addition* to sending outstanding data to the
>> destination.
>>
>> If we were to merge this all into a single parameter (aka downtime-limit)
>> we are possibly artificially increasing the downtime by relaxing the
>> outstanding-data part, as opposed to trying to capture the switchover
>> cost (hence the name of the parameter) that can't be reflected in the
>> former equation.
>>
>> So I feel like this needs two parameters, whereby this second new
>> parameter covers the 'switchover cost' (hence the name) with the current
>> migration algorithm.
>
> Then the question is how we should suggest the user to specify these two
> parameters.
>
> The cover letter used:
>
> migrate_set_parameter downtime-limit 300
> migrate_set_parameter switchover-limit 10
>
> But it does look like a testing sample only and not valid in a
> real-world setup, as logically the new limit should be larger than the
> old one, afaict.  If the new limit is larger, how much larger should it
> be?
>
> So far I can understand how it works internally, but even with that I
> don't know how I should set this parameter, e.g., when downtime_limit
> used to be 300ms, I'd say 500ms could be a suitable value, but is it?
> In that case, perhaps the 500ms is the "real" limit in that I don't want
> a VM to be halted for more than 500ms, but in that case the user should
> have already set downtime_limit to 500ms.
>
Yeap -- I think you're right that it presents a UX confusion on what a
user should set.

> I also don't know how a user should understand all these details and
> figure out how to set these.  Note that currently we definitely allow
> the user to specify downtime_limit and it's so far understandable, even
> though the user may assume that it covers the whole blackout phase
> (that's also what we wish it could achieve..), rather than the fact that
> it only takes the ram/vfio/... reported remaining data into account
> v.s. the bandwidth.
>
> Maybe we could consider introducing the parameters/caps in some other
> form, so that we can keep the downtime_limit to be "the total downtime",
> instead of a new definition of downtime.  E.g., it's at least not fair
> indeed to take the whole "downtime_limit" for data transfers, so maybe
> some ratio parameter can be introduced to say how much portion of that
> downtime can be used for data transfer

I like this idea -- it also fixes another problem, namely that
downtime-limit (solely) makes the algorithm too optimistic, just
utilizing the whole bandwidth (e.g. migrating right after the first dirty
sync, depending on downtime-limit); e.g. "iterable-downtime" (or
"precopy-downtime", for lack of a better idea) for such a property.  Then
when we set downtime-limit it can represent the whole thing including
blackout, as is the case today, and we configure it by minimizing
iterable-downtime to give enough headroom for the switchover.

> , and then it might be ok to work with the new cap introduced so that
> when the total downtime is over the limit it will abort the migration
> (but without a new "max downtime" definition which might be confusing).
>

Yes, with 'downtime-limit' improved by a sub-parameter like the above for
RAM/state downtime, this migration capability then becomes a boolean
instead of a value -- more of a 'downtime-limit-strict' -- where going
above downtime-limit is enforced/aborted just like these patches do.  To
prevent essentially this:

> IOW, I wonder whether there can be a case where the user specifies these
> parameters, and migration simply keeps switching over and keeps
> aborting, requiring a retry.  That's pretty unwanted.  We may simply not
> allow that switchover to happen at all.

(...)

> That said, I also wonder whether you have thought about improving
> downtime_limit itself.  It'll definitely be better if we simply don't
> switchover at all if it won't make it.

The device-state multifd scaling is one take on improving the switchover
phase, and we will keep improving it whenever we find things... but the
switchover itself can't be 'precomputed' into a downtime equation ahead
of time that encompasses all possible latencies/costs.  That's part of
the reason we couldn't think of a way besides this proposal here, which
at its core is meant to bounds-check the switchover.  Even without taking
VFs/HW into account [0], how long the switchover might take is simply not
considered today, and having some sort of downtime buffer coupled with
enforcement helps us not violate migration SLAs.

[0] https://lore.kernel.org/qemu-devel/20230317081904.24389-1-xuchuangxc...@bytedance.com/
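
To make the split above a bit more concrete, here is a rough standalone
sketch (plain C, deliberately *not* QEMU code: the "iterable-downtime"
name, the helper functions and the numbers are all made up for
illustration) of how a data-only budget on the source side plus strict
total-downtime enforcement on the destination side could fit together:

/*
 * Standalone sketch of the idea discussed above: downtime-limit stays the
 * total budget the guest may be paused for, while a hypothetical
 * "iterable-downtime" sub-parameter caps only the portion spent
 * transferring the remaining RAM/device state, leaving the rest as
 * headroom for switchover costs.  All names and numbers are illustrative.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Source side: only switch over when the estimated transfer time fits the
 * iterable (data) share of the budget, not the whole downtime-limit. */
static bool should_switchover(uint64_t pending_bytes, uint64_t bytes_per_ms,
                              uint64_t iterable_downtime_ms)
{
    uint64_t est_ms = bytes_per_ms ? pending_bytes / bytes_per_ms
                                   : UINT64_MAX;
    return est_ms <= iterable_downtime_ms;
}

/* Destination side: strict enforcement in the spirit of the patch quoted
 * above -- give up if the source-side downtime plus what has already
 * elapsed here exceeds the total downtime-limit. */
static bool switchover_over_budget(uint64_t src_downtime_ms,
                                   uint64_t dst_elapsed_ms,
                                   uint64_t downtime_limit_ms)
{
    return src_downtime_ms + dst_elapsed_ms >= downtime_limit_ms;
}

int main(void)
{
    uint64_t downtime_limit_ms = 300;    /* total budget, as today */
    uint64_t iterable_downtime_ms = 200; /* data share; 100ms headroom */

    /* 50 MB left at ~1 MB/ms: ~50ms of transfer, fits the 200ms share. */
    printf("switchover now? %d\n",
           should_switchover(50ULL << 20, 1ULL << 20, iterable_downtime_ms));

    /* 180ms spent on the source + 150ms on the destination blows the
     * 300ms total, so a strict mode would abort instead of completing. */
    printf("abort? %d\n",
           switchover_over_budget(180, 150, downtime_limit_ms));
    return 0;
}

The point being that if iterable-downtime is sized with enough headroom
below downtime-limit, the destination-side abort should only trigger when
the switchover cost itself blows the budget, not because the source was
too optimistic about the data transfer.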