On 2025-09-30 16:04, Peter Xu wrote:
> On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> > On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > > > From: Juraj Marcin <[email protected]>
> > > > >
> > > > > Currently, when postcopy starts, the source VM starts switchover and
> > > > > sends a package containing the state of all non-postcopiable devices.
> > > > > When the destination loads this package, the switchover is complete and
> > > > > the destination VM starts. However, if the device state load fails or
> > > > > the destination side crashes, the source side is already in
> > > > > POSTCOPY_ACTIVE state and cannot be recovered, even when it has the most
> > > > > up-to-date machine state as the destination has not yet started.
> > > > >
> > > > > This patch introduces a new POSTCOPY_DEVICE state which is active
> > > > > while the destination machine is loading the device state, is not yet
> > > > > running, and the source side can be resumed in case of a migration
> > > > > failure.
> > > > >
> > > > > To transition from POSTCOPY_DEVICE to POSTCOPY_ACTIVE, the source
> > > > > side uses a PONG message that is a response to a PING message processed
> > > > > just before the POSTCOPY_RUN command that starts the destination VM.
> > > > > Thus, this change does not require any changes on the destination side
> > > > > and is effective even with older destination versions.
> > > >
> > > > Thanks, this will help libvirt as we think that the migration can be
> > > > safely aborted unless we successfully called "cont" and thus we just
> > > > kill QEMU on the destination. But since QEMU on the source already
> > > > entered postcopy-active, we can't cancel the migration and the result is
> > > > a paused VM with no way of recovering it.
> > > >
> > > > This series will make the situation better as the source will stay in
> > > > postcopy-device until the destination successfully loads device data.
> > > > There's still room for some enhancement though. Depending on how fast
> > > > this loading is libvirt may issue cont before device data is loaded (the
> > > > destination is already in postcopy-active at this point), which always
> > > > succeeds as it only marks the domain to be autostarted, but the actual
> > > > start may fail later. When discussing this with Juraj we agreed on
> > > > introducing the new postcopy-device state on the destination as well to
> > >
> > > I used to think and define postcopy-active be the state we should never be
> > > able to cancel it anymore, implying that the real postcopy process is in
> > > progress, and also implying the state where we need to start assume the
> > > latest VM pages are spread on both sides, not one anymore. Cancellation or
> > > killing either side means crashing VM then.
> >
> > Right, although it's unfortunately not the case now as the source is in
> > postcopy-active even though the complete state is still on the source.
> >
> > > It could be a good thing indeed to have postcopy-device on dest too from
> > > that regard, because having postcopy-device on dest can mark out the small
> > > initial window when dest qemu hasn't yet start to generate new data but
> > > only applying old data (device data, which src also owns a copy). From
> > > that POV, that indeed does not belong to the point if we define
> > > postcopy-active as above.
> > >
> > > IOW, also with such definition, setting postcopy-active on dest QEMU right
> > > at the entry of ram load thread (what we do right now..) is too early.
> > >
> > > > make sure libvirt will only call cont once device data was successfully
> > > > loaded so that we always get a proper result when running cont. But it
> > >
> > > Do we know an estimate of how much extra downtime this would introduce?
> > >
> > > We should have discussed this in a previous thread, the issue is if we cont
> > > only after device loaded, then dest QEMU may need to wait a while until it
> > > receives the cont from libvirt, that will contribute to the downtime. It
> > > would best be avoided, or if it's negligible then it's fine too but I'm not
> > > sure whether it's guaranteed to be negligible..
> >
> > We start QEMU with -S so it always needs to wait for cont from libvirt.
> > We wait for postcopy-active on the destination before sending cont. So
> > currently it can arrive while QEMU is still loading device state or when
> > this is already done. I was just suggesting to always wait for the
> > device state to be loaded before sending cont. So in some cases it would
> > arrive a bit later while in other cases nothing would change. It's just
> > a matter of waking up a thread waiting for postcopy-active and sending
> > the command back to QEMU. There's no communication with the other host
> > at this point so I'd expect the difference to be negligible. And as I
> > said depending on how fast device state loading vs transferring
> > migration control from libvirt on the source to the destination we may
> > already be sending cont when QEMU is done.
>
> Ah OK, I think this is not a major concern, until it is justified to.
>
> > But anyway, this would only be helpful if there's a way to actually
> > cancel migration on the source when cont fails.
> >
> > > If the goal is to make sure libvirt knows what is happening, can it still
> > > relies on the event emitted, in this case, RESUME? We can also reorg how
> > > postcopy-device and postcopy-active states will be reported on dest, then
> > > they'll help if RESUME is too coarse grained.
> >
> > The main goal is to make sure we don't end up with vCPUs paused on both
> > sides during a postcopy migration that can't be recovered nor canceled
> > thus effectively crashing the VM.
>
> Right, I assume that's what Juraj's series is trying to fix. After this
> series lands, I don't see why it would happen. But indeed I'm still
> expecting the block drive (including their locks) to behave all fine.
My POSTCOPY_DEVICE series resolves failures during the device state load,
that is, failures up to the MIG_CMD_PING placed just before the
MIG_CMD_POSTCOPY_RUN command. However, if the destination fails while
resuming (POSTCOPY_RUN), the source machine cannot be recovered even with
this series, because by then it has already received the PONG and switched
to postcopy-active. It is also not possible to move the PING after
POSTCOPY_RUN without changing how the destination interprets the migration
stream, and the source would also need to know whether the destination
supports such a feature. (A small sketch of this ordering is at the bottom
of this mail.)

> > >
> > > So far, dest QEMU will try to resume the VM after getting RUN command, that
> > > is what loadvm_postcopy_handle_run_bh() does, and it will (when autostart=1
> > > set): (1) firstly try to activate all block devices, iff it succeeded, (2)
> > > do vm_start(), at the end of which RESUME event will be generated. So
> > > RESUME currently implies both disk activation success, and vm start worked.
> > >
> > > > may still fail when locking disks fails (not sure if this is the only
> > > > way cont may fail). In this case we cannot cancel the migration on the
> > >
> > > Is there any known issue with locking disks that dest would fail? This
> > > really sound like we should have the admin taking a look.
> >
> > Oh definitely, it would be some kind of an storage access issue on the
> > destination. But we'd like to give the admin an option to actually do
> > anything else than just killing the VM :-) Either by automatically
> > canceling the migration or allowing recovery once storage issues are
> > solved.
>
> The problem is, if the storage locking stopped working properly, then how
> to guarantee the shared storage itself is working properly?
>
> When I was replying previously, I was expecting the admin taking a look to
> fix the storage, I didn't expect the VM can still be recovered anymore if
> there's no confidence that the block devices will work all fine. The
> locking errors to me may imply a block corruption already, or should I not
> see it like that?
>
> Fundamentally, "crashing the VM" doesn't lose anything from the block POV
> because it's always persistent when synced. It's almost only about RAM
> that is getting lost, alongside it's about task status, service
> availability, and the part of storage that was not flushed to backends.
>
> Do we really want to add anything more complex when shared storage has
> locking issues? Maybe there's known issues on locking that we're 100% sure
> the storage is fine, but only the locking went wrong?
>
> IIUC, the hope is after this series lands we should close the gap for
> almost all the rest paths that may cause both sides to HALT for a postcopy,
> except for a storage issue with lockings. But I'm not sure if I missed
> something.
>
> Thanks,
>
> --
> Peter Xu
>
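
To make the remaining window explicit, here is a small, self-contained
sketch of the ordering described above. It is only an illustrative model,
not QEMU code; the variable and enum names are made up for this example:

    /* Illustrative model of the switchover ordering, not QEMU code. */
    #include <stdbool.h>
    #include <stdio.h>

    enum src_state { SRC_POSTCOPY_DEVICE, SRC_POSTCOPY_ACTIVE };

    static const char *state_name(enum src_state s)
    {
        return s == SRC_POSTCOPY_DEVICE ? "postcopy-device" : "postcopy-active";
    }

    int main(void)
    {
        /* Source state after sending the switchover package, with this series. */
        enum src_state src = SRC_POSTCOPY_DEVICE;

        bool device_load_ok = true;   /* dest loads non-postcopiable device state */
        bool run_ok = false;          /* e.g. disk activation fails while handling
                                       * POSTCOPY_RUN (loadvm_postcopy_handle_run_bh) */

        if (!device_load_ok) {
            /* Failure before MIG_CMD_PING is processed: no PONG is sent, the
             * source stays in postcopy-device and can be resumed. This is the
             * case the series handles. */
            printf("source state: %s (recoverable)\n", state_name(src));
            return 0;
        }

        /* The destination processes MIG_CMD_PING right after the device state,
         * so the PONG reaches the source before MIG_CMD_POSTCOPY_RUN is handled. */
        src = SRC_POSTCOPY_ACTIVE;

        if (!run_ok) {
            /* A failure while starting the VM therefore happens after the PONG:
             * the source is already in postcopy-active and cannot be resumed,
             * which is the remaining gap discussed above. */
            printf("source state: %s (not recoverable)\n", state_name(src));
        }
        return 0;
    }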
