On 2025-09-30 16:04, Peter Xu wrote:
> On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> > On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > > > From: Juraj Marcin <[email protected]>
> > > > > 
> > > > > Currently, when postcopy starts, the source VM starts switchover and
> > > > > sends a package containing the state of all non-postcopiable devices.
> > > > > When the destination loads this package, the switchover is
> > > > > complete and the destination VM starts. However, if the device
> > > > > state load fails or the destination side crashes, the source side
> > > > > is already in POSTCOPY_ACTIVE state and cannot be recovered, even
> > > > > when it has the most up-to-date machine state as the destination
> > > > > has not yet started.
> > > > > 
> > > > > This patch introduces a new POSTCOPY_DEVICE state which is active
> > > > > while the destination machine is loading the device state, is not yet
> > > > > running, and the source side can be resumed in case of a migration
> > > > > failure.
> > > > > 
> > > > > To transition from POSTCOPY_DEVICE to POSTCOPY_ACTIVE, the source
> > > > > side uses a PONG message that is a response to a PING message
> > > > > processed just before the POSTCOPY_RUN command that starts the
> > > > > destination VM. Thus, this change does not require any changes on
> > > > > the destination side and is effective even with older destination
> > > > > versions.
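
As a side note, the source-side transition described here could look
roughly like the sketch below. This is only an illustration: the
POSTCOPY_DEVICE state name matches the series, but the handler shape and
the ping-value check are placeholders, not the actual patch.

    /* Sketch only: the source's return path receives the PONG for the
     * PING that was placed just before POSTCOPY_RUN in the package. */
    static void source_handle_rp_pong(MigrationState *s, uint32_t value)
    {
        if (value == PRE_RUN_PING_VALUE /* placeholder */ &&
            s->state == MIGRATION_STATUS_POSTCOPY_DEVICE) {
            /*
             * The destination has loaded the device state and is about
             * to start running, so the latest VM state may now be
             * spread across both sides and the source can no longer be
             * safely resumed.
             */
            migrate_set_state(&s->state,
                              MIGRATION_STATUS_POSTCOPY_DEVICE,
                              MIGRATION_STATUS_POSTCOPY_ACTIVE);
        }
    }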
> > > > 
> > > > Thanks, this will help libvirt as we assume the migration can be
> > > > safely aborted until we have successfully called "cont", and thus we
> > > > just kill QEMU on the destination. But since QEMU on the source
> > > > already entered postcopy-active, we can't cancel the migration and
> > > > the result is a paused VM with no way of recovering it.
> > > > 
> > > > This series will make the situation better as the source will stay in
> > > > postcopy-device until the destination successfully loads device data.
> > > > There's still room for some enhancement though. Depending on how fast
> > > > this loading is, libvirt may issue cont before device data is loaded
> > > > (the destination is already in postcopy-active at this point); this
> > > > always succeeds as it only marks the domain to be autostarted, but the
> > > > actual start may fail later. When discussing this with Juraj we agreed
> > > > on introducing the new postcopy-device state on the destination as
> > > > well to
> > > 
> > > I used to define postcopy-active as the state we should never be able
> > > to cancel anymore, implying that the real postcopy process is in
> > > progress, and also implying that we need to start assuming the latest
> > > VM pages are spread across both sides, not just one. Cancelling or
> > > killing either side then means crashing the VM.
> > 
> > Right, although it's unfortunately not the case now as the source is in
> > postcopy-active even though the complete state is still on the source.
> > 
> > > It could be a good thing indeed to have postcopy-device on dest too in
> > > that regard, because postcopy-device on dest can mark out the small
> > > initial window when dest qemu hasn't yet started to generate new data
> > > and is only applying old data (device data, of which src also owns a
> > > copy). From that POV, that window indeed does not belong to
> > > postcopy-active if we define it as above.
> > > 
> > > IOW, also with such a definition, setting postcopy-active on dest QEMU
> > > right at the entry of the ram load thread (what we do right now..) is
> > > too early.
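
For reference, the destination currently flips to postcopy-active right
at the top of the listen thread, roughly like this (simplified from
migration/savevm.c, from memory):

    /* Simplified: the incoming side is marked postcopy-active as soon
     * as the listen thread starts, i.e. before the device state has
     * finished loading and before the VM runs. */
    static void *postcopy_ram_listen_thread(void *opaque)
    {
        MigrationIncomingState *mis = migration_incoming_get_current();

        migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
                          MIGRATION_STATUS_POSTCOPY_ACTIVE);
        /* ... keep loading the incoming stream until migration ends ... */
    }

Reporting postcopy-device here first, and postcopy-active only once the
device state has been loaded, would match the definition above.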
> > > 
> > > > make sure libvirt will only call cont once device data was successfully
> > > > loaded so that we always get a proper result when running cont. But it
> > > 
> > > Do we know an estimate of how much extra downtime this would introduce?
> > > 
> > > We discussed this in a previous thread, I believe; the issue is that
> > > if we cont only after the device state is loaded, then dest QEMU may
> > > need to wait a while until it receives the cont from libvirt, and that
> > > wait will contribute to the downtime. It would best be avoided, or if
> > > it's negligible then it's fine too, but I'm not sure it's guaranteed
> > > to be negligible.
> > 
> > We start QEMU with -S so it always needs to wait for cont from libvirt.
> > We wait for postcopy-active on the destination before sending cont. So
> > currently it can arrive while QEMU is still loading device state or when
> > this is already done. I was just suggesting to always wait for the
> > device state to be loaded before sending cont. So in some cases it would
> > arrive a bit later while in other cases nothing would change. It's just
> > a matter of waking up a thread waiting for postcopy-active and sending
> > the command back to QEMU. There's no communication with the other host
> > at this point, so I'd expect the difference to be negligible. And as I
> > said, depending on how fast device state loading is compared to
> > transferring migration control from libvirt on the source to the
> > destination, we may already be sending cont by the time QEMU is done.
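
To make the ordering concrete, the handshake today is roughly the
following (simplified):

    1. dest QEMU is started with -S and waits paused
    2. migration runs; dest reports postcopy-active (currently as soon
       as the listen thread starts)
    3. libvirt on the dest sees postcopy-active and issues cont
    4. dest QEMU activates the disks, starts the vCPUs, emits RESUME

The suggestion only moves the condition between steps 2 and 3 from
"entered postcopy-active" to "device state fully loaded"; no extra
cross-host round trip is involved.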
> 
> Ah OK, I think this is not a major concern, until it is shown to be one.
> 
> > 
> > But anyway, this would only be helpful if there's a way to actually
> > cancel migration on the source when cont fails.
> > 
> > > If the goal is to make sure libvirt knows what is happening, can it
> > > still rely on the events emitted, in this case RESUME? We can also
> > > reorg how postcopy-device and postcopy-active states will be reported
> > > on dest, then they'll help if RESUME is too coarse-grained.
> > 
> > The main goal is to make sure we don't end up with vCPUs paused on both
> > sides during a postcopy migration that can't be recovered nor canceled
> > thus effectively crashing the VM.
> 
> Right, I assume that's what Juraj's series is trying to fix.  After this
> series lands, I don't see why it would happen.  But indeed I'm still
> expecting the block drives (including their locks) to behave all fine.

My POSTCOPY_DEVICE series resolves failures during the device state
load, that is, everything up to the MIG_CMD_PING placed just before the
MIG_CMD_POSTCOPY_RUN command. However, if the destination fails during
resume (POSTCOPY_RUN), the source machine cannot be recovered even with
this series, as it has already received the PONG and switched to
postcopy-active.

It is also not possible to move the PING after POSTCOPY_RUN without
changing how the destination interprets the migration stream, and the
source would also need to know whether the destination supports the
feature.
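
For context, this ordering comes from how the source builds the package
in postcopy_start(), roughly (simplified from migration/migration.c,
from memory; the exact ping value may differ):

    /* Everything below goes into a single CMD_PACKAGED blob, which the
     * destination unpacks and processes strictly in order, so the PING
     * cannot simply be moved after RUN without the destination
     * interpreting the stream differently. */
    qemu_savevm_state_complete_precopy(fb, false, false); /* device state */
    qemu_savevm_send_ping(fb, 4);          /* PONG arrives on the return
                                            * path once the above loaded */
    qemu_savevm_send_postcopy_run(fb);     /* destination starts the VM */
    qemu_savevm_send_packaged(ms->to_dst_file, bioc->data, bioc->usage);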

> 
> > 
> > > So far, dest QEMU will try to resume the VM after getting the RUN
> > > command; that is what loadvm_postcopy_handle_run_bh() does, and it
> > > will (when autostart=1 is set): (1) first try to activate all block
> > > devices, and iff that succeeds, (2) do vm_start(), at the end of which
> > > the RESUME event is generated. So RESUME currently implies both disk
> > > activation success and a successful vm start.
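
Right, and from memory that is roughly the following (simplified; older
versions used bdrv_invalidate_cache_all() instead of
bdrv_activate_all()):

    /* Simplified sketch of loadvm_postcopy_handle_run_bh(). */
    bdrv_activate_all(&local_err);     /* (1) take locks, activate disks */
    if (local_err) {
        /* a storage/locking failure surfaces here: report it and do
         * not start the vCPUs */
        error_report_err(local_err);
        local_err = NULL;
        autostart = false;
    }

    if (autostart) {
        vm_start();                    /* (2) emits the RESUME event */
    } else {
        runstate_set(RUN_STATE_PAUSED);
    }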
> > > 
> > > > may still fail when locking disks fails (not sure if this is the only
> > > > way cont may fail). In this case we cannot cancel the migration on the
> > > 
> > > Is there any known issue with locking disks that would make dest
> > > fail? This really sounds like we should have the admin take a look.
> > 
> > Oh definitely, it would be some kind of storage access issue on the
> > destination. But we'd like to give the admin an option to do something
> > other than just killing the VM :-) Either by automatically canceling
> > the migration or by allowing recovery once the storage issues are
> > solved.
> 
> The problem is, if the storage locking stopped working properly, then how
> can we guarantee the shared storage itself is working properly?
> 
> When I was replying previously, I was expecting the admin to take a look
> and fix the storage; I didn't expect the VM could still be recovered if
> there's no confidence that the block devices will work fine. The locking
> errors to me may imply a block corruption already, or should I not see it
> like that?
> 
> Fundamentally, "crashing the VM" doesn't lose anything from the block POV
> because it's always persistent when synced. It's almost only RAM that
> gets lost, along with task status, service availability, and the parts of
> storage that were not flushed to the backends.
> 
> Do we really want to add anything more complex when shared storage has
> locking issues? Maybe there are known locking issues where we're 100%
> sure the storage is fine and only the locking went wrong?
> 
> IIUC, the hope is that after this series lands we should close the gap
> for almost all the remaining paths that may cause both sides to HALT
> during a postcopy, except for a storage issue with locking. But I'm not
> sure if I missed something.
> 
> Thanks,
> 
> -- 
> Peter Xu
> 

