[Sorry to respond late on the real meat of this series..]

On Thu, Aug 07, 2025 at 01:49:08PM +0200, Juraj Marcin wrote:
> When postcopy migration starts, the source side sends all
> non-postcopiable device data in one package command and immediately
> transitions to a "postcopy-active" state. However, if the destination
> side fails to load the device data or crashes during it, the source side
> stays paused indefinitely with no way to recover.
> 
> This series introduces a new "postcopy-setup" state, during which the
> destination side is guaranteed not to have been started yet, so the
> source side can recover and resume, and the destination side can exit
> gracefully.
> 
> The key element of this feature is isolating the postcopy-run command
> from the non-postcopiable data and sending it only after the destination
> side acknowledges that it has loaded all devices and is ready to be
> started. This is necessary because once the postcopy-run command is
> sent, the source side cannot be sure whether the destination is running,
> and therefore whether it can safely resume in case of a failure.
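A minimal sketch of the ordering described above, with hypothetical message
and function names (these are illustrative, not actual QEMU protocol
identifiers): the RUN command is withheld until the destination acknowledges
that device load succeeded.

```python
# Illustrative model of the proposed handshake: postcopy-run is only sent
# after the destination ACKs device load, so an absent ACK means the
# destination is guaranteed not to be running and the source may resume.
def source_switchover(send, recv):
    send("package: non-postcopiable device state")  # no RUN included here
    ack = recv()                                    # wait in postcopy-setup
    if ack == "devices-loaded":
        send("postcopy-run")                        # only now can dest start
        return "postcopy-active"
    # No ACK received: destination never got RUN, safe to resume on source.
    return "recovered-on-source"

sent = []
assert source_switchover(sent.append, lambda: "devices-loaded") == "postcopy-active"
assert sent == ["package: non-postcopiable device state", "postcopy-run"]

sent = []
assert source_switchover(sent.append, lambda: None) == "recovered-on-source"
assert "postcopy-run" not in sent
```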
> 
> Reusing the existing ping/pong messages was also considered (PING 3 is
> sent right before the postcopy-run command), but there are two reasons
> why the PING 3 message might not be delivered to the source side:
> 
> 1. the destination machine failed and is not running, so the source
>    side can resume,
> 2. there is a network failure, so PING 3 delivery fails, but until TCP
>    or another transport times out, the destination could process the
>    postcopy-run command and start, in which case the source side cannot
>    resume.
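These two cases can be modelled to show the problem concretely: in both, the
source observes exactly the same thing (no PING 3), yet the safe recovery
action differs. The scenario and field names below are illustrative, not
QEMU internals.

```python
# Hypothetical model of the two failure scenarios: the source's observable
# signal is identical, but whether the destination may be running is not.
def source_observation(scenario):
    """Return what the source side can tell in each failure scenario."""
    if scenario == "dest-crashed":
        # Destination died before processing RUN; resuming on src is safe.
        dest_may_run = False
    elif scenario == "network-failure":
        # PING 3 was lost in transit, but RUN may still be delivered and
        # processed before the transport times out.
        dest_may_run = True
    # In either case the source merely sees that no PING 3 arrived.
    return {"ping3_received": False, "dest_may_run": dest_may_run}

a = source_observation("dest-crashed")
b = source_observation("network-failure")
# Identical observation on the source side...
assert a["ping3_received"] == b["ping3_received"] == False
# ...but opposite answers to "may the destination be running?", which is
# why the absence of PING 3 alone cannot tell the source whether resuming
# is safe.
assert a["dest_may_run"] != b["dest_may_run"]
```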
> 
> Furthermore, this series contains two more patches required for the
> implementation of this feature, which make the listen thread joinable
> for graceful cleanup (and detach it explicitly otherwise), and one patch
> fixing state transitions inside postcopy_start().
> 
> Such a feature (or a similar one) could also be useful for normal
> (precopy-only) migration with a return-path, to prevent issues when a
> network failure happens just as the destination side shuts down the
> return-path. When I tested such a scenario (by filtering out the SHUT
> command), the destination started and reported a successful migration,
> while the source side reported a failed migration and tried to resume,
> but exited as it failed to acquire the disk image file lock.
> 
> Another suggestion from Peter, which I would like to discuss, is that
> instead of introducing a new state, we could move the boundary between
> the "device" and "postcopy-active" states to the point when the
> postcopy-run command is actually sent (in this series, the boundary
> between "postcopy-setup" and "postcopy-active"). However, I am not sure
> whether such a change would have any unwanted implications.

Yeah, after reading patch 4, I still want to check with you on whether this
is possible, using a simpler version of such a solution.

As we discussed offlist, it looks like there's no perfect solution for
synchronizing between src <-> dst on this matter.  No matter what the last
message to be sent is (precopy's RP_SHUT, the 3rd/4th PONG, or RUN_ACK), it
might get lost in a network failure.

IIUC it means we need to face the situation of split brain.  The link can
simply be broken at any time.  The ultimate result is still better if both
VMs are halted on split brain, but then IMHO we'll need to justify whether
that complexity would be worthwhile for changing "both sides active" ->
"both sides halted" when it happens.

If we go back to the original request of why we decided to work on this: it
was more or less a feature parity request on postcopy against precopy, so
that when device state loading fails during switchover, postcopy can also
properly get cancelled rather than hanging.  Precopy can do that; we wished
postcopy could do at least the same.

Could we still explore the simpler idea and better understand the gap
between the two?  E.g. relying on the 3rd/4th PONG returned from the dest
QEMU to be the ACK message.

Something like:

  - Start postcopy...

  - Send the postcopy wholesale package (which includes e.g. whole device
    states dump, PING-3, RUN), as before.

  - Instead of going directly to POSTCOPY_ACTIVE, we stay in DEVICE, but
    we start to allow iterations to resolve page faults while we keep
    moving pages.

  - If...

    - we received the 3rd PONG: we _assume_ the device states were loaded
      successfully and the RUN must have been processed, so src QEMU moves
      to POSTCOPY_ACTIVE.

    - we noticed a network failure before the 3rd PONG: we _assume_ the
      destination failed to load or crashed, so src QEMU fails the
      migration (DEVICE -> FAILED) and tries to restart the VM on src.

This might be a much smaller change, and it might not need any changes to
dest qemu or the stream protocol.

It means, if it works (even if imperfectly), it'll start to work for old
VMs too as long as they get migrated to the new QEMU, and we get this
postcopy parity feature asap instead of requiring the user to cold-restart
the VM with a newer machine type.

Would this be a better possible trade-off?

Thanks,

-- 
Peter Xu
