[Sorry to respond late on the real meat of this series..]

On Thu, Aug 07, 2025 at 01:49:08PM +0200, Juraj Marcin wrote:
> When postcopy migration starts, the source side sends all
> non-postcopiable device data in one package command and immediately
> transitions to a "postcopy-active" state. However, if the destination
> side fails to load the device data or crashes during it, the source
> side stays paused indefinitely with no way of recovery.
>
> This series introduces a new "postcopy-setup" state during which the
> destination side is guaranteed not to have been started yet, so the
> source side can recover and resume, and the destination side can exit
> gracefully.
>
> The key element of this feature is isolating the postcopy-run command
> from the non-postcopiable data and sending it only after the
> destination side acknowledges that it has loaded all devices and is
> ready to be started. This is necessary because once the postcopy-run
> command is sent, the source side cannot know whether the destination
> is running, and thus whether it can safely resume in case of a
> failure.
>
> Reusing the existing ping/pong messages was also considered (PING 3 is
> sent right before the postcopy-run command), but there are two reasons
> why the PING 3 message might not be delivered to the source side:
>
> 1. the destination machine failed and is not running, so the source
>    side can resume,
> 2. there is a network failure, so PING 3 delivery fails, but until TCP
>    or another transport times out, the destination could still process
>    the postcopy-run command and start, in which case the source side
>    cannot resume.
>
> Furthermore, this series contains two more patches required for the
> implementation of this feature, which make the listen thread joinable
> for graceful cleanup and detach it explicitly otherwise, and one patch
> fixing state transitions inside postcopy_start().
>
> Such a feature (or a similar one) could potentially also be useful for
> normal (precopy-only) migration with a return-path, to prevent issues
> when a network failure happens just as the destination side shuts down
> the return-path. When I tested such a scenario (by filtering out the
> SHUT command), the destination started and reported successful
> migration, while the source side reported failed migration and tried
> to resume, but exited as it failed to gain the disk image file lock.
>
> Another suggestion from Peter that I would like to discuss is that,
> instead of introducing a new state, we could move the boundary between
> the "device" and "postcopy-active" states to the point when the
> postcopy-run command is actually sent (in this series, the boundary
> between "postcopy-setup" and "postcopy-active"). However, I am not
> sure whether such a change would have any unwanted implications.
Yeah, after reading patch 4, I still want to check with you on whether
this is possible, on a simpler version of such a solution.

As we discussed offlist, it looks like there's no perfect solution for
synchronizing between src <-> dst on this matter. No matter what the
last message to be sent is (precopy's RP_SHUT, the 3rd/4th PONG, or
RUN_ACK), it might get lost in a network failure. IIUC, it means we
need to face the situation of split brain: the link can simply be
broken at any time. The ultimate result is still better when both VMs
are halted on split brain, but then IMHO we'll need to justify whether
that complexity is worthwhile for changing "both sides active" ->
"both sides halted" when it happens.

If we go back to the original request of why we decided to work on
this: it was more or less a feature-parity request for postcopy
against precopy, so that when device state loading fails during
switchover, postcopy can also be properly cancelled rather than
hanging. Precopy can do that; we wished postcopy could do at least the
same.

Could we still explore the simpler idea and understand better the gap
between the two? E.g. relying on the 3rd/4th PONG returned from the
dest QEMU to be the ACK message. Something like:

- Start postcopy...
- Send the postcopy wholesale package (which includes e.g. the whole
  device state dump, PING-3, RUN), as before.
- Instead of going directly to POSTCOPY_ACTIVE, we stay in DEVICE, but
  we start to allow iterations to resolve page faults while we keep
  moving pages.
- If...
  - we received the 3rd PONG: we _assume_ the device states were
    loaded successfully and the RUN must have been processed, so src
    QEMU moves to POSTCOPY_ACTIVE.
  - we noticed a network failure before the 3rd PONG: we _assume_ the
    destination failed to load or crashed, so src QEMU fails the
    migration (DEVICE -> FAILED) and tries to restart the VM on src.

This might be a much smaller change, and it might not need any change
on the dest QEMU side or in the stream protocol.
It means that, if it works (even if imperfectly), it'll start to work
for old VMs too as long as they got migrated to the new QEMU, and we
get this postcopy parity feature asap instead of requiring the user to
cold-restart the VM with a newer machine type.

Would this be a better possible trade-off?

Thanks,

-- 
Peter Xu