On Tue, 19 Aug 2025 11:31:03 +0100 Daniel P. Berrangé <berra...@redhat.com> wrote:
> On Mon, Aug 11, 2025 at 10:53:11AM -0300, Fabiano Rosas wrote:
> > Lukas Straub <lukasstra...@web.de> writes:
> >
> > > On Fri, 8 Aug 2025 11:37:23 -0400
> > > Peter Xu <pet...@redhat.com> wrote:
> > >> ...
> > >> migrate_cancel() should really be an OOB command..  It should be a
> > >> superset of yank features, plus anything migration specific besides
> > >> yanking the channels, for example, when migration thread is blocked
> > >> in PRE_SWITCHOVER.
> > >
> > > Hmm, I think the migration code should handle this properly even if the
> > > yank command is used. From the POV of migration, it sees that the
> > > connection broke with connection reset. That is the same error as if the
> > > other side crashes/is killed or a NAT/stateful firewall in between
> > > reboots.
> > >
> >
> > That should all work just fine. After yank or after a detectable network
> > failure. The issue here seems to be that the destination recv is hanging
> > indefinitely. I don't think we ever played with socket timeout
> > configurations, or even switching to non-blocking during the sync. This
> > is actually (AFAIK) the first time we get a hang that's not "just" a
> > synchronization issue in the migration code.
>
> Based on the stack trace, whether the socket is blocking or not isn't a
> problem - QEMU is stuck in a sem_wait call that will delay the coroutine,
> and thus the thread, indefinitely. IMHO the semaphore usage needs to be
> removed in favour of a synchronization mechanism that can integrate with
> event loop such that the coroutine does not block.
>

I don't think that is an issue. The semaphore is just there to sync
with the multifd threads, which are in turn blocking on recvmsg. Without
multifd the main thread would hang in recvmsg as well in this scenario.

Best Regards,
Lukas Straub
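
For illustration only, here is a rough standalone sketch of the dependency
described above. It is not QEMU code. It assumes Linux socket semantics, and
every name in it (demo, receiver, done) is made up for the example. A worker
thread blocks in recv() on a socket whose peer never sends, the main thread
waits on a semaphore that only the worker can post, and shutting the socket
down from outside (roughly what yank does to a channel, as far as I
understand it) lets recv() return and the semaphore get posted.

```python
import socket
import threading


def demo(wait_timeout=0.2):
    """Return (signalled_before_shutdown, signalled_after_shutdown, recv_sizes)."""
    # Stand-in for a migration channel; the peer end never sends anything.
    local, peer = socket.socketpair()
    done = threading.Semaphore(0)  # stand-in for the sync semaphore
    recv_sizes = []

    def receiver():
        # Blocks like a receive thread waiting for the next packet.
        data = local.recv(4096)
        recv_sizes.append(len(data))
        done.release()  # only reached once recv() has returned

    worker = threading.Thread(target=receiver, daemon=True)
    worker.start()

    # The waiter cannot make progress while the worker sits in recv().
    # A timeout is used here only so that the demo itself terminates.
    before = done.acquire(timeout=wait_timeout)

    # Shut the socket down from another thread: recv() returns 0 bytes.
    local.shutdown(socket.SHUT_RDWR)
    after = done.acquire(timeout=5)

    worker.join(timeout=5)
    local.close()
    peer.close()
    return before, after, recv_sizes


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is only that the semaphore wait ends as soon as the
blocked receive ends, so in this toy model the thing to unblock is the
receive, not the semaphore.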