On Tue, Aug 19, 2025 at 01:07:28PM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 19, 2025 at 02:03:26PM +0200, Lukas Straub wrote:
> > On Tue, 19 Aug 2025 11:31:03 +0100
> > Daniel P. Berrangé <berra...@redhat.com> wrote:
> > 
> > > On Mon, Aug 11, 2025 at 10:53:11AM -0300, Fabiano Rosas wrote:
> > > > Lukas Straub <lukasstra...@web.de> writes:
> > > >   
> > > > > On Fri, 8 Aug 2025 11:37:23 -0400
> > > > > Peter Xu <pet...@redhat.com> wrote:
> > > > >> ...
> > > > > >> migrate_cancel() should really be an OOB command..  It should be a superset
> > > > > >> of yank features, plus anything migration specific besides yanking the
> > > > > >> channels, for example, when migration thread is blocked in
> > > > > >> PRE_SWITCHOVER.
> > > > >
> > > > > > Hmm, I think the migration code should handle this properly even if the
> > > > > > yank command is used. From the POV of migration, it sees that the
> > > > > > connection broke with connection reset. That is the same error as if the
> > > > > > other side crashes/is killed or a NAT/stateful firewall in between
> > > > > > reboots.
> > > > >  
> > > > 
> > > > That should all work just fine. After yank or after a detectable network
> > > > failure. The issue here seems to be that the destination recv is hanging
> > > > indefinitely. I don't think we ever played with socket timeout
> > > > configurations, or even switching to non-blocking during the sync. This
> > > > is actually (AFAIK) the first time we get a hang that's not "just" a
> > > > synchronization issue in the migration code.  
> > > 
> > > Based on the stack trace, whether the socket is blocking or not isn't a
> > > > problem - QEMU is stuck in a sem_wait call that will delay the coroutine,
> > > and thus the thread, indefinitely. IMHO the semaphore usage needs to be
> > > removed in favour of a synchronization mechanism that can integrate with
> > > event loop such that the coroutine does not block.
> > > 
> > 
> > I don't think that is an issue. The semaphore is just there to sync
> > with the multifd threads, which are in turn blocking on recvmsg.
> > 
> > Without multifd the main thread would hang in recvmsg as well in this
> > scenario.
> 
> If it is using blocking I/O that would hang, but that's another thing
> that should not be done.  The QIOChannel code supports using non-blocking
> sockets in a blocking manner by yielding the coroutine.

The thing is, the multifd feature as a whole is built on a thread-based
model.  It doesn't have any coroutines to yield, AFAIU.

Instead, I do want to make the precopy load on dest QEMU also happen in a
separate thread instead of the main thread at some point.

I did try it once, but it isn't trivial.  Unlike savevm, loading the VM
makes quite a few assumptions that the BQL is held.  But maybe I should
keep trying until we find all such spots and see whether we can still move
it out at some point.

If that works some day, then multifd sync on the dest QEMU will by default
happen without the BQL.

Thanks,

-- 
Peter Xu

