On Thu, Sep 18, 2025 at 04:16:56PM +0200, Juraj Marcin wrote:
> If there is no outgoing traffic from the destination side (this can be
> caused for example by a workload with no page faults or paused machine),
> QEMU has no way of knowing if the connection is still working or not.
> The TCP stack doesn't treat no incoming traffic as a sign of a broken
> connection. Therefore, QEMU would stay in postcopy-active waiting for
> pages indefinitely.
>
> Also, libvirt might not be aware of a connection dropout between QEMUs,
> if libvirt's connection is intact, especially if libvirt daemons are
> communicating through some central entity that is managing the migration
> and not directly. And to do postcopy migration recovery, libvirt needs
> both sides to be in postcopy-paused state.
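To make the quoted TCP point concrete: with SO_KEEPALIVE left off (the
default for new sockets), an idle connection exchanges no segments at all,
so neither end ever notices a dead peer. Enabling it is a one-line
setsockopt on the connected socket; probe timing then comes from the
system-wide tunables unless overridden per socket. A minimal sketch in
plain POSIX terms (illustrative only, not QEMU's actual channel code):

  #include <sys/socket.h>

  /* Enable TCP keepalive on an already-connected socket fd.  Probe
   * timing is taken from the kernel-wide tunables
   * (net.ipv4.tcp_keepalive_* on Linux); once the probes are exhausted,
   * blocked reads fail with ETIMEDOUT instead of hanging forever. */
  static int enable_keepalive(int fd)
  {
      int on = 1;

      return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
  }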
Whether keepalive timeouts are at the QEMU level or the global kernel
level, there will always be situations where the timeouts are too long.
Apps/admins can have out-of-band liveness checks between hosts that
detect a problem before the keepalives trigger, and shouldn't have to
wait to recover the migration once they have resolved the underlying
network issue. There needs to be a way to initiate post-copy recovery
regardless of whether we've hit a keepalive timeout. Especially if we
can see one QEMU in postcopy-paused but not the other side, it doesn't
appear to make sense to block the recovery process.

The virDomainJobCancel command can do a migrate-cancel on the src, but
it didn't look like we could do the same on the dst. Unless I've
overlooked something, libvirt needs to gain a way to explicitly force
both sides into the postcopy-paused state, and thus be able to
immediately initiate recovery.

> Alternatively, there also might be an issue with the connection between
> libvirt daemons, but not the migration connection. Even if the libvirt
> connection fails, the migration is not paused, rather libvirt lets the
> migration finish normally. Similarly, if the libvirt connection is
> broken up due to, for example, libvirt daemon restart, the ongoing
> migration is not paused, but after the libvirt daemon starts again, it
> sees an ongoing migration and lets it finish.

While this is a reliability issue for libvirt, it doesn't have any
bearing on migration keepalive timeouts, as we're only concerned about
the QEMU connections here.

> Additionally, libvirt uses its own internal keep-alive packets with much
> more aggressive timeouts, waiting 5 - 10 seconds idle before sending a
> keep-alive packet and then killing the connection if there is no
> response in 30 seconds.

Yep, this keepalive is very aggressive and has frequently caused
problems with libvirt connections being torn down inappropriately. We
get away with that because most libvirt APIs don't need to have
persistent state over the duration of a connection. The migration APIs
are the one area where this isn't true, and the keepalives on libvirt
connections have resulted in us breaking otherwise still functional
migrations. IOW, I wouldn't point to libvirt as an illustration of
keepalives being free of significant downsides.

> I think, if we enable keep-alive in QEMU, but let the default timeouts
> be longer, for example idle time of 5 minutes and 15 retries in 1 minute
> intervals (which would mean, that connection would be considered broken
> after 20 minutes of unsuccessful communication attempts), that would be
> an acceptable solution.

I'm fine with turning on keepalives on the socket, but IMHO the
out-of-the-box behaviour should be to honour the kernel default
tunables unless the admin decides they want different behaviour. I'm
not seeing a rationale for why the kernel defaults should be forcibly
overridden in QEMU out of the box.

With regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
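As a footnote to the numbers being discussed: the proposed defaults map
onto the standard Linux per-socket options roughly as in the sketch
below (illustrative only, not the actual patch). Leaving the three
TCP_KEEP* overrides out is what honouring the kernel tunables means in
practice, since the socket then falls back to
net.ipv4.tcp_keepalive_time / tcp_keepalive_intvl / tcp_keepalive_probes
(7200 s / 75 s / 9 probes on a stock Linux kernel).

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  /* Proposed defaults from the quoted mail: 5 minutes idle, then up to
   * 15 probes at 1 minute intervals, i.e. 300 + 15 * 60 = 1200 s
   * (20 minutes) before the connection is declared dead. */
  static int set_proposed_keepalive(int fd)
  {
      int on = 1, idle = 300, intvl = 60, cnt = 15;

      if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0 ||
          setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
          setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0 ||
          setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0) {
          return -1;
      }
      return 0;
  }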