On Thu, Sep 18, 2025 at 04:16:56PM +0200, Juraj Marcin wrote:
> If there is no outgoing traffic from the destination side (this can be
> caused for example by a workload with no page faults or paused machine),
> QEMU has no way of knowing if the connection is still working or not.
> The TCP stack doesn't treat no incoming traffic as a sign of a broken
> connection. Therefore, QEMU would stay in postcopy-active waiting for
> pages indefinitely.
>
> Also, libvirt might not be aware of a connection dropout between QEMUs,
> if libvirt's connection is intact, especially if libvirt daemons are
> communicating through some central entity that is managing the migration
> and not directly. And to do postcopy migration recovery, libvirt needs
> both sides to be in postcopy-paused state.
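To make the quoted TCP point concrete: with SO_KEEPALIVE left off (the
default for new sockets), an idle connection exchanges no segments at all,
so neither end ever notices a dead peer. Enabling it is a one-line
setsockopt on the connected socket; probe timing then comes from the
system-wide tunables unless overridden per socket. A minimal sketch in
plain POSIX terms (illustrative only, not QEMU's actual channel code):

  #include <sys/socket.h>

  /* Enable TCP keepalive on an already-connected socket fd.  Probe
   * timing is taken from the kernel-wide tunables
   * (net.ipv4.tcp_keepalive_* on Linux); once the probes are exhausted,
   * blocked reads fail with ETIMEDOUT instead of hanging forever. */
  static int enable_keepalive(int fd)
  {
      int on = 1;

      return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
  }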
Whether keepalive timeouts are at the QEMU level or the global kernel
level, there will always be situations where the timeouts are too long.
Apps/admins can have out-of-band liveness checks between hosts that
detect a problem before the keepalives trigger, and shouldn't have to
wait to recover the migration once they have resolved the underlying
network issue. There needs to be a way to initiate post-copy recovery
regardless of whether we've hit a keepalive timeout. Especially if we
can see one QEMU in postcopy-paused but not the other side, it doesn't
appear to make sense to block the recovery process.

The virDomainJobCancel command can do a migrate-cancel on the src, but
it didn't look like we could do the same on the dst. Unless I've
overlooked something, libvirt needs to gain a way to explicitly force
both sides into the postcopy-paused state, and thus be able to
immediately initiate recovery.

> Alternatively, there also might be an issue with the connection between
> libvirt daemons, but not the migration connection. Even if the libvirt
> connection fails, the migration is not paused, rather libvirt lets the
> migration finish normally. Similarly, if the libvirt connection is
> broken up due to, for example, libvirt daemon restart, the ongoing
> migration is not paused, but after the libvirt daemon starts again, it
> sees an ongoing migration and lets it finish.

While this is a reliability issue for libvirt, it doesn't have any
bearing on migration keepalive timeouts, as we're only concerned about
the QEMU connections here.

> Additionally, libvirt uses its own internal keep-alive packets with much
> more aggressive timeouts, waiting 5 - 10 seconds idle before sending a
> keep-alive packet and then killing the connection if there is no
> response in 30 seconds.

Yep, this keepalive is very aggressive and has frequently caused
problems with libvirt connections being torn down inappropriately. We
get away with that because most libvirt APIs don't need to have
persistent state over the duration of a connection. The migration APIs
are the one area where this isn't true, and the keepalives on libvirt
connections have resulted in us breaking otherwise still functional
migrations. IOW, I wouldn't point to libvirt as an illustration of
keepalives being free of significant downsides.

> I think, if we enable keep-alive in QEMU, but let the default timeouts
> be longer, for example idle time of 5 minutes and 15 retries in 1 minute
> intervals (which would mean, that connection would be considered broken
> after 20 minutes of unsuccessful communication attempts), that would be
> an acceptable solution.

I'm fine with turning on keepalives on the socket, but IMHO the
out-of-the-box behaviour should be to honour the kernel default
tunables unless the admin decides they want different behaviour. I'm
not seeing a rationale for why the kernel defaults should be forcibly
overridden in QEMU out of the box.

With regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
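As a footnote to the numbers being discussed: the proposed defaults map
onto the standard Linux per-socket options roughly as in the sketch
below (illustrative only, not the actual patch). Leaving the three
TCP_KEEP* overrides out is what honouring the kernel tunables means in
practice, since the socket then falls back to
net.ipv4.tcp_keepalive_time / tcp_keepalive_intvl / tcp_keepalive_probes
(7200 s / 75 s / 9 probes on a stock Linux kernel).

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  /* Proposed defaults from the quoted mail: 5 minutes idle, then up to
   * 15 probes at 1 minute intervals, i.e. 300 + 15 * 60 = 1200 s
   * (20 minutes) before the connection is declared dead. */
  static int set_proposed_keepalive(int fd)
  {
      int on = 1, idle = 300, intvl = 60, cnt = 15;

      if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0 ||
          setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
          setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0 ||
          setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0) {
          return -1;
      }
      return 0;
  }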