On Thu, Sep 18, 2025 at 03:45:21PM +0100, Daniel P. Berrangé wrote:
> There needs to be a way to initiate post-copy recovery regardless
> of whether we've hit a keepalive timeout. Especially if we can
> see one QEMU in postcopy-paused, but not the other side, it
> doesn't appear to make sense to block the recovery process.
>
> The virDomainJobCancel command can do a migrate-cancel on the
> src, but it didn't look like we could do the same on the dst.
> Unless I've overlooked something, Libvirt needs to gain a way
> to explicitly force both sides into the postcopy-paused state,
> and thus be able to immediately initiate recovery.

Right, if libvirt can do that, then the problem would be solved as well.

> I'm fine with turning on keepalives on the socket, but IMHO the
> out of the box behaviour should be to honour the kernel default
> tunables unless the admin decides they want different behaviour.
> I'm not seeing a rational for why the kernel defaults should be
> forceably overridden in QEMU out of the box.

IMHO the rationale is that this socket is in a special state and serves
a special purpose. So we're not trying to change anything globally for
QEMU (without knowing what the socket is), only this specific type of
socket, which is used for precopy or postcopy live migration.

It's special because it's always safe to disconnect more aggressively,
and that may be preferable to very lengthy hangs (assuming libvirt
doesn't yet have a way to stop the hang), especially during the
postcopy phase.

Another option is to set up such a keepalive timeout only if a postcopy
process is expected (or even only once postcopy starts, but maybe that's
slight overkill). For precopy, a hang isn't a huge issue because
migrate-cancel is always present and functional.

Thanks,

-- 
Peter Xu
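
For illustration only, here is a rough sketch of what "a more aggressive
keepalive on this one socket" means at the socket-option level. It is
Python rather than QEMU code, and the function names and the 10/5/3
values are assumptions made up for the example, not anything proposed in
this thread. The TCP_KEEP* options are Linux-specific, so they are
guarded. The second helper estimates how long a dead peer goes unnoticed
on an otherwise idle connection.

```python
import socket


def enable_aggressive_keepalive(sock, idle=10, interval=5, count=3):
    """Enable TCP keepalive on one socket only (hypothetical helper).

    The idle/interval/count values are illustrative assumptions; they
    override the kernel-wide defaults for this socket and nothing else.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides; these constants are not defined on every
    # platform, so skip any that are missing.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


def detection_seconds(idle, interval, count):
    """Approximate time to notice a dead peer on an idle connection."""
    return idle + interval * count
```

With the usual Linux defaults (tcp_keepalive_time=7200,
tcp_keepalive_intvl=75, tcp_keepalive_probes=9), detection_seconds()
gives 7875 seconds, i.e. over two hours, versus 25 seconds for the
example values above. This only covers an idle connection; data stuck
in flight is governed by the retransmission timers instead.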