Hi Daniel and all,

On 2025-09-15 19:23, Daniel P. Berrangé wrote:
> On Fri, Sep 12, 2025 at 11:02:12AM -0400, Peter Xu wrote:
> > On Fri, Sep 12, 2025 at 11:20:01AM +0100, Daniel P. Berrangé wrote:
> > > On Wed, Sep 10, 2025 at 12:36:44PM -0400, Peter Xu wrote:
> > > > On Wed, Sep 10, 2025 at 08:10:57AM +0100, Daniel P. Berrangé wrote:
> > > > > On Tue, Sep 09, 2025 at 05:58:49PM -0400, Peter Xu wrote:
> > > > > > On Tue, Sep 09, 2025 at 04:09:23PM +0100, Daniel P. Berrangé wrote:
> > > > > > > On Tue, Sep 09, 2025 at 05:01:24PM +0200, Juraj Marcin wrote:
> > > > > > > > From: Juraj Marcin <jmar...@redhat.com>
> > > > > > > >
> > > > > > > > Usual system defaults for TCP keep-alive options are too long for migration workload. On Linux, a TCP connection waits idle for 2 hours before it starts checking if the connection is not broken.
> > > > > > > >
> > > > > > > > Now when InetSocketAddress supports keep-alive options [1], this patch applies migration specific defaults if they are not supplied by the user or the management software. With these defaults, a migration TCP stream waits idle for 1 minute and then sends 5 TCP keep-alive packets in 30 second interval before considering the connection as broken.
> > > > > > > >
> > > > > > > > System defaults can be still used by explicitly setting these parameters to 0.
> > > > > > >
> > > > > > > IMHO this is not a good idea. This is a very short default, which may be fine for the scenario where your network conn is permanently dead, but it is going to cause undesirable failures when the network conn is only temporarily dead.
> > > > > > >
> > > > > > > Optimizing defaults for temporary outages is much more preferrable as that maximises reliability of migration. In the case of permanent outages, it is already possible to tear down the connection without waiting for a keep-alive timeout, and liveliness checks can also be perform by the mgmt app at a higher level too. The TCP keepalives are just an eventual failsafe, and having those work on a long timeframe is OK.
> > > > > >
> > > > > > For precopy it looks fine indeed, because migrate_cancel should always work on src if src socket hanged, and even if dest QEMU socket hanged, it can simply be killed if src QEMU can be gracefully cancelled and rolled back to RUNNING, disregarding the socket status on dest QEMU.
> > > > > >
> > > > > > For postcopy, we could still use migrate_pause to enforce src shutdown(). Initially I thought we have no way of doing that for dest QEMU, but I just noticed two years ago I added that to dest QEMU for migrate_paused when working on commit f8c543e808f20b.. So looks like that part is covered too, so that if dest QEMU socket hanged we can also kick it out.
> > > > > >
> > > > > > I'm not 100% sure though, on whether shutdown() would always be able to successfully kick out the hanged socket while the keepalive is ticking. Is it guaranteed?
> > > > >
> > > > > I don't know about shutdown(), but close() certainly works. If shutdown() is not sufficient, then IMHO the migration code would need the ability to use close() to deal with this situation.
> > > > >
> > > > > > I also am not sure if that happens, whether libvirt would automatically do that, or provide some way so the user can trigger that. The goal IIUC here is we shouldn't put user into a situation where the migration hanged but without any way to either cancel or recover. With the default values Juraj provided here, it makes sure the hang won't happen more than a few minutes, which sounds like a sane timeout value.
> > > > >
> > > > > Sufficient migration QMP commands should exist to ensure migration can always be cancelled. Short keepalive timeouts should not be considered a solution to any gaps in that respect.
> > > > >
> > > > > Also there is yank, but IMHO apps shouldn't have to rely on yank - I see yank as a safety net for apps to workaround limitations in QEMU.
> > > >
> > > > The QMP facility looks to be all present, which is migrate-cancel and migrate-pause mentioned above.
> > > >
> > > > For migrate_cancel (of precopy), is that Ctrl-C of "virsh migrate"?
> > > >
> > > > Does libvirt exposes migrate_pause via any virsh command? IIUC that's the only official way of pausing a postcopy VM on either side. I also agree we shouldn't make yank the official tool to use.
> > >
> > > virsh will call virDomainAbortJob when Ctrl-C is done to a 'migrate' command.
> > >
> > > virDomainAbortJob will call migrate-cancel for pre-copy, or 'migrate-pause' for post-copy.
> >
> > Would it call "migrate-pause" on both sides?
>
> Not 100% sure, but with virDomainAbortJob I think libvirt only calls migrate-pause on the source host.
>
> > I believe the problem we hit was, when during postcopy and the NIC was misfunctioning, src fell into postcopy-paused successfully but dest didn't, stuck in postcopy-active.
>
> If something has interrupted src<->dst host comms for QEMU it may well impact libvirt <-> libvirt comms too, unless migration was being done over a separate NIC than the mgmt LAN. IOW, it may be impossible for libvirt to call migrate-pause on both sides, at least not until the NIC problem has been resolved.
>
> > We'll want to make sure both sides to be kicked into paused stage to recover. Otherwise dest can hang in the stage for hours until the watchdog timeout triggers.
>
> Once the network problem has been resolved, then it ought to be possible to get libvirt to issue 'migrate-pause' on both hosts, and thus be able to recover.
>
> Possibly the act of starting migration recovery in libvirt should attempt to issue 'migrate-pause' to cleanup the previously running migration if it is still in the stuck state.
>
> > > >
> > > > OTOH, the default timeouts work without changing libvirt, making sure the customers will not be stuck in a likely-failing network for hours without providing a way to properly detach and recover when it's wanted.
> > >
> > > "timeouts work" has the implicit assumpton that the only reason a timeout will fire is due to a unrecoverable situation. IMHO that assumption is not valid.
> >
> > I agree adjusting timeout is not the best.
> >
> > If we can have solid way to kick two sides out, I think indeed we don't need to change the timeout.
> >
> > If not, we may still need to provide a way to allow user to try continue when the user found that the network is behaving abnormal.
> >
> > Here adjusting timeout is slightly better than any adhoc socket timeout that we'll adjust: it's the migration timeout, and we only have two cases: (1) precopy, which is ok to fail and retried, (2) postcopy, which is also ok to fail and recovered.
>
> Fail & retry/recover is not without cost / risk though. Users can have successful migrations that are many hours long when dealing with big VMs. IOW, returning to the start of pre-copy could be a non-trivial time delay.
>
> Consider if the reason for the migration is to evacuate workloads off a host that is suffering technical problems. It could well be that periodic unexpected network outages are what is triggering the need to evacuate workloads. If we timeout a migration with keepalives they may never be able to get through a migration op quickly enough, or they can be delayed such that the host has a fatal error loosing the workload before the retried migration is complete.
>
> IMHO, once a migration has been started we should not proactively interrupt that with things like keepalives, unless the admin made a concious decision they wanted that behaviour enabled.
>
Maybe I should reiterate the original problem this patch is trying to solve. I have also talked to @jdenemar about how libvirt currently handles such things (but if I still missed something, please correct me).

If there is no outgoing traffic from the destination side (which can happen, for example, with a workload that produces no page faults, or with a paused machine), QEMU has no way of knowing whether the connection is still working. The TCP stack doesn't treat the absence of incoming traffic as a sign of a broken connection, so QEMU would stay in postcopy-active waiting for pages indefinitely.

Also, libvirt might not be aware of a connection dropout between the QEMUs if libvirt's own connection is intact, especially if the libvirt daemons are not communicating directly but through some central entity that manages the migration. And to do postcopy migration recovery, libvirt needs both sides to be in the postcopy-paused state.

Alternatively, there might be an issue with the connection between the libvirt daemons, but not with the migration connection. Even if the libvirt connection fails, the migration is not paused; libvirt lets it finish normally. Similarly, if the libvirt connection is broken due to, for example, a libvirt daemon restart, the ongoing migration is not paused; after the libvirt daemon starts again, it sees an ongoing migration and lets it finish. Additionally, libvirt uses its own internal keep-alive packets with much more aggressive timeouts, waiting 5 - 10 seconds idle before sending a keep-alive packet and then killing the connection if there is no response within 30 seconds.

I think that if we enable keep-alive in QEMU but make the default timeouts longer, for example an idle time of 5 minutes and 15 retries at 1 minute intervals (which would mean the connection is considered broken after 20 minutes of unsuccessful communication attempts), that would be an acceptable solution.

Finally, normal TCP packets already have a default system timeout and a limited number of retransmission attempts, so if a TCP packet cannot be delivered within 20 minutes (or even less) [1], the whole connection times out and the migration is paused/cancelled. So this patch doesn't really introduce any new or uncommon behavior; it merely tries to make both scenarios (with dst->src traffic and without dst->src traffic) behave similarly: both would time out after some time.

[1] Linux man-pages, tcp(7) on TCP timeouts: "Otherwise, failure may take up to 20 minutes with the current system defaults in a normal WAN environment."

Best regards,

Juraj Marcin

> With regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>
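P.S. To make the numbers above concrete, here is a minimal, self-contained sketch of how the relaxed defaults I'm suggesting (5 minutes idle, then 15 probes at 1 minute intervals, roughly 20 minutes total before the connection is declared dead) map onto the standard Linux TCP keep-alive socket options. This is only an illustration with a made-up helper name, not the patch code, which goes through InetSocketAddress instead:

    /* Illustrative only: plain POSIX/Linux socket calls, not QEMU code. */
    #include <stdio.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Hypothetical helper: apply the relaxed keep-alive defaults discussed above. */
    static int set_migration_keepalive(int fd)
    {
        int enable = 1;          /* turn keep-alive probing on */
        int idle_secs = 5 * 60;  /* idle time before the first probe */
        int probe_interval = 60; /* seconds between unanswered probes */
        int probe_count = 15;    /* probes sent before giving up */

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof(enable)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &probe_interval, sizeof(probe_interval)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probe_count, sizeof(probe_count)) < 0) {
            perror("setsockopt");
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }
        return set_migration_keepalive(fd) == 0 ? 0 : 1;
    }

Note that TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific names; other platforms spell these knobs differently, which is one more reason to keep the values configurable (with 0 meaning "use the system default") rather than hard-coding them.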