Hi,

On Tue, 11 Mar 2025 at 01:28, Fabiano Rosas <faro...@suse.de> wrote:
> They occur when cleanup code is allowed to run when resources have not
> yet been allocated or while the resources are still being accessed.
>
> Having the shutdown routine at a single point when it's clear everything
> else is ready for shutdown helps not only to avoid those issues, but
> also when investigating them. Otherwise you'll have the same code
> running at (potentially) completely different points in time and one of
> those times the system might be in an unexpected state.

* I see. It's not clear when this would happen in a production deployment.
===
    if (migrate_multifd()) {
        multifd_send_shutdown();     <= [1]
    }

    postcopy_start(...)              <= [2]
===

* There was another scenario regarding multifd shutdown: the EOF or
  shutdown message sent via [1] on each multifd connection reaches the
  destination _later_ than the start of the Postcopy phase via [2]. This
  may result in multifd_recv_threads on the destination overwriting
  memory pages written by the postcopy thread, thus corrupting the
  guest state.

* Do we have any bugs/issues where the above things happened, so that
  we can see the real circumstances under which they occurred?

* There seems to be some disconnect between the kind of scenarios we
  are considering and the minimal requirements for live migration: a
  stable network with really good bandwidth. If we test live migration
  with the guest dirtying RAM to the tune of 64M/128M/256M/512M bytes,
  that assumes/implies that the network bandwidth is much more than
  512Mbps, no?

Thank you.
---
  - Prasad