Prasad Pandit <ppan...@redhat.com> writes:

> Hi,
>
> On Tue, 11 Mar 2025 at 01:28, Fabiano Rosas <faro...@suse.de> wrote:
>> They occur when cleanup code is allowed to run when resources have not
>> yet been allocated or while the resources are still being accessed.
>>
>> Having the shutdown routine at a single point when it's clear everything
>> else is ready for shutdown helps not only to avoid those issues, but
>> also when investigating them. Otherwise you'll have the same code
>> running at (potentially) completely different points in time and one of
>> those times the system might be in an unexpected state.
>
> * I see. It's not clear when this would happen in a production deployment.
> ===
>      if (migrate_multifd()) {
>          multifd_send_shutdown();  <= [1]
>      }
>
>      postcopy_start(...)  <= [2]
> ===
>
> * There was another scenario regarding multifd shutdown as: the EOF or
> shutdown message sent via [1] on each multifd connection reaches the
> destination _later_ than the Postcopy phase start via [2]. And this
> may result in multifd_recv_threads on the destination overwriting
> memory pages written by postcopy thread, thus corrupting the guest
> state.

Isn't that the point? To add a sync for this which would allow the
shutdown to not be added?

>
> * Do we have any bugs/issues where these above things happened? To see
> the real circumstances under which it happened?
>

We do. They don't come with a description of the circumstances. You're
lucky if you get a coredump. You can peruse `git log migration/multifd`,
I'd say most of the work in the recent years has been solving
concurrency issues.

> * There seems to be some disconnect between the kind of scenarios we
> are considering and the minimal requirements for live migrations: a
> stable network with real good bandwidth.

There's no such requirement. Besides, the topic is not failed migrations
due to lack of resources. We're talking about correctness issues that
are hard to spot. Those should always be fixed when found, independently
of what the production environment is expected to be.
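
To make the ordering concern concrete, here is a rough sketch of the kind
of sync meant above: postcopy is held back until every multifd channel has
confirmed that its final data was consumed. This is only an illustration
under stated assumptions. DrainGate and the drain_gate_*() helpers are
hypothetical names invented for this example; they are not QEMU code and
do not describe how the series implements it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical model, not QEMU code: counts the multifd channels whose
 * last packet has not yet been acknowledged by the destination.
 */
typedef struct {
    atomic_int pending; /* channels still draining */
} DrainGate;

/* Arm the gate for nchannels multifd channels. */
static void drain_gate_init(DrainGate *g, int nchannels)
{
    assert(nchannels >= 0);
    atomic_init(&g->pending, nchannels);
}

/* Called once per channel when the destination confirms it is drained. */
static void drain_gate_channel_done(DrainGate *g)
{
    int prev = atomic_fetch_sub(&g->pending, 1);
    assert(prev > 0); /* a channel must not report completion twice */
}

/*
 * The step marked [2] above would only be allowed once this returns
 * true, so no multifd receive thread can still be writing guest pages
 * when postcopy begins.
 */
static bool drain_gate_postcopy_allowed(DrainGate *g)
{
    return atomic_load(&g->pending) == 0;
}
```

With two channels, drain_gate_postcopy_allowed() stays false after the
first drain_gate_channel_done() call and becomes true only after the
second. A real implementation would block or poll on that condition
rather than just test it.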