Hi,

On Tue, 11 Mar 2025 at 01:28, Fabiano Rosas <faro...@suse.de> wrote:
> They occur when cleanup code is allowed to run when resources have not
> yet been allocated or while the resources are still being accessed.
>
> Having the shutdown routine at a single point when it's clear everything
> else is ready for shutdown helps not only to avoid those issues, but
> also when investigating them. Otherwise you'll have the same code
> running at (potentially) completely different points in time and one of
> those times the system might be in an unexpected state.

* I see. It's not clear when this would happen in a production deployment.
===
    if (migrate_multifd()) {
        multifd_send_shutdown();  <= [1]
    }

    postcopy_start(...)  <= [2]
===

* Another multifd shutdown scenario came up earlier: the EOF or
shutdown message sent via [1] on each multifd connection reaches the
destination _later_ than the start of the Postcopy phase via [2]. This
may result in multifd_recv_threads on the destination overwriting
memory pages already written by the postcopy thread, thus corrupting
the guest state.
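
To make the ordering concern concrete, here is a toy model of a single
guest page on the destination. It is only a sketch: the struct, enum
and function names are hypothetical and are not QEMU code. It assumes
writes are applied in arrival order with no synchronisation, so the
last write wins.

```c
#include <stddef.h>

/* Hypothetical toy model, not QEMU code: which thread applies a write. */
enum writer { MULTIFD_RECV, POSTCOPY };

struct page_write {
    enum writer who;  /* label only: destination thread doing the write */
    int version;      /* higher means newer page content from the source */
};

/*
 * Apply writes to one guest page in arrival order, with no
 * synchronisation between writers; the last write wins.
 * Returns the version left in the page afterwards.
 */
int final_page_version(const struct page_write *w, size_t n)
{
    int page = 0;

    for (size_t i = 0; i < n; i++) {
        page = w[i].version;
    }
    return page;
}
```

If the multifd write (older content) arrives before the postcopy write,
the page ends up with the newer version. If it arrives after, the page
is left with the older version, which is the corruption described
above.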

* Do we have any bugs/issues where the above actually happened? It
would help to see the real circumstances under which they occurred.

* There seems to be some disconnect between the kind of scenarios we
are considering and the minimal requirements for live migration: a
stable network with good bandwidth. If we test live migration with the
guest dirtying RAM to the tune of 64M/128M/256M/512M bytes, that
assumes/implies that the network bandwidth is much more than 512Mbps,
no?
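
A back-of-the-envelope check of that, as a sketch only. It assumes the
figures above are dirty rates in MiB per second (the text does not say
so explicitly), and required_mbps() is a hypothetical helper, not
something from QEMU. It gives the bandwidth the link must at least
match for migration to keep up with the guest.

```c
#include <stdint.h>

/*
 * Assumption: dirty_mib_per_sec is the guest dirty rate in MiB/s.
 * Returns the matching network rate in whole Mbps (10^6 bits/s),
 * truncated towards zero.
 */
uint64_t required_mbps(uint64_t dirty_mib_per_sec)
{
    uint64_t bits_per_sec = dirty_mib_per_sec * 1024 * 1024 * 8;

    return bits_per_sec / 1000000;
}
```

Under that assumption, even 64 MiB/s works out to a little over
536 Mbps, and 512 MiB/s to roughly 4294 Mbps, before any protocol
overhead.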

Thank you.
---
  - Prasad

