On Tue, Nov 29, 2022 at 05:28:26PM -0300, Leonardo Bras Soares Passos wrote: > Hello Peter,
Leo, > > On Thu, Nov 24, 2022 at 1:04 PM Peter Xu <pet...@redhat.com> wrote: > > > > On Wed, Nov 09, 2022 at 02:56:29AM -0300, Leonardo Bras wrote: > > > diff --git a/migration/savevm.c b/migration/savevm.c > > > index a0cdb714f7..250caff7f4 100644 > > > --- a/migration/savevm.c > > > +++ b/migration/savevm.c > > > @@ -1889,6 +1889,8 @@ static void *postcopy_ram_listen_thread(void > > > *opaque) > > > exit(EXIT_FAILURE); > > > } > > > > > > + migration_load_cleanup(); > > > > It's a bit weird to call multifd-load-clean in a listen phase.. > > I agree. > > > > > How about moving it right above > > trace_process_incoming_migration_co_postcopy_end_main()? Then the new > > helper can also be static. > > Seems a nice Idea to have this function to be static. > > We have to guarantee this is run after the migration finished, but > before migration_incoming_state_destroy(). IIUC it doesn't need to be when migration finished. It should be fine as long as we finished precopy phase, and that's what the migration coroutine does, iiuc. The thing is postcopy doesn't use multifd at all, so logically it can be released before postcopy starts. Actually, IMHO it'll be safer to do it like that, just to make sure we won't accidentally receive multifd pages _after_ postcopy starts, because that'll be another more severe and hard to debug issue since the guest can see partial copied pages from multifd recv channels. > > You suggested calling it right above of > trace_process_incoming_migration_co_postcopy_end_main(), which git > grep pointed me to an if clause in process_incoming_migration_co(). > If I got the location correctly, it would not help: this coroutine is > ran just after the VM went to the target host, and not when the > migration finished. > > If we are using multifd channels, this will break the migration with > segmentation fault (SIGSEGV), since the channels have not finished > sending yet. If this happens, then I had a feeling that there's something else that needs syncs. As I discussed above, we should make sure multifd pages all landed before we start vcpu threads. Said that, now I think I'm not against your original proposal to fix this immediate crash. However I am still wondering whether we really should disable multifd with postcopy, as there seem to be still a few missing pieces even to enable multifd during precopy-only. Thanks, -- Peter Xu