Avihai Horon <avih...@nvidia.com> writes:

> On 29/01/2024 16:34, Fabiano Rosas wrote:
>> Avihai Horon <avih...@nvidia.com> writes:
>>
>>> Currently, multifd channels are created asynchronously without waiting
>>> for their creation -- migration simply proceeds and may wait in
>>> multifd_send_sync_main(), which is called by ram_save_setup(). This
>>> hides some race conditions that can cause unexpected behavior if
>>> channel creation fails.
>>>
>>> For example, the following scenario of multifd migration with two
>>> channels, where the first channel creation fails, will end in a
>>> segmentation fault (time advances from top to bottom):
>>
>> Is this reproducible? Or at least observable?
>
> Yes, though I had to engineer it a bit:
> 1. Run migration with two multifd channels and fail creation of the two
>    channels (e.g., by changing the address they are connecting to).
> 2. Add sleep(3) in multifd_send_sync_main() before we loop through the
>    channels and check p->quit.
> 3. Add sleep(5) only for the second multifd channel connect thread so
>    its connection is delayed and runs last.

Ok, well, that's something at least. I'll try to reproduce it so we can
keep track of it.
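In the meantime, the essence of that reproduction as a standalone
program would look something like the sketch below -- plain pthreads,
made-up names, not QEMU code. "Channel 1" fails immediately, the main
thread tears the shared state down, and the deliberately delayed
channel 2 then touches freed memory (AddressSanitizer flags the
use-after-free):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct channel {
    char *name;   /* freed by cleanup while channel 2 still sleeps */
    int delay;    /* mimics the injected sleep(5) */
};

static struct channel *channels[2];

static void *channel_thread(void *opaque)
{
    struct channel *c = opaque;

    sleep(c->delay);
    /* For channel 2 this runs after cleanup: use-after-free. */
    printf("connecting %s\n", c->name);
    return NULL;
}

int main(void)
{
    pthread_t threads[2];

    for (int i = 0; i < 2; i++) {
        channels[i] = malloc(sizeof(*channels[i]));
        channels[i]->name = strdup(i == 0 ? "channel 1" : "channel 2");
        channels[i]->delay = i == 0 ? 0 : 5;
        pthread_create(&threads[i], NULL, channel_thread, channels[i]);
    }

    sleep(1);                      /* "notice" channel 1 failing...  */
    for (int i = 0; i < 2; i++) {  /* ...and tear everything down    */
        free(channels[i]->name);
        free(channels[i]);
    }

    for (int i = 0; i < 2; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}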
>> I acknowledge the situation you describe, but with multifd there's
>> usually an issue in cleanup paths. Let's make sure we've flushed those
>> out before adding this new semaphore.
>
> Indeed, I was not keen on adding yet another semaphore either.
> I think there are multiple bugs here; some of them overlap and some don't.
> There is also your and Peter's previous work, which I was not aware of,
> to fix those and to clean up the code.
>
> Maybe we can take it one step at a time, pushing your series first,
> cleaning up the code and fixing some bugs.
> Then we can see what bugs are left (if any) and fix them. It might even
> be easier to fix after the cleanups.
>
>> This is similar to an issue Peter was addressing where we missed calling
>> multifd_send_terminate_threads() in the multifd_channel_connect() path:
>>
>> patch 4 in this series:
>> https://lore.kernel.org/r/20231022201211.452861-1-pet...@redhat.com
>
> What issue are you referring to here? Can you elaborate?

Oh, I just realised that series doesn't address any particular bug. But
my point is that including a call to multifd_send_terminate_threads() at
new_send_channel_cleanup might be all that's needed, because that has
code to cause the channels and the migration thread to end.

> The main issue I am trying to fix in my patch is that we don't wait for
> all multifd channels to be created/error out before tearing down
> multifd resources in multifd_save_cleanup().

Ok, let me take a step back and ask why this is not solved by
multifd_save_cleanup() -> qemu_thread_join()?

I see you moved p->running=true to *after* the thread creation in patch
4. That will always leave a gap where p->running == false but the
thread is already running.

>
>>> Thread                  | Code execution
>>> ------------------------------------------------------------------------
>>> Multifd 1               |
>>>                         | multifd_new_send_channel_async (errors and quits)
>>>                         | multifd_new_send_channel_cleanup
>>>                         |
>>> Migration thread        |
>>>                         | qemu_savevm_state_setup
>>>                         | ram_save_setup
>>>                         | multifd_send_sync_main
>>>                         | (detects Multifd 1 error and quits)
>>>                         | [...]
>>>                         | migration_iteration_finish
>>>                         | migrate_fd_cleanup_schedule
>>>                         |
>>> Main thread             |
>>>                         | migrate_fd_cleanup
>>>                         | multifd_save_cleanup (destroys Multifd 2 resources)
>>>                         |
>>> Multifd 2               |
>>>                         | multifd_new_send_channel_async
>>>                         | (accesses destroyed resources, segfault)
>>>
>>> In another scenario, migration can hang indefinitely:
>>> 1. The main migration thread reaches multifd_send_sync_main() and waits
>>>    on the semaphores.
>>> 2. Then, creation of all the multifd channels fails, so they post the
>>>    semaphores and quit.
>>> 3. The main migration thread will not identify the error, will proceed
>>>    to send pages, and will hang.
>>>
>>> Fix it by waiting for all multifd channels to be created before
>>> proceeding with migration.
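That hang scenario is easy to see in isolation, by the way. A minimal
pthreads sketch of it (made-up names; the real code uses QemuSemaphore
and per-channel state): every "channel" fails, posts once and quits,
the main thread consumes exactly those posts without checking why they
were made, and then blocks forever on the next wait:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NUM_CHANNELS 2

static sem_t channels_ready;

static void *channel_thread(void *opaque)
{
    (void)opaque;
    /* Creation fails: post once so waiters make progress, then quit. */
    sem_post(&channels_ready);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_CHANNELS];

    sem_init(&channels_ready, 0, 0);
    for (int i = 0; i < NUM_CHANNELS; i++) {
        pthread_create(&threads[i], NULL, channel_thread, NULL);
    }

    /* The sync step consumes exactly the posts made by the failing
     * channels, without noticing the error behind them. */
    for (int i = 0; i < NUM_CHANNELS; i++) {
        sem_wait(&channels_ready);
    }

    /* Proceed to "send pages": all channels are gone, nobody will
     * ever post again, so this wait never returns. */
    printf("sending pages...\n");
    sem_wait(&channels_ready);
    return 0;
}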
>>>
>>> Signed-off-by: Avihai Horon <avih...@nvidia.com>
>>> ---
>>>  migration/multifd.h   |  3 +++
>>>  migration/migration.c |  1 +
>>>  migration/multifd.c   | 34 +++++++++++++++++++++++++++++++---
>>>  migration/ram.c       |  7 +++++++
>>>  4 files changed, 42 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/migration/multifd.h b/migration/multifd.h
>>> index 35d11f103c..87a64e0a87 100644
>>> --- a/migration/multifd.h
>>> +++ b/migration/multifd.h
>>> @@ -23,6 +23,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
>>>  void multifd_recv_sync_main(void);
>>>  int multifd_send_sync_main(void);
>>>  int multifd_queue_page(RAMBlock *block, ram_addr_t offset);
>>> +int multifd_send_channels_created(void);
>>>
>>>  /* Multifd Compression flags */
>>>  #define MULTIFD_FLAG_SYNC (1 << 0)
>>> @@ -86,6 +87,8 @@ typedef struct {
>>>      /* multifd flags for sending ram */
>>>      int write_flags;
>>>
>>> +    /* Syncs channel creation and migration thread */
>>> +    QemuSemaphore create_sem;
>>>      /* sem where to wait for more work */
>>>      QemuSemaphore sem;
>>>      /* syncs main thread and channels */
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 9c769a1ecd..d81d96eaa5 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -3621,6 +3621,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>>          error_report_err(local_err);
>>>          migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>>>                            MIGRATION_STATUS_FAILED);
>>> +        multifd_send_channels_created();
>>>          migrate_fd_cleanup(s);
>>>          return;
>>>      }
>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>> index 564e911b6c..f0c216f4f9 100644
>>> --- a/migration/multifd.c
>>> +++ b/migration/multifd.c
>>> @@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
>>>          multifd_send_channel_destroy(p->c);
>>>          p->c = NULL;
>>>          qemu_mutex_destroy(&p->mutex);
>>> +        qemu_sem_destroy(&p->create_sem);
>>>          qemu_sem_destroy(&p->sem);
>>>          qemu_sem_destroy(&p->sem_sync);
>>>          g_free(p->name);
>>> @@ -766,6 +767,29 @@ out:
>>>      return NULL;
>>>  }
>>>
>>> +int multifd_send_channels_created(void)
>>> +{
>>> +    int ret = 0;
>>> +
>>> +    if (!migrate_multifd()) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    for (int i = 0; i < migrate_multifd_channels(); i++) {
>>> +        MultiFDSendParams *p = &multifd_send_state->params[i];
>>> +
>>> +        qemu_sem_wait(&p->create_sem);
>>> +        WITH_QEMU_LOCK_GUARD(&p->mutex) {
>>> +            if (p->quit) {
>>> +                error_report("%s: channel %d has already quit", __func__, i);
>>> +                ret = -1;
>>> +            }
>>> +        }
>>
>> There are races here when a channel fails at
>> multifd_send_initial_packet(). If p->quit can be set after the post to
>> create_sem, then this function could always return success and we'd run
>> into a broken channel. Possibly even the same bug you're trying to fix.
>>
>> I think that's one of the reasons we have channels_ready.
>
> I am not sure exactly what bug you are describing here, can you elaborate?

This looks like it could be a preexisting issue actually, but in the
current code, what stops the multifd channel from running ahead is
p->sem. Except that the channel does some work at
multifd_send_initial_packet() before checking p->sem, and that work
could fail.

This means that right after checking p->quit above, the multifd thread
could already be exiting due to an error, and
multifd_send_channels_created() would miss it. So there's still a
chance that this function effectively behaves just like the previous
code.
 Thread           | Code execution
------------------------------------------------------------------------
 Migration        |
                  | ram_save_setup()
                  |   multifd_send_channels_created()
                  |     qemu_sem_wait(&p->create_sem);
 Main             |
                  | multifd_channel_connect()
                  |   qemu_thread_create()
                  |   qemu_sem_post(&p->create_sem)
 Multifd 1        |
                  | multifd_send_initial_packet() *errors*
                  | goto out
                  | multifd_send_terminate_threads()
 Migration        |
                  | still at multifd_send_channels_created
                  | qemu_mutex_lock(&p->mutex);
                  | p->quit == false  <--- !!!
                  | qemu_mutex_unlock(&p->mutex);
                  | return success from multifd_send_channels_created()

From here onwards, it's the same as not having checked
multifd_send_channels_created() at all. One way this could go is:

 Migration        |
                  | runs unimpeded until multifd_send_sync_main()
                  | multifd_send_pages()
                  |   *misses the exiting flag*
                  |   qemu_sem_wait(&multifd_send_state->channels_ready);
 Multifd 1        |
                  | still at multifd_send_terminate_threads
                  | qemu_mutex_lock(&p->mutex);
                  | p->quit = true;
                  | qemu_mutex_unlock(&p->mutex);
                  | qemu_sem_post(&p->sem_sync);
                  | qemu_sem_post(&multifd_send_state->channels_ready);
 Migration        |
                  | qemu_mutex_lock(&p->mutex);
                  | p->quit == true;  <--- correct now
                  | qemu_mutex_unlock(&p->mutex);
                  | return -1;
                  | *all good again*

It seems that in order to avoid this kind of race we'd need the
synchronization point to be at the multifd thread instead. But then,
that's what channels_ready does.

>>
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>  static bool multifd_channel_connect(MultiFDSendParams *p,
>>>                                      QIOChannel *ioc,
>>>                                      Error **errp);
>>> @@ -794,6 +818,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>>>      p->quit = true;
>>>      qemu_sem_post(&multifd_send_state->channels_ready);
>>>      qemu_sem_post(&p->sem_sync);
>>> +    qemu_sem_post(&p->create_sem);
>>>      error_free(err);
>>>  }
>>>
>>> @@ -857,6 +882,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>>>      qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
>>>                         QEMU_THREAD_JOINABLE);
>>>      p->running = true;
>>> +    qemu_sem_post(&p->create_sem);
>>>      return true;
>>>  }
>>>
>>> @@ -864,15 +890,16 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
>>>                                               QIOChannel *ioc, Error *err)
>>>  {
>>>      migrate_set_error(migrate_get_current(), err);
>>> -    /* Error happen, we need to tell who pay attention to me */
>>> -    qemu_sem_post(&multifd_send_state->channels_ready);
>>> -    qemu_sem_post(&p->sem_sync);
>>>      /*
>>> +     * Error happen, we need to tell who pay attention to me.
>>>       * Although multifd_send_thread is not created, but main migration
>>>       * thread need to judge whether it is running, so we need to mark
>>>       * its status.
>>>       */
>>>      p->quit = true;
>>> +    qemu_sem_post(&multifd_send_state->channels_ready);
>>> +    qemu_sem_post(&p->sem_sync);
>>> +    qemu_sem_post(&p->create_sem);
>>
>> Do we still need channels_ready and sem_sync here? The migration thread
>> shouldn't have gone past create_sem at this point.
>
> I think you are right, we can drop channels_ready and sem_sync here.
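Going back to the synchronization point question: moving it into the
thread itself looks roughly like this standalone sketch (plain POSIX
primitives, illustrative names only, not the actual QEMU code). The
worker records its failure under the lock *before* posting its "ready"
semaphore, so a waiter that gets past the wait is guaranteed to observe
the error as well:

#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>
#include <stdio.h>

struct worker {
    pthread_t thread;
    sem_t ready;             /* posted by the worker, not the creator */
    pthread_mutex_t lock;
    bool failed;
};

static bool send_initial_packet(struct worker *w)
{
    (void)w;
    return false;            /* simulate the early failure */
}

static void *worker_fn(void *opaque)
{
    struct worker *w = opaque;

    if (!send_initial_packet(w)) {
        pthread_mutex_lock(&w->lock);
        w->failed = true;    /* recorded *before* the post... */
        pthread_mutex_unlock(&w->lock);
    }
    sem_post(&w->ready);     /* ...so the waiter cannot miss it */
    return NULL;
}

int main(void)
{
    struct worker w = { .failed = false };

    sem_init(&w.ready, 0, 0);
    pthread_mutex_init(&w.lock, NULL);
    pthread_create(&w.thread, NULL, worker_fn, &w);

    sem_wait(&w.ready);      /* the sync point is in the worker itself */
    pthread_mutex_lock(&w.lock);
    printf("worker %s\n", w.failed ? "failed" : "ok");
    pthread_mutex_unlock(&w.lock);

    pthread_join(w.thread, NULL);
    return 0;
}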
>
>>
>>>      object_unref(OBJECT(ioc));
>>>      error_free(err);
>>>  }
>>> @@ -921,6 +948,7 @@ int multifd_save_setup(Error **errp)
>>>          MultiFDSendParams *p = &multifd_send_state->params[i];
>>>
>>>          qemu_mutex_init(&p->mutex);
>>> +        qemu_sem_init(&p->create_sem, 0);
>>>          qemu_sem_init(&p->sem, 0);
>>>          qemu_sem_init(&p->sem_sync, 0);
>>>          p->quit = false;
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index c0cdcccb75..b3e864a22b 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -2937,6 +2937,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>>      RAMBlock *block;
>>>      int ret;
>>>
>>> +    bql_unlock();
>>> +    ret = multifd_send_channels_created();
>>> +    bql_lock();
>>> +    if (ret < 0) {
>>> +        return ret;
>>> +    }
>>> +
>>>      if (compress_threads_save_setup()) {
>>>          return -1;
>>>      }
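One note on the bql_unlock()/bql_lock() pair above: I assume the point
is that a channel's creation path may itself need the BQL, so blocking
on create_sem while holding it could deadlock. The pattern in
standalone form (made-up names, not QEMU code; "big_lock" plays the
role of the BQL):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t created;

static void *channel_setup(void *opaque)
{
    (void)opaque;
    pthread_mutex_lock(&big_lock);   /* setup needs the global lock too */
    pthread_mutex_unlock(&big_lock);
    sem_post(&created);
    return NULL;
}

int main(void)
{
    pthread_t thread;

    sem_init(&created, 0, 0);
    pthread_mutex_lock(&big_lock);
    pthread_create(&thread, NULL, channel_setup, NULL);

    pthread_mutex_unlock(&big_lock); /* drop the lock...                */
    sem_wait(&created);              /* ...so this wait cannot deadlock */
    pthread_mutex_lock(&big_lock);   /* ...then retake it               */

    printf("channel created\n");
    pthread_mutex_unlock(&big_lock);
    pthread_join(thread, NULL);
    return 0;
}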