Avihai Horon <avih...@nvidia.com> writes:

> On 29/01/2024 16:34, Fabiano Rosas wrote:
>> Avihai Horon <avih...@nvidia.com> writes:
>>
>>> Currently, multifd channels are created asynchronously without waiting
>>> for their creation -- migration simply proceeds and may wait in
>>> multifd_send_sync_main(), which is called by ram_save_setup(). This
>>> hides some race conditions that can cause unexpected behavior if
>>> channel creation fails.
>>>
>>> For example, the following scenario of multifd migration with two
>>> channels, where the first channel creation fails, will end in a
>>> segmentation fault (time advances from top to bottom):
>>
>> Is this reproducible? Or at least observable?
>
> Yes, though I had to engineer it a bit:
> 1. Run migration with two multifd channels and fail creation of the two
>    channels (e.g., by changing the address they are connecting to).
> 2. Add sleep(3) in multifd_send_sync_main() before we loop through the
>    channels and check p->quit.
> 3. Add sleep(5) only for the second multifd channel connect thread so
>    its connection is delayed and runs last.

Ok, well, that's something at least. I'll try to reproduce it so we can
keep track of it.
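In the meantime, the essence of that reproduction as a standalone
program would look something like the sketch below -- plain pthreads,
made-up names, not QEMU code. "Channel 1" fails immediately, the main
thread tears the shared state down, and the deliberately delayed
channel 2 then touches freed memory (AddressSanitizer flags the
use-after-free):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct channel {
    char *name;   /* freed by cleanup while channel 2 still sleeps */
    int delay;    /* mimics the injected sleep(5) */
};

static struct channel *channels[2];

static void *channel_thread(void *opaque)
{
    struct channel *c = opaque;

    sleep(c->delay);
    /* For channel 2 this runs after cleanup: use-after-free. */
    printf("connecting %s\n", c->name);
    return NULL;
}

int main(void)
{
    pthread_t threads[2];

    for (int i = 0; i < 2; i++) {
        channels[i] = malloc(sizeof(*channels[i]));
        channels[i]->name = strdup(i == 0 ? "channel 1" : "channel 2");
        channels[i]->delay = i == 0 ? 0 : 5;
        pthread_create(&threads[i], NULL, channel_thread, channels[i]);
    }

    sleep(1);                      /* "notice" channel 1 failing...  */
    for (int i = 0; i < 2; i++) {  /* ...and tear everything down    */
        free(channels[i]->name);
        free(channels[i]);
    }

    for (int i = 0; i < 2; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}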
>> I acknowledge the situation you describe, but with multifd there's
>> usually an issue in cleanup paths. Let's make sure we've flushed those
>> out before adding this new semaphore.
>
> Indeed, I was not keen on adding yet another semaphore either.
> I think there are multiple bugs here; some of them overlap and some don't.
> There is also your and Peter's previous work, which I was not aware of,
> to fix those and to clean up the code.
>
> Maybe we can take it one step at a time, pushing your series first,
> cleaning up the code and fixing some bugs.
> Then we can see what bugs are left (if any) and fix them. It might even
> be easier to fix after the cleanups.
>
>> This is similar to an issue Peter was addressing where we missed calling
>> multifd_send_terminate_threads() in the multifd_channel_connect() path:
>>
>> patch 4 in this series:
>> https://lore.kernel.org/r/20231022201211.452861-1-pet...@redhat.com
>
> What issue are you referring to here? Can you elaborate?

Oh, I just realised that series doesn't address any particular bug. But
my point is that including a call to multifd_send_terminate_threads() at
new_send_channel_cleanup might be all that's needed, because that has
code to cause the channels and the migration thread to end.

> The main issue I am trying to fix in my patch is that we don't wait for
> all multifd channels to be created/error out before tearing down
> multifd resources in multifd_save_cleanup().

Ok, let me take a step back and ask why this is not solved by
multifd_save_cleanup() -> qemu_thread_join()?

I see you moved p->running=true to *after* the thread creation in patch
4. That will always leave a gap where p->running == false but the
thread is already running.

>
>>> Thread                  | Code execution
>>> ------------------------------------------------------------------------
>>> Multifd 1               |
>>>                         | multifd_new_send_channel_async (errors and quits)
>>>                         | multifd_new_send_channel_cleanup
>>>                         |
>>> Migration thread        |
>>>                         | qemu_savevm_state_setup
>>>                         | ram_save_setup
>>>                         | multifd_send_sync_main
>>>                         | (detects Multifd 1 error and quits)
>>>                         | [...]
>>>                         | migration_iteration_finish
>>>                         | migrate_fd_cleanup_schedule
>>>                         |
>>> Main thread             |
>>>                         | migrate_fd_cleanup
>>>                         | multifd_save_cleanup (destroys Multifd 2 resources)
>>>                         |
>>> Multifd 2               |
>>>                         | multifd_new_send_channel_async
>>>                         | (accesses destroyed resources, segfault)
>>>
>>> In another scenario, migration can hang indefinitely:
>>> 1. The main migration thread reaches multifd_send_sync_main() and waits
>>>    on the semaphores.
>>> 2. Then, creation of all the multifd channels fails, so they post the
>>>    semaphores and quit.
>>> 3. The main migration thread will not identify the error, will proceed
>>>    to send pages, and will hang.
>>>
>>> Fix it by waiting for all multifd channels to be created before
>>> proceeding with migration.
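That hang scenario is easy to see in isolation, by the way. A minimal
pthreads sketch of it (made-up names; the real code uses QemuSemaphore
and per-channel state): every "channel" fails, posts once and quits,
the main thread consumes exactly those posts without checking why they
were made, and then blocks forever on the next wait:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NUM_CHANNELS 2

static sem_t channels_ready;

static void *channel_thread(void *opaque)
{
    (void)opaque;
    /* Creation fails: post once so waiters make progress, then quit. */
    sem_post(&channels_ready);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_CHANNELS];

    sem_init(&channels_ready, 0, 0);
    for (int i = 0; i < NUM_CHANNELS; i++) {
        pthread_create(&threads[i], NULL, channel_thread, NULL);
    }

    /* The sync step consumes exactly the posts made by the failing
     * channels, without noticing the error behind them. */
    for (int i = 0; i < NUM_CHANNELS; i++) {
        sem_wait(&channels_ready);
    }

    /* Proceed to "send pages": all channels are gone, nobody will
     * ever post again, so this wait never returns. */
    printf("sending pages...\n");
    sem_wait(&channels_ready);
    return 0;
}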
>>>
>>> Signed-off-by: Avihai Horon <avih...@nvidia.com>
>>> ---
>>>  migration/multifd.h   |  3 +++
>>>  migration/migration.c |  1 +
>>>  migration/multifd.c   | 34 +++++++++++++++++++++++++++++++---
>>>  migration/ram.c       |  7 +++++++
>>>  4 files changed, 42 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/migration/multifd.h b/migration/multifd.h
>>> index 35d11f103c..87a64e0a87 100644
>>> --- a/migration/multifd.h
>>> +++ b/migration/multifd.h
>>> @@ -23,6 +23,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
>>>  void multifd_recv_sync_main(void);
>>>  int multifd_send_sync_main(void);
>>>  int multifd_queue_page(RAMBlock *block, ram_addr_t offset);
>>> +int multifd_send_channels_created(void);
>>>
>>>  /* Multifd Compression flags */
>>>  #define MULTIFD_FLAG_SYNC (1 << 0)
>>> @@ -86,6 +87,8 @@ typedef struct {
>>>      /* multifd flags for sending ram */
>>>      int write_flags;
>>>
>>> +    /* Syncs channel creation and migration thread */
>>> +    QemuSemaphore create_sem;
>>>      /* sem where to wait for more work */
>>>      QemuSemaphore sem;
>>>      /* syncs main thread and channels */
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 9c769a1ecd..d81d96eaa5 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -3621,6 +3621,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>>          error_report_err(local_err);
>>>          migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>>>                            MIGRATION_STATUS_FAILED);
>>> +        multifd_send_channels_created();
>>>          migrate_fd_cleanup(s);
>>>          return;
>>>      }
>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>> index 564e911b6c..f0c216f4f9 100644
>>> --- a/migration/multifd.c
>>> +++ b/migration/multifd.c
>>> @@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
>>>          multifd_send_channel_destroy(p->c);
>>>          p->c = NULL;
>>>          qemu_mutex_destroy(&p->mutex);
>>> +        qemu_sem_destroy(&p->create_sem);
>>>          qemu_sem_destroy(&p->sem);
>>>          qemu_sem_destroy(&p->sem_sync);
>>>          g_free(p->name);
>>> @@ -766,6 +767,29 @@ out:
>>>      return NULL;
>>>  }
>>>
>>> +int multifd_send_channels_created(void)
>>> +{
>>> +    int ret = 0;
>>> +
>>> +    if (!migrate_multifd()) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    for (int i = 0; i < migrate_multifd_channels(); i++) {
>>> +        MultiFDSendParams *p = &multifd_send_state->params[i];
>>> +
>>> +        qemu_sem_wait(&p->create_sem);
>>> +        WITH_QEMU_LOCK_GUARD(&p->mutex) {
>>> +            if (p->quit) {
>>> +                error_report("%s: channel %d has already quit", __func__, i);
>>> +                ret = -1;
>>> +            }
>>> +        }
>>
>> There are races here when a channel fails at
>> multifd_send_initial_packet(). If p->quit can be set after the post to
>> create_sem, then this function could always return success and we'd run
>> into a broken channel. Possibly even the same bug you're trying to fix.
>>
>> I think that's one of the reasons we have channels_ready.
>
> I am not sure exactly what bug you are describing here, can you elaborate?

This looks like it could be a preexisting issue actually, but in the
current code, what stops the multifd channel from running ahead is
p->sem. Except that the channel does some work at
multifd_send_initial_packet() before checking p->sem, and that work
could fail.

This means that right after checking p->quit above, the multifd thread
could already be exiting due to an error, and
multifd_send_channels_created() would miss it. So there's still a
chance that this function effectively behaves just like the previous
code.
 Thread           | Code execution
------------------------------------------------------------------------
 Migration        |
                  | ram_save_setup()
                  |   multifd_send_channels_created()
                  |     qemu_sem_wait(&p->create_sem);
 Main             |
                  | multifd_channel_connect()
                  |   qemu_thread_create()
                  |   qemu_sem_post(&p->create_sem)
 Multifd 1        |
                  | multifd_send_initial_packet() *errors*
                  | goto out
                  | multifd_send_terminate_threads()
 Migration        |
                  | still at multifd_send_channels_created
                  | qemu_mutex_lock(&p->mutex);
                  | p->quit == false  <--- !!!
                  | qemu_mutex_unlock(&p->mutex);
                  | return success from multifd_send_channels_created()

From here onwards, it's the same as not having checked
multifd_send_channels_created() at all. One way this could go is:

 Migration        |
                  | runs unimpeded until multifd_send_sync_main()
                  | multifd_send_pages()
                  |   *misses the exiting flag*
                  |   qemu_sem_wait(&multifd_send_state->channels_ready);
 Multifd 1        |
                  | still at multifd_send_terminate_threads
                  | qemu_mutex_lock(&p->mutex);
                  | p->quit = true;
                  | qemu_mutex_unlock(&p->mutex);
                  | qemu_sem_post(&p->sem_sync);
                  | qemu_sem_post(&multifd_send_state->channels_ready);
 Migration        |
                  | qemu_mutex_lock(&p->mutex);
                  | p->quit == true;  <--- correct now
                  | qemu_mutex_unlock(&p->mutex);
                  | return -1;
                  | *all good again*

It seems that in order to avoid this kind of race we'd need the
synchronization point to be at the multifd thread instead. But then,
that's what channels_ready does.

>>
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>  static bool multifd_channel_connect(MultiFDSendParams *p,
>>>                                      QIOChannel *ioc,
>>>                                      Error **errp);
>>> @@ -794,6 +818,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>>>      p->quit = true;
>>>      qemu_sem_post(&multifd_send_state->channels_ready);
>>>      qemu_sem_post(&p->sem_sync);
>>> +    qemu_sem_post(&p->create_sem);
>>>      error_free(err);
>>>  }
>>>
>>> @@ -857,6 +882,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>>>      qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
>>>                         QEMU_THREAD_JOINABLE);
>>>      p->running = true;
>>> +    qemu_sem_post(&p->create_sem);
>>>      return true;
>>>  }
>>>
>>> @@ -864,15 +890,16 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
>>>                                               QIOChannel *ioc, Error *err)
>>>  {
>>>      migrate_set_error(migrate_get_current(), err);
>>> -    /* Error happen, we need to tell who pay attention to me */
>>> -    qemu_sem_post(&multifd_send_state->channels_ready);
>>> -    qemu_sem_post(&p->sem_sync);
>>>      /*
>>> +     * Error happen, we need to tell who pay attention to me.
>>>       * Although multifd_send_thread is not created, but main migration
>>>       * thread need to judge whether it is running, so we need to mark
>>>       * its status.
>>>       */
>>>      p->quit = true;
>>> +    qemu_sem_post(&multifd_send_state->channels_ready);
>>> +    qemu_sem_post(&p->sem_sync);
>>> +    qemu_sem_post(&p->create_sem);
>>
>> Do we still need channels_ready and sem_sync here? The migration thread
>> shouldn't have gone past create_sem at this point.
>
> I think you are right, we can drop channels_ready and sem_sync here.
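Going back to the synchronization point question: moving it into the
thread itself looks roughly like this standalone sketch (plain POSIX
primitives, illustrative names only, not the actual QEMU code). The
worker records its failure under the lock *before* posting its "ready"
semaphore, so a waiter that gets past the wait is guaranteed to observe
the error as well:

#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>
#include <stdio.h>

struct worker {
    pthread_t thread;
    sem_t ready;             /* posted by the worker, not the creator */
    pthread_mutex_t lock;
    bool failed;
};

static bool send_initial_packet(struct worker *w)
{
    (void)w;
    return false;            /* simulate the early failure */
}

static void *worker_fn(void *opaque)
{
    struct worker *w = opaque;

    if (!send_initial_packet(w)) {
        pthread_mutex_lock(&w->lock);
        w->failed = true;    /* recorded *before* the post... */
        pthread_mutex_unlock(&w->lock);
    }
    sem_post(&w->ready);     /* ...so the waiter cannot miss it */
    return NULL;
}

int main(void)
{
    struct worker w = { .failed = false };

    sem_init(&w.ready, 0, 0);
    pthread_mutex_init(&w.lock, NULL);
    pthread_create(&w.thread, NULL, worker_fn, &w);

    sem_wait(&w.ready);      /* the sync point is in the worker itself */
    pthread_mutex_lock(&w.lock);
    printf("worker %s\n", w.failed ? "failed" : "ok");
    pthread_mutex_unlock(&w.lock);

    pthread_join(w.thread, NULL);
    return 0;
}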
>
>>
>>>      object_unref(OBJECT(ioc));
>>>      error_free(err);
>>>  }
>>> @@ -921,6 +948,7 @@ int multifd_save_setup(Error **errp)
>>>          MultiFDSendParams *p = &multifd_send_state->params[i];
>>>
>>>          qemu_mutex_init(&p->mutex);
>>> +        qemu_sem_init(&p->create_sem, 0);
>>>          qemu_sem_init(&p->sem, 0);
>>>          qemu_sem_init(&p->sem_sync, 0);
>>>          p->quit = false;
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index c0cdcccb75..b3e864a22b 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -2937,6 +2937,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>>      RAMBlock *block;
>>>      int ret;
>>>
>>> +    bql_unlock();
>>> +    ret = multifd_send_channels_created();
>>> +    bql_lock();
>>> +    if (ret < 0) {
>>> +        return ret;
>>> +    }
>>> +
>>>      if (compress_threads_save_setup()) {
>>>          return -1;
>>>      }
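One note on the bql_unlock()/bql_lock() pair above: I assume the point
is that a channel's creation path may itself need the BQL, so blocking
on create_sem while holding it could deadlock. The pattern in
standalone form (made-up names, not QEMU code; "big_lock" plays the
role of the BQL):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t created;

static void *channel_setup(void *opaque)
{
    (void)opaque;
    pthread_mutex_lock(&big_lock);   /* setup needs the global lock too */
    pthread_mutex_unlock(&big_lock);
    sem_post(&created);
    return NULL;
}

int main(void)
{
    pthread_t thread;

    sem_init(&created, 0, 0);
    pthread_mutex_lock(&big_lock);
    pthread_create(&thread, NULL, channel_setup, NULL);

    pthread_mutex_unlock(&big_lock); /* drop the lock...                */
    sem_wait(&created);              /* ...so this wait cannot deadlock */
    pthread_mutex_lock(&big_lock);   /* ...then retake it               */

    printf("channel created\n");
    pthread_mutex_unlock(&big_lock);
    pthread_join(thread, NULL);
    return 0;
}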