Peter Xu <[email protected]> writes:

> On Fri, Sep 19, 2025 at 10:50:56AM -0300, Fabiano Rosas wrote:
>> Peter Xu <[email protected]> writes:
>> 
>> > On Thu, Sep 18, 2025 at 06:17:37PM -0300, Fabiano Rosas wrote:
>> >> > ============= ABOUT OLD PATCH 2 ===================
>> >> >
>> >> > I dropped it for now to unblock almost patch 1, because patch 1 will fix a
>> >> > real warning that can be triggered for not only qtest but also normal tls
>> >> > postcopy migration.
>> >> >
>> >> > While I was looking at temporary settings for multifd send iochannels to be
>> >> > blocking always, I found I cannot explain how migration_tls_channel_end()
>> >> > currently works, because it writes to the multifd iochannels while the
>> >> > channels should still be owned (and can be written at the same time?) by
>> >> > the sender threads.  It sounds like a thread-safety issue, or is it not?
>> >> >
>> >> 
>> >> IIUC, the multifd channels will be stuck at p->sem because this is the
>> >> success path so migration will have already finished when we reach
>> >> migration_cleanup(). The ram/device state migration will hold the main
>> >> thread until the multifd channels finish transferring.
>> >
>> > For success cases, indeed.  However this is not the success path?  After
>> > all, we check migration_has_failed().
>> >
>> 
>> My point is that when we reach here, if migration has succeeded, then it
>> should be ok. If not, then thread-safety doesn't matter because things
>> have already gone bad, we'll lose the destination anyway.
>
> I'm not sure if it matters or not, maybe it depends on how bad it is when a
> race happened.
>
> If it's a tcp channel, it might be easier; the worst case is we write()
> concurrently in two threads and the output stream, IIUC, can be interleaved
> with the two buffers we write.  Not an issue if migration failed anyway.
>
> However this is only needed for TLS, hence I have no idea what happens if
> gnutls writes concurrently.  I don't think GnuTLS supports concurrent
> writers.  I'm not sure if it means there's still a chance src QEMU (when
> having a failed live migration) can crash.
>
> So.. I still think it might be wise to only bye() after knowing it is a
> success, not only because that looks like the only way to make sure it's
> thread-safe, but also because a bye() is only needed if it didn't fail.
> Sending it ignoring error is another way of doing so, but it doesn't avoid
> the possible result of a race (even if I totally agree it is unlikely..).
>

ok

>> 
>> > Should I then send a patch to only send bye() when succeeded?  Then I can
>> > also add some comment.  I wish we could assert.  Then the "temporarily
>> > changing nonblock mode" will also rely on this one, because ideally we
>> > shouldn't touch the fd nonblocking mode if some other thread is operating
>> > on it.
>> >
>> 
>> I don't know if it changes much. Currently we basically always ignore
>> the error from bye().
>> 
>> > The other thing is I also think we shouldn't rely on checking
>> > "p->tls_thread_created && p->thread_created" but only rely on channel type,
>> > which might be more straightforward (I almost did it in v1, but v2 rewrote
>> > things so it was lost).
>> 
>> Ok, but we may need to ensure bye() is not called before the session is
>> initiated. So thread_created may still be needed?
>
> In v1, I was using "object_dynamic_cast((Object *)c, TYPE_QIO_CHANNEL_TLS)":
>
> https://lore.kernel.org/all/[email protected]/
>
> Would that work the same, but without relying on "thread_created"
> vars?

Ok, I'm convinced. migration_cleanup() -> multifd_send_shutdown() ->
bye() cannot happen before thread_created=true because
multifd_send_setup() blocks the migration_thread until the channels have
been fully created. Go ahead then!
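
To make sure we mean the same thing, here is a rough, self-contained
sketch of the rule we converged on: only send bye() when migration has
not failed, and decide by channel type rather than by the
thread_created flags. This is not QEMU code. MockChannel,
mock_channel_is_tls() and mock_send_bye_if_safe() are made-up stand-ins
(the type check plays the role of the object_dynamic_cast() against
TYPE_QIO_CHANNEL_TLS from v1), and the counter only stands in for the
actual bye().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the channel types; not the real QOM API. */
typedef enum {
    MOCK_CHANNEL_SOCKET,
    MOCK_CHANNEL_TLS,
} MockChannelType;

typedef struct {
    MockChannelType type;
    int bye_sent;               /* how many times bye was issued */
} MockChannel;

/* Stand-in for the object_dynamic_cast(..., TYPE_QIO_CHANNEL_TLS) check. */
static bool mock_channel_is_tls(const MockChannel *c)
{
    return c != NULL && c->type == MOCK_CHANNEL_TLS;
}

/*
 * Send the TLS bye only when migration did not fail, so that no sender
 * thread can still be writing to the channel, and only for TLS channels.
 * Returns true if a bye was sent.
 */
static bool mock_send_bye_if_safe(MockChannel *c, bool migration_failed)
{
    assert(c != NULL);

    if (migration_failed) {
        /* Sender threads may still own the channel; do not touch it. */
        return false;
    }
    if (!mock_channel_is_tls(c)) {
        /* Plain sockets have no TLS session to terminate. */
        return false;
    }

    c->bye_sent++;              /* placeholder for the real bye() call */
    return true;
}
```

With this shape the failure path never writes to a channel a sender
thread might still be using, which is the property the nonblocking-mode
change would rely on as well.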
