Peter Xu <[email protected]> writes:

> On Fri, Sep 19, 2025 at 10:50:56AM -0300, Fabiano Rosas wrote:
>> Peter Xu <[email protected]> writes:
>>
>> > On Thu, Sep 18, 2025 at 06:17:37PM -0300, Fabiano Rosas wrote:
>> >> > ============= ABOUT OLD PATCH 2 ===================
>> >> >
>> >> > I dropped it for now to unblock almost patch 1, because patch 1
>> >> > will fix a real warning that can be triggered for not only qtest
>> >> > but also normal tls postcopy migration.
>> >> >
>> >> > While I was looking at temporary settings for multifd send
>> >> > iochannels to be blocking always, I found I cannot explain how
>> >> > migration_tls_channel_end() currently works, because it writes to
>> >> > the multifd iochannels while the channels should still be owned
>> >> > (and can be written at the same time?) by the sender threads. It
>> >> > sounds like a thread-safety issue, or is it not?
>> >> >
>> >>
>> >> IIUC, the multifd channels will be stuck at p->sem because this is the
>> >> success path so migration will have already finished when we reach
>> >> migration_cleanup(). The ram/device state migration will hold the main
>> >> thread until the multifd channels finish transferring.
>> >
>> > For success cases, indeed. However this is not the success path? After
>> > all, we check migration_has_failed().
>> >
>>
>> My point is that when we reach here, if migration has succeeded, then it
>> should be ok. If not, then thread-safety doesn't matter because things
>> have already gone bad, we'll lose the destination anyway.
>
> I'm not sure if it matters or not, maybe it depends on how bad it is when a
> race happened.
>
> If it's a tcp channel, it might be easier; the worst case is we write()
> concurrently in two threads and the output stream, IIUC, can be interleaved
> with the two buffers we write. Not an issue if migration failed anyway.
>
> However this is only needed for TLS, hence I have no idea what happens if
> gnutls writes concurrently. I don't think GnuTLS supports concurrent
> writers. I'm not sure if it means there's still chance src QEMU (when
> having a failed live migration) can crash.
>
> So.. I still think it might be wise we only bye() after knowing it is a
> success, not only because that looks like the only way to make sure it's
> thread-safe, but also because a bye() is only needed if it didn't fail.
> Sending it ignoring error is another way of doing so, but it doesn't avoid
> the possible result of a race (even if I totally agree it is unlikely..).
>

ok

>>
>> > Should I then send a patch to only send bye() when succeeded? Then I can
>> > also add some comment. I wished we could assert. Then the "temporarily
>> > changing nonblock mode" will also rely on this one, because ideally we
>> > shouldn't touch the fd nonblocking mode if some other thread is operating
>> > on it.
>> >
>>
>> I don't know if it changes much. Currently we basically always ignore
>> the error from bye().
>>
>> > The other thing is I also think we shouldn't rely on checking
>> > "p->tls_thread_created && p->thread_created" but only rely on channel type,
>> > which might be more straightforward (I almost did it in v1, but v2 rewrote
>> > things so it was lost).
>>
>> Ok, but we may need to ensure bye() is not called before the session is
>> initiated. So thread_created may still be needed?
>
> In v1, I was using "object_dynamic_cast((Object *)c, TYPE_QIO_CHANNEL_TLS)":
>
> https://lore.kernel.org/all/[email protected]/
>
> Would that work the same, but without relying on "thread_created"
> vars?

Ok, I'm convinced. migration_cleanup() -> multifd_send_shutdown() -> bye()
cannot happen before thread_created=true because multifd_send_setup() blocks
the migration_thread until the channels have been fully created.

Go ahead then!
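
For what it's worth, the guard discussed above could look roughly like the
sketch below. This is only an illustration under stated assumptions: Chan,
ChanKind, chan_bye() and shutdown_channel() are hypothetical stand-ins, not
QEMU types or functions. In QEMU the kind check would presumably be the
object_dynamic_cast() test quoted above, and the "failed" flag would come from
migration_has_failed().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real QEMU types. */
typedef enum { CHAN_PLAIN, CHAN_TLS } ChanKind;

typedef struct {
    ChanKind kind;   /* stands in for the QOM channel type check */
    int bye_sent;    /* counts how many times bye() was issued */
} Chan;

/* Stand-in for the TLS termination write on the channel. */
static void chan_bye(Chan *c)
{
    c->bye_sent++;
}

/*
 * Send bye() only on TLS channels and only after a successful migration,
 * when the sender threads are known to be idle, so nothing else can be
 * writing to the channel concurrently. Returns true if bye() was sent.
 */
static bool shutdown_channel(Chan *c, bool migration_failed)
{
    assert(c != NULL);

    if (c->kind != CHAN_TLS) {
        return false;            /* plain channels need no bye() */
    }
    if (migration_failed) {
        return false;            /* sender thread may still own the channel */
    }
    chan_bye(c);
    return true;
}
```

The decision depends only on the channel kind and the migration outcome, so no
"thread_created" style flags are consulted.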
