Daniel P. Berrangé <berra...@redhat.com> writes:

> On Tue, Aug 05, 2025 at 10:44:41AM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berra...@redhat.com> writes:
>>
>> > On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote:
>> >> Juraj Marcin <jmar...@redhat.com> writes:
>> >>
>> >> > Hi Daniel,
>> >> >
>> >> > On 2025-08-01 18:02, Daniel P. Berrangé wrote:
>> >> >> This is a followup to previously merged patches that claimed to
>> >> >> work around the gnutls bug impacting migration, but in fact were
>> >> >> essentially non-functional. Juraj Marcin pointed this out, and
>> >> >> this new patch tweaks the workaround to make it actually do
>> >> >> something useful.
>> >> >>
>> >> >> Daniel P. Berrangé (2):
>> >> >>   migration: simplify error reporting after channel read
>> >> >>   migration: fix workaround for gnutls thread safety
>> >> >>
>> >> >>  crypto/tlssession.c   | 16 ----------------
>> >> >>  migration/qemu-file.c | 22 +++++++++++++++++-----
>> >> >>  2 files changed, 17 insertions(+), 21 deletions(-)
>> >> >>
>> >> >
>> >> > thanks for finding a fix for the workaround. I have tested it and it
>> >> > resolves the issue.
>> >> >
>> >> > However, it significantly slows down migration, even with the workaround
>> >> > disabled (and thus no locking). When benchmarking, I used the fixed
>> >> > version of GNUTLS and a VM with 20GB of RAM that was fully written to
>> >> > before starting a normal migration with no workload during the
>> >> > migration.
>> >> >
>> >> > Test cases:
>> >> > [1]: before this patchset
>> >> > [2]: with this patchset applied and GNUTLS workaround enabled
>> >> > [3]: with this patchset applied and GNUTLS workaround disabled
>> >> >
>> >> >   | Total time | Throughput | Transferred bytes |
>> >> > --+------------+------------+-------------------+
>> >> > 1 |  31 192 ms |  5450 mbps |    21 230 973 763 |
>> >> > 2 |  74 147 ms |  2291 mbps |    21 232 849 066 |
>> >> > 3 |  72 426 ms |  2343 mbps |    21 215 009 392 |
>> >>
>> >> Thanks for testing this. I had just managed to convince myself that
>> >> there wouldn't be any performance issues.
>> >>
>> >> The yield at every buffer fill on the incoming side is probably way
>> >> more impactful than the poll on the RP.
>> >
>> > Yeah, that's an unacceptable penalty on the incoming side for sure.
>> >
>> > How about we simply change the outbound migration channel to be in
>> > non-blocking mode? I originally put it in blocking mode way back
>> > in 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the
>> > QEMUFile impl of qemu_fill_buffer and qemu_fflush, both should be
>> > coping with a non-blocking socket. qemu_fill_buffer has explicit
>> > code to wait, and qemu_fflush uses the _all() variant which has
>> > built-in support for waiting. So I'm not seeing an obvious need
>> > to run the channel in blocking mode.
>> >
>>
>> It's definitely simpler and I think it works. It's uncomfortably late,
>> though, to add a bunch of glib event loop code to the migration
>> thread. Is the suggestion of moving the yield to tlssession.c even
>> feasible?
>
> Well that'll remove the burden for the non-TLS incoming migration,
> but the incoming TLS migration will still have the redundant
> yields and so still suffer a hit.
>
> Given where we are in freeze, I'm thinking we should just hard
> disable the workaround for this release, re-attempt it in the
> next cycle, and then we can bring it back to stable afterwards.
>
Yes, I agree. Will you send a patch?

> With regards,
> Daniel
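
For reference, a rough sketch of what the non-blocking suggestion would
amount to on the outgoing side. The helper function and its name below
are made up for illustration; qio_channel_set_blocking() is the existing
QIOChannel API, and whether the rest of the outgoing path really copes
with a non-blocking socket is exactly what would need testing:

  #include "qemu/osdep.h"
  #include "io/channel.h"

  /*
   * Illustrative helper only: instead of forcing the outbound migration
   * channel into blocking mode (as done since 9e4d2b98ee98), leave it
   * non-blocking and rely on qemu_fill_buffer() waiting explicitly and
   * qemu_fflush() using the _all() variant, which waits internally.
   */
  static void migration_outgoing_channel_setup(QIOChannel *ioc, Error **errp)
  {
      /* Previously: qio_channel_set_blocking(ioc, true, NULL); */
      qio_channel_set_blocking(ioc, false, errp);
  }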