On Tue, Aug 05, 2025 at 10:44:41AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berra...@redhat.com> writes:
>
> > On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote:
> >> Juraj Marcin <jmar...@redhat.com> writes:
> >>
> >> > Hi Daniel,
> >> >
> >> > On 2025-08-01 18:02, Daniel P. Berrangé wrote:
> >> >> This is a followup to previously merged patches that claimed to
> >> >> work around the gnutls bug impacting migration, but in fact were
> >> >> essentially non-functional. Juraj Marcin pointed this out, and
> >> >> this new patch tweaks the workaround to make it actually do
> >> >> something useful.
> >> >>
> >> >> Daniel P. Berrangé (2):
> >> >>   migration: simplify error reporting after channel read
> >> >>   migration: fix workaround for gnutls thread safety
> >> >>
> >> >>  crypto/tlssession.c   | 16 ----------------
> >> >>  migration/qemu-file.c | 22 +++++++++++++++++-----
> >> >>  2 files changed, 17 insertions(+), 21 deletions(-)
> >> >>
> >> >
> >> > Thanks for finding a fix for the workaround. I have tested it and it
> >> > resolves the issue.
> >> >
> >> > However, it significantly slows down migration, even with the
> >> > workaround disabled (and thus no locking). When benchmarking, I used
> >> > the fixed version of GNUTLS and a VM with 20GB of RAM that was fully
> >> > written to before starting a normal migration, with no workload
> >> > during the migration.
> >> >
> >> > Test cases:
> >> > [1]: before this patchset
> >> > [2]: with this patchset applied and GNUTLS workaround enabled
> >> > [3]: with this patchset applied and GNUTLS workaround disabled
> >> >
> >> >   | Total time | Throughput | Transferred bytes |
> >> > --+------------+------------+-------------------+
> >> > 1 | 31 192 ms  | 5450 mbps  |  21 230 973 763   |
> >> > 2 | 74 147 ms  | 2291 mbps  |  21 232 849 066   |
> >> > 3 | 72 426 ms  | 2343 mbps  |  21 215 009 392   |
> >>
> >> Thanks for testing this. I had just managed to convince myself that
> >> there wouldn't be any performance issues.
> >>
> >> The yield at every buffer fill on the incoming side is probably way
> >> more impactful than the poll on the RP.
> >
> > Yeah, that's an unacceptable penalty on the incoming side for sure.
> >
> > How about we simply change the outbound migration channel to be in
> > non-blocking mode? I originally put it in blocking mode way back in
> > 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the
> > QEMUFile impls of qemu_fill_buffer and qemu_fflush, both should be
> > coping with a non-blocking socket. qemu_fill_buffer has explicit code
> > to wait, and qemu_fflush uses the _all() variant which has built-in
> > support for waiting. So I'm not seeing an obvious need to run the
> > channel in blocking mode.
>
> It's definitely simpler and I think it works. It's uncomfortably late
> though to add a bunch of glib event loop code to the migration thread.
> Is the suggestion of moving the yield to tlssession.c even feasible?
Well that'll remove the burden for the non-TLS incoming migration, but
the incoming TLS migration will still have the redundant yields and so
still suffer a hit.

Given where we are in freeze, I'm thinking we should just hard disable
the workaround for this release, re-attempt it in the next cycle, and
then bring it back to stable afterwards.

With regards,
Daniel

--
|: https://berrange.com         -o-   https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org          -o-           https://fstop138.berrange.com :|
|: https://entangle-photo.org   -o-   https://www.instagram.com/dberrange :|