Daniel P. Berrangé <berra...@redhat.com> writes:

> On Tue, Aug 05, 2025 at 10:44:41AM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berra...@redhat.com> writes:
>> 
>> > On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote:
>> >> Juraj Marcin <jmar...@redhat.com> writes:
>> >> 
>> >> > Hi Daniel,
>> >> >
>> >> > On 2025-08-01 18:02, Daniel P. Berrangé wrote:
>> >> >> This is a followup to previously merged patches that claimed to
>> >> >> workaround the gnutls bug impacting migration, but in fact were
>> >> >> essentially non-functional. Juraj Marcin pointed this out, and
>> >> >> this new patch tweaks the workaround to make it actually do
>> >> >> something useful.
>> >> >> 
>> >> >> Daniel P. Berrangé (2):
>> >> >>   migration: simplify error reporting after channel read
>> >> >>   migration: fix workaround for gnutls thread safety
>> >> >> 
>> >> >>  crypto/tlssession.c   | 16 ----------------
>> >> >>  migration/qemu-file.c | 22 +++++++++++++++++-----
>> >> >>  2 files changed, 17 insertions(+), 21 deletions(-)
>> >> >> 
>> >> >
>> >> > Thanks for finding a fix for the workaround. I have tested it and it
>> >> > resolves the issue.
>> >> >
>> >> > However, it significantly slows down migration, even with the workaround
>> >> > disabled (and thus no locking). For the benchmark I used the fixed
>> >> > version of GNUTLS and a VM with 20GB of RAM that was fully written to
>> >> > before starting a normal migration, with no workload running during the
>> >> > migration.
>> >> >
>> >> > Test cases:
>> >> > [1]: before this patchset
>> >> > [2]: with this patchset applied and GNUTLS workaround enabled
>> >> > [3]: with this patchset applied and GNUTLS workaround disabled
>> >> >
>> >> >   | Total time | Throughput | Transferred bytes |
>> >> > --+------------+------------+-------------------+
>> >> > 1 |  31 192 ms |  5450 mbps |    21 230 973 763 |
>> >> > 2 |  74 147 ms |  2291 mbps |    21 232 849 066 |
>> >> > 3 |  72 426 ms |  2343 mbps |    21 215 009 392 |
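
(Sanity-checking the numbers above: throughput = transferred bytes * 8 /
total time, so

   21 230 973 763 B * 8 / 31.192 s ~= 5445 mbps
   21 232 849 066 B * 8 / 74.147 s ~= 2291 mbps
   21 215 009 392 B * 8 / 72.426 s ~= 2343 mbps

which matches the reported columns, i.e. the patchset roughly halves
effective bandwidth whether or not the workaround is enabled.)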
>> >> 
>> >> Thanks for testing this. I had just managed to convince myself that there
>> >> wouldn't be any performance issues.
>> >> 
>> >> The yield at every buffer fill on the incoming side is probably way more
>> >> impactful than the poll on the RP.
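
To spell out what I meant by the per-fill yield (illustration only; the
function name is made up and this is not the code from the merged
workaround):

  #include "qemu/osdep.h"
  #include "qemu/coroutine.h"
  #include "io/channel.h"

  /* Hypothetical shape of a per-buffer-fill yield on the incoming side:
   * every fill pays a trip through the main loop before reading, even
   * when data is already queued on the socket. */
  static ssize_t coroutine_fn fill_buffer_with_forced_yield(QIOChannel *ioc,
                                                            char *buf,
                                                            size_t len)
  {
      /* give up the coroutine unconditionally before every read */
      qio_channel_yield(ioc, G_IO_IN);

      return qio_channel_read(ioc, buf, len, NULL);
  }

With 20GB of dirty RAM that is a lot of extra scheduling round trips,
which would explain the numbers above.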
>> >
>> > Yeah, that's an unacceptable penalty on the incoming side for sure.
>> >
>> > How about we simply change the outbound migration channel to be in
>> > non-blocking mode? I originally put it in blocking mode way back in
>> > 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the
>> > QEMUFile impl of qemu_fill_buffer and qemu_fflush, both should be
>> > coping with a non-blocking socket. qemu_fill_buffer has explicit
>> > code to wait, and qemu_fflush uses the _all() variant which has
>> > built-in support for waiting. So I'm not seeing an obvious need
>> > to run the channel in blocking mode.
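
If I read you right, the change would be essentially just this (a sketch
on my side, not a tested patch; the helper name and the exact call site
in the migration channel setup code are my assumptions):

  #include "qemu/osdep.h"
  #include "io/channel.h"

  static void migration_outbound_set_nonblocking(QIOChannel *ioc)
  {
      /* Let writes return QIO_CHANNEL_ERR_BLOCK instead of blocking
       * inside gnutls/sendmsg(); the QEMUFile layer already copes:
       * qemu_fill_buffer() explicitly waits on ERR_BLOCK and
       * qemu_fflush() goes via the _all() writev variant, which has
       * built-in waiting. */
      qio_channel_set_blocking(ioc, false, NULL);
  }

i.e. effectively flipping the flag that
9e4d2b98ee98f4cee50d671e500eceeefa751ee0 originally set to blocking.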
>> >
>> 
>> It's definitely simpler and I think it works. It's uncomfortably late
>> though to add a bunch of glib event loop code to the migration
>> thread. Is the suggestion of moving the yield to tlssession.c even
>> feasible?
>
> Well that'll remove the burden for the non-TLS incoming migration,
> but the incoming TLS migration will still have the redundant
> yields and so still suffer a hit.
>
> Given where we are in freeze, I'm thinking we should just hard
> disable the workaround for this release, re-attempt it in the
> next cycle, and then bring it back to stable afterwards.
>

Yes, I agree. Will you send a patch?

> With regards,
> Daniel
