Hi Avi,

Sorry, I got distracted from this ...

On Sat, 2008-07-26 at 12:45 +0300, Avi Kivity wrote:
> Mark McLoughlin wrote:

> >   1) The length of the tx mitigation timer makes quite a difference to
> >      throughput achieved; we probably need a good heuristic for
> >      adjusting this on the fly.
> >   
> 
> The tx mitigation timer is just one part of the equation; the other is 
> the virtio ring window size, which is now fixed.
> 
> Using a maximum sized window is good when the guest and host are running 
> flat out, doing nothing but networking.  When throughput drops (because 
> the guest is spending cpu on processing, or simply because the other 
> side is not keeping up), we need to drop the window size so as to 
> retain acceptable latencies.
> 
> The tx timer can then be set to "a bit after the end of the window", 
> acting as a safety belt in case the throughput changes.

i.e. the tx timer should give just enough time for a flat out guest to
fill the ring, and no more?

Yep, that's basically what lguest's tx timer heuristic is aiming for
AFAICT. 

> >   4) Dropping the global mutex while reading GSO packets from the tap
> >      interface gives a nice speedup. This highlights the global mutex
> >      as a general performance issue.
> >
> >   
> 
> Not sure whether this is safe.  What's stopping the guest from accessing 
> virtio and changing some state?

With the current code, the virtio state should be consistent before we
drop the mutex. The I/O thread would only drop the lock while it reads
into the tap buffer and then grab the lock again before popping a buffer
from the ring and copying to it.

With Anthony's zero-copy patch, the situation is less clear - we pop a
buffer from the avail ring, drop the lock, read() into the buffer, grab
the lock and then push the buffer back onto the used ring. While the
mutex is released, the guest could e.g. reset the ring and release the
buffer which we're in the process of read()ing into.

So, yes - dropping the mutex during read() in the zero-copy patch isn't
safe.

Another potential concern is that if we drop the mutex, the guest thread
could delete an I/O handler while the I/O thread is in the I/O handler
loop in main_loop_wait(). However, the code appears to handle this
situation already - the I/O handler would only be marked as deleted, and
then ignored by the loop.

Oh, and there's the posix-timers/signalfd race condition that I can only
seem to trigger when dropping the mutex.

> >   5) Eliminating an extra copy on the host->guest path only makes a
> >      barely measurable difference.
> >
> >   
> 
> That's expected on a host->guest test.  Zero copy is mostly important 
> for guest->external, and with zerocopy already enabled in the guest 
> (sendfile or nfs server workloads).

Hmm, could you elaborate on that?

The copy we're eliminating here is an intermediate copy from tapfd into
a buffer before copying to a guest buffer. It doesn't give you zero-copy
as we still copy from kernel space to user space and vice-versa.

> >         Anyway, the figures:
> >
> >   netperf, 10x20s runs (Gb/s)  |       guest->host          |       host->guest
> >   -----------------------------+----------------------------+---------------------------
> >   baseline                     | 1.520/ 1.573/ 1.610/ 0.034 | 1.160/ 1.357/ 1.630/ 0.165
> >   50us tx timer + rearm        | 1.050/ 1.086/ 1.110/ 0.017 | 1.710/ 1.832/ 1.960/ 0.092
> >   250us tx timer + rearm       | 1.700/ 1.764/ 1.880/ 0.064 | 0.900/ 1.203/ 1.580/ 0.205
> >   150us tx timer + rearm       | 1.520/ 1.602/ 1.690/ 0.044 | 1.670/ 1.928/ 2.150/ 0.141
> >   no ring-full heuristic       | 1.480/ 1.569/ 1.710/ 0.066 | 1.610/ 1.857/ 2.140/ 0.153
> >   VIRTIO_F_NOTIFY_ON_EMPTY     | 1.470/ 1.554/ 1.650/ 0.054 | 1.770/ 1.960/ 2.170/ 0.119
> >   recv NO_NOTIFY               | 1.530/ 1.604/ 1.680/ 0.047 | 1.780/ 1.944/ 2.190/ 0.129
> >   GSO                          | 4.120/ 4.323/ 4.420/ 0.099 | 6.540/ 7.033/ 7.340/ 0.244
> >   ring size == 256             | 4.050/ 4.406/ 4.560/ 0.143 | 6.280/ 7.236/ 8.280/ 0.613
> >   ring size == 512             | 4.420/ 4.600/ 4.960/ 0.140 | 6.470/ 7.205/ 7.510/ 0.314
> >   drop mutex during tapfd read | 4.320/ 4.578/ 4.790/ 0.161 | 8.370/ 8.589/ 8.730/ 0.120
> >   aligouri zero-copy           | 4.510/ 4.694/ 4.960/ 0.148 | 8.430/ 8.614/ 8.840/ 0.142
> >   
> 
> Very impressive numbers; much better than I expected.  The host->guest 
> numbers are around 100x better than the original emulated card throughput 
> we got from kvm.

Hmm, I had intended to post comparison numbers for e1000; if I re-run
these again, I'll do that too. I think it's about 0.5 Gb/s guest->host
and 1.5 Gb/s host->guest.

Note also that with current Fedora rawhide kernels the numbers aren't
nearly as good as the numbers using my own kernel builds. This could be
down to the fact that we enable pretty much all kernel debugging options
during this phase of Fedora development. If I get a chance, I'll see if
I can confirm that suspicion.

> >   ping -f -c 100000 (ms)       |       guest->host          |       host->guest
> >   -----------------------------+----------------------------+---------------------------
> >   baseline                     | 0.060/ 0.459/ 7.602/ 0.846 | 0.067/ 0.331/ 2.517/ 0.057
> >   50us tx timer + rearm        | 0.081/ 0.143/ 7.436/ 0.374 | 0.093/ 0.133/ 1.883/ 0.026
> >   250us tx timer + rearm       | 0.302/ 0.463/ 7.580/ 0.849 | 0.297/ 0.344/ 2.128/ 0.028
> >   150us tx timer + rearm       | 0.197/ 0.323/ 7.671/ 0.740 | 0.199/ 0.245/ 7.836/ 0.037
> >   no ring-full heuristic       | 0.182/ 0.324/ 7.688/ 0.753 | 0.199/ 0.243/ 2.197/ 0.030
> >   VIRTIO_F_NOTIFY_ON_EMPTY     | 0.197/ 0.321/ 7.447/ 0.730 | 0.196/ 0.242/ 2.218/ 0.032
> >   recv NO_NOTIFY               | 0.186/ 0.321/ 7.520/ 0.732 | 0.200/ 0.233/ 2.216/ 0.028
> >   GSO                          | 0.178/ 0.324/ 7.667/ 0.736 | 0.147/ 0.246/ 1.361/ 0.024
> >   ring size == 256             | 0.184/ 0.323/ 7.674/ 0.728 | 0.199/ 0.243/ 2.181/ 0.028
> >   ring size == 512             |             (not measured) |             (not measured)
> >   drop mutex during tapfd read | 0.183/ 0.323/ 7.820/ 0.733 | 0.202/ 0.242/ 2.219/ 0.027
> >   aligouri zero-copy           | 0.185/ 0.325/ 7.863/ 0.736 | 0.202/ 0.245/ 7.844/ 0.036
> >
> >   
> 
> This isn't too good.  Low latency is important for nfs clients (or other 
> request/response workloads).  I think we can keep these low by adjusting 
> the virtio window (for example, on an idle system it should be 1), so 
> that the tx mitigation timer only fires when the workload transitions 
> from throughput to request/response.

The other suggestion to help latency is to always flush the queue
immediately on guest notification, before setting the timer.

Cheers,
Mark.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
