Hi,
Vishal and I have been benchmarking the impact of the various Tx-batching
patches on the performance of OVS in the phy-VM-phy scenario with different
applications in the VM.
The OVS versions we tested are:
(master): OVS master (
(Ilya-3): Output batching within one Rx batch:
    (master) + [PATCH v3 1-3/4] Output packet batching
(Ilya-6): Time-based output batching with us resolution using
    CLOCK_MONOTONIC:
    (Ilya-3) + [PATCH RFC v3 4/4] dpif-netdev: Time based output batching +
    [PATCH RFC 1/2] timeval: Introduce time_usec() +
    [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for
    output-max-latency.
(Ilya-4-Jan): Time-based output batching with us resolution using TSC cycles:
    (Ilya-3) + [PATCH RFC v3 4/4] dpif-netdev: Time based output batching +
    Incremental patch using TSC cycles in
    https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337402.html
Application 1: iperf server as representative for kernel applications:
The iperf server executes in a VM with 2 vCPUs where both virtio interrupts and
iperf process are pinned to the same vCPU for best performance. The iperf
client also runs in a VM on a different server. OVS nodes on client and server
are configured identically.
                            iperf  Avg. PMD               iperf CPU  Ping
OVS version    max-latency  Gbps   cycles/pkt   PMD util  host util  rtt
--------------------------------------------------------------------------
Master         n/a          6.83   1708.63      43.50%    100%       39 us
Ilya-3         n/a          6.88   1951.35      47.17%    100%       40 us
Ilya-6         50 us        7.83   1049.21      31.74%    99.7%      228 us
Ilya-4-Jan     50 us        7.75   1086.2       30.65%    99.7%      230 us
Discussion:
- Without time-based Tx batching the iperf server CPU is the bottleneck due to
virtio interrupt load.
- Ilya-3 does not provide any benefit.
- With 50 us time-based batching the PMD load reduces by about 1/3rd (fewer
kicks to the virtio eventfd).
- The iperf throughput increases by 15%, still limited by the vCPU capacity.
But the bottleneck moves from the virtio interrupt handlers in the guest kernel
to the TCP stack and iperf process. With multiple threads iperf can fully load
the 10G physical link.
- As expected the RTT increases by about 190 us ~= 4 * 50 us (2 OVS hops each
on server and client side).
- There is no significant difference between the CLOCK_MONOTONIC and the
TSC-based implementations.
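
The arithmetic behind the RTT and throughput bullets can be sketched as
follows. This is only an illustration: the helper names are made up for this
mail, and the bound assumes that each batching point holds a packet for at most
the configured maximum latency.

```c
#include <assert.h>

/* Hypothetical helper: upper bound on the added round-trip delay when each of
 * 'batching_points' output stages may hold a packet for up to
 * 'max_latency_us' microseconds. */
static unsigned rtt_increase_bound_us(unsigned batching_points,
                                      unsigned max_latency_us)
{
    return batching_points * max_latency_us;
}

/* Hypothetical helper: relative change in percent from 'baseline' to
 * 'value'. Positive means an increase. */
static double percent_change(double baseline, double value)
{
    assert(baseline != 0.0);
    return (value - baseline) / baseline * 100.0;
}
```

With the numbers from the table, rtt_increase_bound_us(4, 50) gives a bound of
200 us against a measured increase of 228 - 39 = 189 us, and
percent_change(6.83, 7.83) gives roughly 14.6%, which is the 15% quoted above.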
Application 2: dpdk pktgen as representative for DPDK applications:
OVS version    max-latency   Mpps   Avg. PMD cycles/pkt   PMD utilization
--------------------------------------------------------------------------
Master         n/a           3.92   305.43                99.65%
Ilya-3         n/a           3.84   310.58                99.31%
Ilya-6         0 us          3.82   312.47                99.67%
Ilya-6         50 us         3.80   314.60                99.65%
Ilya-4-Jan     50 us         3.78   313.65                98.86%
Discussion:
- For DPDK applications in the VM Tx batching does not provide any throughput
benefit.
- At full PMD load the output batching overhead causes a capacity drop of 2-3%.
- There is no significant difference between CLOCK_MONOTONIC and TSC
implementations.
- perf top measurements indicate that the clock_gettime system call eats about
0.6% of the PMD cycles. That does not seem enough to justify replacing it with
a TSC-based time implementation.
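
For reference, a microsecond-resolution monotonic clock built on
clock_gettime(CLOCK_MONOTONIC) could look roughly like the sketch below. This
is not the code from the patches: monotonic_usec() and flush_due() are
hypothetical names, and the flush check only illustrates the idea of sending a
queued batch once it has waited for the configured maximum latency.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: current monotonic time in microseconds. */
static uint64_t monotonic_usec(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void) rc;   /* Avoid an unused-variable warning when NDEBUG is set. */
    return (uint64_t) ts.tv_sec * 1000000u + (uint64_t) ts.tv_nsec / 1000u;
}

/* Hypothetical helper: true once a batch whose first packet was queued at
 * 'first_queued_us' has waited at least 'max_latency_us'. */
static bool flush_due(uint64_t now_us, uint64_t first_queued_us,
                      uint64_t max_latency_us)
{
    return now_us - first_queued_us >= max_latency_us;
}
```

A polling loop would call monotonic_usec() once per iteration and pass the
result to flush_due() for each port with queued packets, so the clock is read
once per iteration rather than once per packet.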
A zip file with the detailed measurement results can be downloaded from
https://drive.google.com/open?id=0ByBuumQUR_NYNlRzbUhJX2R6NW8
Conclusions:
-----------------
1. Time based Tx-batching provides significant performance improvements for
kernel-based applications.
2. DPDK applications do not benefit in throughput but suffer from the latency
increase.
3. The worst case overhead implied by Tx batching is about 3% and should be
acceptable.
4. As there is the obvious trade-off between throughput improvement and latency
increase, the maximum output latency should be a configuration option. Ideally
OVS should have a default parameter per switch and an additional parameter per
interface to override the default parameter.
5. Ilya's CLOCK_MONOTONIC implementation seems efficient enough. There is no
urgent need to replace it with a TSC-based clock.
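
To illustrate conclusion 4, a per-switch default with a per-interface override
could be resolved along the lines of the sketch below. The struct and function
names are hypothetical and do not correspond to existing OVS code or database
columns. An explicit flag marks the override so that 0 us stays a valid
per-interface value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-interface setting. 'has_override' tells an explicit
 * 0 us apart from "not configured". */
struct iface_cfg {
    bool has_override;
    uint32_t max_latency_us;
};

/* Hypothetical helper: the interface value wins when present, otherwise
 * the switch-wide default applies. */
static uint32_t effective_max_latency_us(uint32_t switch_default_us,
                                         const struct iface_cfg *iface)
{
    assert(iface != NULL);
    return iface->has_override ? iface->max_latency_us : switch_default_us;
}
```

With a switch default of 50 us, an interface without an override would get
50 us, while a port serving a latency-sensitive DPDK application could set an
override of 0 us.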
Regards, Jan and Vishal
> -----Original Message-----
> From: Ilya Maximets [mailto:[email protected]]
> Sent: Monday, 14 August, 2017 14:10
> To: [email protected]; Jan Scheurich
> <[email protected]>
> Cc: Bhanuprakash Bodireddy <[email protected]>;
> Heetae Ahn <[email protected]>; Vishal Deep Ajmera
> <[email protected]>; Ilya Maximets
> <[email protected]>
> Subject: [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for
> output-max-latency.