Re: [ovs-dev] [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for output-max-latency.

Jan Scheurich Sat, 02 Sep 2017 08:15:07 -0700

Hi,

Vishal and I have been benchmarking the impact of the various Tx-batching 
patches on the performance of OVS in the phy-VM-phy scenario with different 
applications in the VM.


The OVS versions we tested are:

(master):       OVS master (
(Ilya-3):       Output batching within one Rx batch:
                (master) + [PATCH v3 1-3/4] Output packet batching
(Ilya-6):       Time-based output batching with us resolution using CLOCK_MONOTONIC
                (Ilya-3) + [PATCH RFC v3 4/4] dpif-netdev: Time based output batching +
                [PATCH RFC 1/2] timeval: Introduce time_usec() +
                [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for output-max-latency.
(Ilya-4-Jan):   Time-based output batching with us resolution using TSC cycles
                (Ilya-3) + [PATCH RFC v3 4/4] dpif-netdev: Time based output batching +
                Incremental patch using TSC cycles in
                https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337402.html

Application 1: iperf server as a representative of kernel applications:

The iperf server executes in a VM with 2 vCPUs where both virtio interrupts and 
iperf process are pinned to the same vCPU for best performance. The iperf 
client also runs in a VM on a different server. OVS nodes on client and server 
are configured identically.

                Iperf                                 iperf CPU  Ping
OVS version      Gbps   Avg.PMD cycles/pkt  PMD util  host util  rtt
------------------------------------------------------------------------
Master           6.83        1708.63        43.50%      100%     39 us
Ilya-3           6.88        1951.35        47.17%      100%     40 us
Ilya-6 50 us     7.83        1049.21        31.74%      99.7%   228 us
Ilya-4-Jan 50 us 7.75        1086.2         30.65%      99.7%   230 us

Discussion:
- Without time-based Tx batching the iperf server CPU is the bottleneck due to 
virtio interrupt load.
- Ilya-3 does not provide any benefit.
- With 50 us time-based batching the PMD load drops by about one third (fewer 
kicks to the virtio eventfd).
- The iperf throughput increases by 15%, still limited by the vCPU capacity. 
But the bottleneck moves from the virtio interrupt handlers in the guest kernel 
to the TCP stack and the iperf process. With multiple threads, iperf can fully 
load the 10G physical link.
- As expected, the RTT increases by about 190 us ~= 4 * 50 us (2 OVS hops each 
on the server and client side).
- There is no significant difference between the CLOCK_MONOTONIC and the 
TSC-based implementations.
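
To illustrate why a latency budget reduces the number of kicks, here is a 
minimal toy model in Python. It is not the OVS implementation: the class, the 
function names, the burst size of 32 and the 5 us packet spacing are all 
assumptions made for the sketch. Packets are queued on an output port and 
flushed either when the batch is full or when the first queued packet has 
waited max_latency_us.

```python
class OutputBatcher:
    """Toy model of time-based Tx batching on one output port (hypothetical)."""

    def __init__(self, max_latency_us, max_batch=32):
        self.max_latency_us = max_latency_us
        self.max_batch = max_batch      # assumed burst size for this sketch
        self.queue = []
        self.deadline_us = None
        self.flushes = 0                # one flush models one kick to the guest

    def enqueue(self, pkt, now_us):
        # The first packet of a batch starts the latency timer.
        if not self.queue:
            self.deadline_us = now_us + self.max_latency_us
        self.queue.append(pkt)
        self.poll(now_us)

    def poll(self, now_us):
        # Flush when the batch is full or the oldest packet hit its deadline.
        if self.queue and (len(self.queue) >= self.max_batch
                           or now_us >= self.deadline_us):
            self.flushes += 1
            self.queue.clear()
            self.deadline_us = None


def count_flushes(max_latency_us, n_pkts=100, gap_us=5):
    """Send n_pkts packets spaced gap_us apart and count the flushes."""
    batcher = OutputBatcher(max_latency_us)
    for i in range(n_pkts):
        batcher.enqueue(i, i * gap_us)
    # Final poll late enough to drain whatever is still queued.
    batcher.poll(n_pkts * gap_us + max_latency_us)
    return batcher.flushes


if __name__ == "__main__":
    for latency in (0, 50):
        print(latency, "us ->", count_flushes(latency), "flushes")
```

With a budget of 0 us every packet is flushed on its own, while a 50 us budget 
lets several packets share one flush. The price is that a packet can wait up to 
the full budget at each batching hop, which is where the roughly 4 * 50 us RTT 
increase comes from.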


Application 2: dpdk pktgen as a representative DPDK application:

OVS version  max-latency  Mpps   Avg.PMD cycles/pkt  PMD utilization
----------------------------------------------------------------------
Master       n/a          3.92        305.43         99.65%
Ilya-3       n/a          3.84        310.58         99.31%
Ilya-6       0 us         3.82        312.47         99.67%
Ilya-6       50 us        3.80        314.60         99.65%
Ilya-4-Jan   50 us        3.78        313.65         98.86%

Discussion:
- For DPDK applications in the VM Tx batching does not provide any throughput 
benefit.
- At full PMD load the output batching overhead causes a capacity drop of 2-3%.
- There is no significant difference between CLOCK_MONOTONIC and TSC 
implementations.
- perf top measurements indicate that the clock_gettime system call consumes 
about 0.6% of the PMD cycles. This does not appear to be enough to justify 
replacing it with a TSC-based time implementation.

A zip file with the detailed measurement results can be downloaded from 
https://drive.google.com/open?id=0ByBuumQUR_NYNlRzbUhJX2R6NW8


Conclusions: 
-----------------
1. Time-based Tx-batching provides significant performance improvements for 
kernel-based applications.
2. DPDK applications do not benefit in throughput but suffer from the latency 
increase.
3. The worst case overhead implied by Tx batching is about 3% and should be 
acceptable.
4. As there is the obvious trade-off between throughput improvement and latency 
increase, the maximum output latency should be a configuration option. Ideally 
OVS should have a default parameter per switch and an additional parameter per 
interface to override the default parameter.
5. Ilya's CLOCK_MONOTONIC implementation seems efficient enough. There is no 
urgent need to replace it with a TSC-based clock.
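
The configuration scheme suggested in point 4 could resolve the effective 
value as sketched below. This is only an illustration in Python: the function 
name, the dictionary of overrides and the interface names are hypothetical and 
do not correspond to any existing OVS database column or option.

```python
def effective_max_latency_us(switch_default_us, iface_overrides, iface):
    """Return the output max latency (in us) that applies to one interface.

    switch_default_us: per-switch default (hypothetical parameter).
    iface_overrides:   dict mapping interface name -> override in us
                       (hypothetical; only interfaces with an override appear).
    """
    override = iface_overrides.get(iface)
    # An explicit override wins, including an explicit 0 (batching disabled).
    return switch_default_us if override is None else override


if __name__ == "__main__":
    # Example: batch for 50 us by default, but keep a latency-sensitive
    # DPDK guest port (name invented for the example) at 0 us.
    overrides = {"vhost-dpdk-vm": 0}
    for name in ("vhost-iperf-vm", "vhost-dpdk-vm"):
        print(name, effective_max_latency_us(50, overrides, name), "us")
```

Treating an explicit 0 as a valid override matters here, since latency-sensitive 
DPDK guests are exactly the ports one would want to exclude from batching.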

Regards, Jan and Vishal

> -----Original Message-----
> From: Ilya Maximets [mailto:[email protected]]
> Sent: Monday, 14 August, 2017 14:10
> To: [email protected]; Jan Scheurich
> <[email protected]>
> Cc: Bhanuprakash Bodireddy <[email protected]>;
> Heetae Ahn <[email protected]>; Vishal Deep Ajmera
> <[email protected]>; Ilya Maximets
> <[email protected]>
> Subject: [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for
> output-max-latency.