Hi Ilya,

I have spent some more time analyzing and thinking through your latest 
proposed patch set for time-based Tx batching:

> (Ilya-6):     Time-based output batching with us resolution using CLOCK_MONOTONIC
>               (master) + [PATCH v3 1-3/4] Output packet batching +
>               [PATCH RFC v3 4/4] dpif-netdev: Time based output batching +
>               [PATCH RFC 1/2] timeval: Introduce time_usec() +
>               [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for output-max-latency.

I would like to suggest that you re-spin a new version where you integrate the 
last three RFC patches as non-RFC with the following changes/additions:

1. Fold in patch http://patchwork.ozlabs.org/patch/800276/ (dpif-netdev: Keep 
latest measured time for PMD thread) to store the time with us resolution in 
the PMD struct. That may seem a small optimization, but it makes the code much 
cleaner and helps avoid unnecessary extra system calls to read 
CLOCK_MONOTONIC.
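
As a minimal sketch of what I mean in item 1 (not the code from the patch): 
the struct pmd_ctx and both function names below are hypothetical stand-ins. 
The idea is only that CLOCK_MONOTONIC is read once per PMD iteration and every 
later user in that iteration reads the cached us value.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the PMD thread context; not the OVS struct. */
struct pmd_ctx {
    int64_t now_us;   /* CLOCK_MONOTONIC sampled once per iteration, in us. */
};

/* Convert a timespec to microseconds (sub-microsecond part is truncated). */
static int64_t
timespec_to_usec(const struct timespec *ts)
{
    return (int64_t) ts->tv_sec * 1000000 + ts->tv_nsec / 1000;
}

/* Refresh the cached time; intended to be called once per PMD iteration.
 * All batching decisions in that iteration then use ctx->now_us instead of
 * reading the clock again. */
static void
pmd_ctx_update_time(struct pmd_ctx *ctx)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void) rc;
    ctx->now_us = timespec_to_usec(&ts);
}
```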

2. Don't set port->output_time when you enqueue a new batch to an output port 
in function dp_execute_cb(), but when you actually send a batch to the netdev 
in dp_netdev_pmd_flush_output_on_port(). This still ensures we don't flush more 
frequently than specified in cur_max_latency (unless the batch size limit is 
reached), but we can avoid any unnecessary delay when packets are received in 
intervals larger than cur_max_latency (at 50 us this would be the case for 
packet rates below 20 kpps!). In this case each packet (batch) would be flushed 
immediately at the end of each iteration, as in non-time-based tx batching.

In this context it might be good to rename the configuration parameter to 
something like "tx-batch-gap".
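
To illustrate the flush logic suggested in item 2, here is a minimal sketch. 
It is not taken from the patch set: struct tx_port, MAX_BATCH and the three 
functions are hypothetical simplifications. The only point is that the 
timestamp is updated when a batch is actually sent, never on enqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BATCH 32   /* Assumed batch size limit for this sketch. */

/* Hypothetical, simplified output port; not the OVS tx_port struct. */
struct tx_port {
    int n_queued;            /* Packets waiting in the output batch. */
    int64_t last_flush_us;   /* Time of the last actual send. */
    int n_flushes;           /* Counts sends, for illustration only. */
};

/* Send the queued batch and record when it was sent. */
static void
port_flush(struct tx_port *p, int64_t now_us)
{
    p->n_queued = 0;
    p->last_flush_us = now_us;
    p->n_flushes++;
}

/* Enqueue one packet. The timestamp is deliberately not touched here;
 * a full batch is sent right away regardless of the gap. */
static void
port_enqueue(struct tx_port *p, int64_t now_us)
{
    assert(p->n_queued < MAX_BATCH);
    if (++p->n_queued == MAX_BATCH) {
        port_flush(p, now_us);
    }
}

/* End-of-iteration check: flush if at least gap_us has passed since the
 * previous send.  Returns true if a batch was sent. */
static bool
port_maybe_flush(struct tx_port *p, int64_t now_us, int64_t gap_us)
{
    if (p->n_queued > 0 && now_us - p->last_flush_us >= gap_us) {
        port_flush(p, now_us);
        return true;
    }
    return false;
}
```

With gap_us = 50, a packet that arrives 100 us after the previous send is 
flushed at the end of the same iteration, while a packet that arrives 10 us 
after a send waits until the 50 us gap has elapsed. Since 1 / 50 us = 20,000 
per second, flows slower than that never see the added delay.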

3. Considering that time-based tx batching is beneficial if and only if the 
guest virtio driver is interrupt-based, I believe it would be best if OVS 
automatically applied time-based tx batching for vhostuser tx queues where the 
driver has requested interrupts. Unfortunately this information is today hidden 
deep in DPDK's rte_vhost library (file virtio_net.c):

        /* Kick the guest if necessary. */
        if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
                        && (vq->callfd >= 0))
                eventfd_write(vq->callfd, (eventfd_t)1);
        return count;

So to automate this, we'd need a new library function in rte_vhost for OVS to 
be able to query this queue property. Perhaps it is not too late to get this 
into DPDK 17.11. How this would interact with the vhostuser PMD is an open 
question.

Having to configure time-based tx batching per port is only a second-best 
option. Nova in OpenStack, for example, does not know whether time-based tx 
batching is appropriate for a vhostuser port, and there is no Neutron port 
attribute today that would help determine that.

Thanks, Jan


> -----Original Message-----
> From: ovs-dev-boun...@openvswitch.org 
> [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of Jan Scheurich
> Sent: Saturday, 02 September, 2017 17:14
> To: d...@openvswitch.org; Ilya Maximets <i.maxim...@samsung.com>
> Subject: Re: [ovs-dev] [PATCH RFC 2/2] dpif-netdev: Use microseconds 
> granularity for output-max-latency.
> 
> Hi,
> 
> Vishal and I have been benchmarking the impact of the several Tx-batching 
> patches on the performance of OVS in the phy-VM-phy
> scenario with different applications in the VM:
> 
> The OVS versions we tested are:
> 
> (master):     OVS master (
> (Ilya-3):     Output batching within one Rx batch :
>               (master) + [PATCH v3 1-3/4] Output packet batching
> (Ilya-6):     Time-based output batching with us resolution using CLOCK_MONOTONIC
>               (Ilya-3) + [PATCH RFC v3 4/4] dpif-netdev: Time based output batching +
>               [PATCH RFC 1/2] timeval: Introduce time_usec() +
>               [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for output-max-latency.
> (Ilya-4-Jan): Time-based output batching with us resolution using TSC cycles
>               (Ilya-3) + [PATCH RFC v3 4/4] dpif-netdev: Time based output batching +
>               Incremental patch using TSC cycles in
>               https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337402.html
> 
> Application 1: iperf server as representative for kernel applications:
> 
> The iperf server executes in a VM with 2 vCPUs where both virtio interrupts 
> and iperf process are pinned to the same vCPU for best
> performance. The iperf client also runs in a VM on a different server. OVS 
> nodes on client and server are configured identically.
> 
>                 Iperf                                 iperf CPU  Ping
> OVS version      Gbps   Avg.PMD cycles/pkt  PMD util  host util  rtt
> ------------------------------------------------------------------------
> Master           6.83        1708.63        43.50%      100%     39 us
> Ilya-3           6.88        1951.35        47.17%      100%     40 us
> Ilya-6 50 us     7.83        1049.21        31.74%      99.7%   228 us
> Ilya-4-Jan 50 us 7.75        1086.2         30.65%      99.7%   230 us
> 
> Discussion:
> - Without time-based Tx batching the iperf server CPU is the bottleneck due 
> to virtio interrupt load.
> - Ilya-3 does not provide any benefit.
> - With 50 us time-based batching the PMD load reduces by 1/3rd (fewer kicks
>   to the virtio eventfd).
> - The iperf throughput increases by 15%, still limited by the vCPU capacity.
>   But the bottleneck moves from the virtio interrupt handlers in the guest
>   kernel to the TCP stack and iperf process. With multiple threads iperf can
>   fully load the 10G physical link.
> - As expected the RTT latency increases by 190 ~= 4*50 us (2 OVS hops on
>   server and client side).
> - There is no significant difference between the CLOCK_MONOTONIC and the 
> TSC-based implementations.
> 
> 
> Application 2: dpdk pktgen as representative for DPDK application:
> 
> OVS version  max-latency  Mpps   Avg.PMD cycles/pkt  PMD utilization
> ----------------------------------------------------------------------
> Master       n/a          3.92        305.43         99.65%
> Ilya-3       n/a          3.84        310.58         99.31%
> Ilya-6       0 us         3.82        312.47         99.67%
> Ilya-6       50 us        3.80        314.60         99.65%
> Ilya-4-Jan   50 us        3.78        313.65         98.86%
> 
> Discussion:
> - For DPDK applications in the VM Tx batching does not provide any throughput 
> benefit.
> - At full PMD load the output batching overhead causes a capacity drop of 
> 2-3%.
> - There is no significant difference between CLOCK_MONOTONIC and TSC 
> implementations.
> - perf top measurements indicate that the clock_gettime system call eats
>   about 0.6% of the PMD cycles. This does not appear to be enough to justify
>   replacing it with a TSC-based time implementation.
> 
> A zip file with the detailed measurement results can be downloaded from
> https://drive.google.com/open?id=0ByBuumQUR_NYNlRzbUhJX2R6NW8
> 
> 
> Conclusions:
> -----------------
> 1. Time based Tx-batching provides significant performance improvements for 
> kernel-based applications.
> 2. DPDK applications do not benefit in throughput but suffer from the latency 
> increase.
> 3. The worst case overhead implied by Tx batching is about 3% and should be 
> acceptable.
> 4. As there is the obvious trade-off between throughput improvement and 
> latency increase, the maximum output latency should be a
> configuration option. Ideally OVS should have a default parameter per switch 
> and an additional parameter per interface to override
> the default parameter.
> 5. Ilya's CLOCK_MONOTONIC implementation seems efficient enough. There is no
> urgent need to replace it with a TSC-based clock.
> 
> Regards, Jan and Vishal
> 
> > -----Original Message-----
> > From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> > Sent: Monday, 14 August, 2017 14:10
> > To: ovs-dev@openvswitch.org; Jan Scheurich
> > <jan.scheur...@ericsson.com>
> > Cc: Bhanuprakash Bodireddy <bhanuprakash.bodire...@intel.com>;
> > Heetae Ahn <heetae82....@samsung.com>; Vishal Deep Ajmera
> > <vishal.deep.ajm...@ericsson.com>; Ilya Maximets
> > <i.maxim...@samsung.com>
> > Subject: [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for
> > output-max-latency.
> _______________________________________________
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev