Hi Ilya,

Thanks for your detailed feedback. I've added a couple of replies inline about when the instant send happens and the 'may_steal' parameter.
Regards,
Antonio

> -----Original Message-----
> From: Ilya Maximets [mailto:[email protected]]
> Sent: Tuesday, December 20, 2016 12:47 PM
> To: Bodireddy, Bhanuprakash <[email protected]>; Aaron Conole <[email protected]>
> Cc: [email protected]; Daniele Di Proietto <[email protected]>; Thadeu Lima de Souza Cascardo <[email protected]>; Fischetti, Antonio <[email protected]>; Markus Magnusson <[email protected]>
> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during packet transmission.
>
> On 20.12.2016 15:19, Bodireddy, Bhanuprakash wrote:
> >> -----Original Message-----
> >> From: Ilya Maximets [mailto:[email protected]]
> >> Sent: Tuesday, December 20, 2016 8:09 AM
> >> To: Bodireddy, Bhanuprakash <[email protected]>; Aaron Conole <[email protected]>
> >> Cc: [email protected]; Daniele Di Proietto <[email protected]>; Thadeu Lima de Souza Cascardo <[email protected]>; Fischetti, Antonio <[email protected]>; Markus Magnusson <[email protected]>
> >> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during packet transmission.
> >>
> >> On 19.12.2016 21:05, Bodireddy, Bhanuprakash wrote:
> >>> Thanks Ilya and Aaron for reviewing this patch and providing your comments; my reply is inline.
> >>>
> >>>> -----Original Message-----
> >>>> From: Ilya Maximets [mailto:[email protected]]
> >>>> Sent: Monday, December 19, 2016 8:41 AM
> >>>> To: Aaron Conole <[email protected]>; Bodireddy, Bhanuprakash <[email protected]>
> >>>> Cc: [email protected]; Daniele Di Proietto <[email protected]>; Thadeu Lima de Souza Cascardo <[email protected]>; Fischetti, Antonio <[email protected]>; Markus Magnusson <[email protected]>
> >>>> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during packet transmission.
> >>>>
> >>>> Hi,
> >>>>
> >>>> I didn't test this patch yet. Bhanu, could you please describe your
> >>>> test scenario and performance results in more detail.
> >>>
> >>> During the recent performance analysis improvements for the classifier, we found that a bottleneck was also observed at flow batching.
> >>> This was due to fewer packets per batch. To reproduce this, a simple P2P test case can be used with 30 IXIA streams and matching IP flow rules.
> >>> E.g., for an IXIA stream with src IP 2.2.2.1, dst IP 5.5.5.1:
> >>>     ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=2.2.2.1,actions=output:2
> >>>
> >>> For an IXIA stream with src IP 4.4.4.1, dst IP 15.15.15.1:
> >>>     ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=4.4.4.1,actions=output:2
> >>>
> >>> This leaves fewer packets in the batches, and packet_batch_per_flow_execute() is invoked for every batch.
> >>> With 30 flows, I see the throughput drop to ~7.0 Mpps in the PHY2PHY case for 64-byte UDP packets.
> >>>
> >>>> It'll be nice if you provide throughput and latency measurement
> >>>> results for different scenarios and packet sizes. Latency is important here.
> >>>
> >>> We are yet to do latency measurements in this case. With 30 IXIA streams comprising 64-byte UDP packets there was a throughput improvement of 30% in the P2P case and 13-15% in the PVP case (single queue). We will try to get the latency stats with and without this patch.
> >>>
> >>>> About the patch itself:
> >>>>
> >>>> 1. 'drain' called only for PMD threads. This will lead to
> >>>>    broken connections with non-PMD ports.
> >>>
> >>> I see that non-PMD ports are handled by the vswitchd thread.
> >>> I tested the PVP loopback case with tap ports and found it to be working as expected. Can you let me know the specific case you are referring to here, so that I can verify whether the patch breaks it.
> >>
> >> I meant something like this:
> >>
> >> *-----------HOST-1(br-int)-----*        *-----------HOST-2(br-int)-----*
> >> |                              |        |                              |
> >> | internal_port <--> dpdk0 <------------->  dpdk0 <--> internal_port  |
> >> | 192.168.0.1/24               |        |              192.168.0.2/24 |
> >> *------------------------------*        *------------------------------*
> >>
> >> (HOST-1)# ping 192.168.0.2
> >>
> >> In this case I'm expecting that the first (NETDEV_MAX_BURST - 1) ICMP packets
> >> will be stuck in the TX queue. The next packet will cause the sending of the whole
> >> batch of NETDEV_MAX_BURST ICMP packets. That is not good behaviour.
> >
> > Thanks for the test case. I tested this and found it to be working with the patch.
> > The reason is that the PMD threads use the intermediate queue implementation by invoking netdev_dpdk_eth_tx_queue(),
> > whereas the non-PMD thread (vswitchd) continues to call netdev_dpdk_eth_tx_burst() in the patch, so there is no change to the previous behavior here.
> >
> > I had set up a case similar to the one in the above diagram, using 2 hosts and assigning IP addresses to the respective bridges.
> > When I ping Host 2 (192.168.0.2) from Host 1 (192.168.0.1), below is the call path.
> >
> > (HOST-1)# ping 192.168.0.2 -c 1    (send only one packet)
> >
> > When sending the ICMP echo request packet:
> > vswitchd thread: dp_execute_cb()
> >                    netdev_send()              [pkt sent on dpdk0, qid: 0]
> >                      netdev_dpdk_send__()
> >                        dpdk_do_tx_copy()
> >                          netdev_dpdk_eth_tx_burst()
> >
> > On receiving the ICMP echo reply packet:
> > PMD thread: dp_execute_cb()
> >               netdev_send()                   [pkt sent on br0, qid: 1]
> >                 netdev_linux_send()
>
> OK. Maybe in this exact case it works, but it isn't guaranteed that non-PMD
> threads will always call send with 'may_steal=false'.

[Antonio F] In my test case (details below) the non-PMD thread is calling send with 'may_steal=true' and the packet is sent instantly.
Regardless of the 'may_steal' value, netdev_dpdk_send__() calls dpdk_do_tx_copy() for non-PMD threads, because the condition (batch->packets[0]->source != DPBUF_DPDK) is true: the buffer data is not from DPDK-allocated memory.

In my test case I have one host with:
 - 2 dpdk ports, dpdk0 and dpdk1,
 - an IP address assigned to the internal port 'br0', and
 - a single flow set up as: ovs-ofctl add-flow br0 actions=NORMAL

When an ARP request is sent from an IXIA generator to dpdk0:
 1. The PMD thread forwards it to the internal port br0 with may_steal=0, via netdev_linux_send().
 2. The PMD thread also forwards the ARP request to dpdk1 with may_steal=1. As the buffer data is from DPDK-allocated memory,
        if (OVS_UNLIKELY(!may_steal || batch->packets[0]->source != DPBUF_DPDK))
    is false => the ARP request is put into the queue.

When an ARP reply is received from br0:
 1. The non-PMD thread forwards it to dpdk0 with may_steal=1. As the buffer data is NOT from DPDK-allocated memory,
        if (OVS_UNLIKELY(!may_steal || batch->packets[0]->source != DPBUF_DPDK))
    is true => dpdk_do_tx_copy() is called and the packet is sent out immediately.

> Also, I see another issue here:
>
> 8. netdev_send(.., may_steal=false, ...) uses instant send while
>    netdev_send(.., may_steal=true, ...) uses queueing of packets.
>    This behaviour will lead to packet reordering issues in some
>    cases.

[Antonio F] The instant send happens when, in netdev_dpdk_send__(), the condition
    if (OVS_UNLIKELY(!may_steal || batch->packets[0]->source != DPBUF_DPDK))
is true => dpdk_do_tx_copy() => netdev_dpdk_eth_tx_burst() is called.
Can you please provide more details about the packet reordering issue you mentioned?

> 9. You're decreasing 'txq->count' inside 'netdev_dpdk_eth_tx_burst()'
>    while it can be called with an argument different from 'txq->burst_pkts'.
>    Some of the queued packets will be lost + memory leak.
>
> Best regards, Ilya Maximets.
>
> >
> >>
> >>>> 2. 'xps_get_tx_qid()' called twice. First time on send and
> >>>>    the second time on drain. This may lead to different
> >>>>    returned 'tx_qid's and packets will be stuck forever in
> >>>>    the tx buffer.
> >>>
> >>> You are right, and xps_get_tx_qid() can return different tx_qids.
> >>> I was always testing XPS-disabled cases and completely overlooked this case.
> >>> This could be a potential problem for us and any suggestions would be very helpful.
> >>>
> >>>> 3. 'txq_drain()' must take the 'tx_lock' for the queue in case
> >>>>    of dynamic tx queues.
> >>>
> >>> Agree, will handle this in the next version.
> >>>
> >>>> 4. Waiting for 1024 polling cycles of the PMD thread may cause
> >>>>    a huge latency if we have few packets per second on one
> >>>>    port and intensive traffic on others.
> >>>
> >>> I agree with you. We discussed this here and thought invoking the drain logic once every 1024 cycles was an optimal solution. But if the community thinks otherwise, we can move this into the main 'for' loop so that it can be invoked more often, provided that 'DRAIN_TSC' cycles have elapsed.
> >>>
> >>>> 5. This patch breaks the counting of 'tx_dropped' packets
> >>>>    in netdev-dpdk.
> >>>
> >>> I presume you are referring to the below line in the code:
> >>>
> >>>     -        dropped += netdev_dpdk_eth_tx_burst()
> >>>     +        netdev_dpdk_eth_tx_queue()
> >>>
> >>> Will handle this in v2 of this patch.
> >>>
> >>>> 6. Comments in netdev-provider.h should be fixed to reflect
> >>>>    all the changes.
> >>>
> >>> Will make the necessary changes in netdev-provider.h.
> >>>
> >>>> 7. At last, I agree with Aaron that explicitly allowing only
> >>>>    'dpdk' ports is not a good style. Also, mentioning the name of
> >>>>    an exact netdev inside the common code is bad style too.
> >>>
> >>> Will handle this appropriately.
> >>>
> >>> Regards,
> >>> Bhanu Prakash.
> >>>
> >>>> Best regards, Ilya Maximets.
> >>>>
> >>>> On 16.12.2016 22:24, Aaron Conole wrote:
> >>>>> Hi Bhanu,
> >>>>>
> >>>>> Bhanuprakash Bodireddy <[email protected]> writes:
> >>>>>
> >>>>>> In exact match cache processing, on an EMC hit, packets are queued
> >>>>>> into batches matching the flow. Thereafter, packets are processed
> >>>>>> in batches for faster packet processing. This is particularly
> >>>>>> inefficient if there are fewer packets in a batch, as
> >>>>>> rte_eth_tx_burst() incurs an expensive MMIO write.
> >>>>>>
> >>>>>> This commit adds back the intermediate queue implementation. Packets
> >>>>>> are queued and burst when the packet count >= NETDEV_MAX_BURST.
> >>>>>> Also, the drain logic is refactored to handle fewer packets in the tx queues.
> >>>>>> Testing shows significant performance gains with queueing.
> >>>>>>
> >>>>>> Fixes: b59cc14e032d ("netdev-dpdk: Use instant sending instead of
> >>>>>> queueing of packets.")
> >>>>>> Signed-off-by: Bhanuprakash Bodireddy <[email protected]>
> >>>>>> Signed-off-by: Antonio Fischetti <[email protected]>
> >>>>>> Co-authored-by: Antonio Fischetti <[email protected]>
> >>>>>> Signed-off-by: Markus Magnusson <[email protected]>
> >>>>>> Co-authored-by: Markus Magnusson <[email protected]>
> >>>>>> ---
> >>>>>
> >>>>> I've Cc'd Ilya just in hopes that the patch gets a better review
> >>>>> than I could give. As a general comment, I like the direction -
> >>>>> batched operations are usually a better way of going.
> >>>>>
> >>>>> Just minor below.
> >>>>>
> >>>>> ... snip ...
> >>>>>>
> >>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> >>>>>> index 3509493..65dff83 100644
> >>>>>> --- a/lib/dpif-netdev.c
> >>>>>> +++ b/lib/dpif-netdev.c
> >>>>>> @@ -622,6 +622,9 @@ static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
> >>>>>>  static inline bool emc_entry_alive(struct emc_entry *ce);
> >>>>>>  static void emc_clear_entry(struct emc_entry *ce);
> >>>>>>
> >>>>>> +static struct tx_port *pmd_send_port_cache_lookup
> >>>>>> +    (const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
> >>>>>> +
> >>>>>>  static void
> >>>>>>  emc_cache_init(struct emc_cache *flow_cache)
> >>>>>>  {
> >>>>>> @@ -2877,6 +2880,31 @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,
> >>>>>>  }
> >>>>>>
> >>>>>>  static void
> >>>>>> +dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
> >>>>>> +                         struct dp_netdev_port *port,
> >>>>>> +                         uint64_t now)
> >>>>>> +{
> >>>>>> +    int tx_qid;
> >>>>>> +
> >>>>>> +    if (!strcmp(port->type, "dpdk")) {
> >>>>>
> >>>>> Any reason to restrict this only to dpdk ports? It looks like
> >>>>> you've added a new netdev operation, so why not just call
> >>>>> netdev_txq_drain unconditionally?
> >>>>>
> >>>>> Also, bit of a nit, but tx_qid can be reduced in scope down to the
> >>>>> if block below.
> >>>>>
> >>>>>> +        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
> >>>>>> +                                 u32_to_odp(port->port_no));
> >>>>>> +
> >>>>>> +        if (OVS_LIKELY(tx)) {
> >>>>>> +            bool dynamic_txqs = tx->port->dynamic_txqs;
> >>>>>> +
> >>>>>> +            if (dynamic_txqs) {
> >>>>>> +                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
> >>>>>> +            } else {
> >>>>>> +                tx_qid = pmd->static_tx_qid;
> >>>>>> +            }
> >>>>>> +
> >>>>>> +            netdev_txq_drain(port->netdev, tx_qid);
> >>>>>> +        }
> >>>>>> +    }
> >>>>>> +}
> >>>>>> +
> >>>>>
> >>>>>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
