On 20.12.2016 15:19, Bodireddy, Bhanuprakash wrote:
>> -----Original Message-----
>> From: Ilya Maximets [mailto:[email protected]]
>> Sent: Tuesday, December 20, 2016 8:09 AM
>> To: Bodireddy, Bhanuprakash <[email protected]>; Aaron
>> Conole <[email protected]>
>> Cc: [email protected]; Daniele Di Proietto <[email protected]>;
>> Thadeu Lima de Souza Cascardo <[email protected]>; Fischetti, Antonio
>> <[email protected]>; Markus Magnusson
>> <[email protected]>
>> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during
>> packet transmission.
>>
>> On 19.12.2016 21:05, Bodireddy, Bhanuprakash wrote:
>>> Thanks Ilya and Aaron for reviewing this patch and providing your
>> comments, my reply inline.
>>>
>>>> -----Original Message-----
>>>> From: Ilya Maximets [mailto:[email protected]]
>>>> Sent: Monday, December 19, 2016 8:41 AM
>>>> To: Aaron Conole <[email protected]>; Bodireddy, Bhanuprakash
>>>> <[email protected]>
>>>> Cc: [email protected]; Daniele Di Proietto
>>>> <[email protected]>; Thadeu Lima de Souza Cascardo
>>>> <[email protected]>; Fischetti, Antonio
>>>> <[email protected]>; Markus Magnusson
>>>> <[email protected]>
>>>> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue
>>>> during packet transmission.
>>>>
>>>> Hi,
>>>>
>>>> I didn't test this patch yet. Bhanu, could you please describe your
>>>> test scenario and performance results in more details.
>>>
>>> During the recent performance analysis improvements for the classifier, we
>>> found that a bottleneck was also observed at flow batching.
>>> This was due to fewer packets per batch. To reproduce it, a simple P2P test
>>> case can be used with 30 IXIA streams and matching IP flow rules.
>>> E.g., for an IXIA stream with src IP 2.2.2.1, dst IP 5.5.5.1:
>>>         ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=2.2.2.1,actions=output:2
>>>
>>>         For an IXIA stream with src IP 4.4.4.1, dst IP 15.15.15.1:
>>>         ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=4.4.4.1,actions=output:2
>>>
>>> This leaves fewer packets in each batch, and packet_batch_per_flow_execute()
>>> is invoked for every batch. With 30 flows, I see the throughput drop to
>>> ~7.0 Mpps in the PHY2PHY case for 64-byte UDP packets.
>>>
>>>> It'll be nice if you provide throughput and latency measurement
>>>> results for different scenarios and packet sizes. Latency is important 
>>>> here.
>>> We are yet to do latency measurements for this case. With 30 IXIA
>>> streams comprising 64-byte UDP packets, there was a throughput
>>> improvement of 30% in the P2P case and 13-15% in the PVP case (single
>>> queue). We will try to get the latency stats with and without this patch.
>>>
>>>>
>>>> About the patch itself:
>>>>
>>>>  1. 'drain' called only for PMD threads. This will lead to
>>>>     broken connection with non-PMD ports.
>>> I see that non-PMD ports are handled by the vswitchd thread.
>>> I tested the PVP loopback case with tap ports and found it to be working
>>> as expected. Can you let me know the specific case you are referring to
>>> here, so that I can verify whether the patch breaks it.
>>
>> I meant something like this:
>>
>>
>> *-----------HOST-1(br-int)-----*               *---------HOST-2(br-int)------*
>> |                              |               |                             |
>> |   internal_port <--> dpdk0 <-------------------> dpdk0 <--> internal_port  |
>> |   192.168.0.1/24             |               |              192.168.0.2/24 |
>> *------------------------------*               *-----------------------------*
>>
>> (HOST-1)# ping 192.168.0.2
>>
>> In this case I'm expecting that the first (NETDEV_MAX_BURST - 1) ICMP
>> packets will be stuck in the TX queue. The next packet will cause the whole
>> batch of NETDEV_MAX_BURST ICMP packets to be sent. That is not good
>> behaviour.
> 
> Thanks for the test case. I tested this and found it to be working with the
> patch. The reason is that PMD threads use the intermediate queue
> implementation by invoking 'netdev_dpdk_eth_tx_queue()', whereas the
> non-PMD thread (vswitchd) continues to call netdev_dpdk_eth_tx_burst()
> in the patch, so there is no change in the previous behaviour here.
> 
> I had set up a case similar to the one in the above diagram using 2 hosts
> and assigning IP addresses to the respective bridges.
> When I ping Host 2 (192.168.0.2) from Host 1 (192.168.0.1), below is the
> call path.
> 
> (HOST-1)# ping 192.168.0.2 -c 1  (send only one packet)
> 
> When sending the ICMP echo request packet:
> Vswitchd thread:  dp_execute_cb()
>                     netdev_send()    [pkt sent on dpdk0, qid: 0]
>                       netdev_dpdk_send__()
>                         dpdk_do_tx_copy()
>                           netdev_dpdk_eth_tx_burst()
> 
> On receiving the ICMP echo reply packet:
> PMD thread:  dp_execute_cb()
>                netdev_send()    [pkt sent on br0, qid: 1]
>                  netdev_linux_send()
> 

OK. Maybe in this exact case it works, but it isn't guaranteed that non-PMD
threads will always call send with 'may_steal=false'.

Also, I see another issue here:

    8. netdev_send(.., may_steal=true, ...) uses instant send while
       netdev_send(.., may_steal=false, ...) uses queueing of packets.
       This behaviour will lead to packet reordering issues in some
       cases.

    9. You're decreasing 'txq->count' inside 'netdev_dpdk_eth_tx_burst()'
       while it can be called with an argument different from
       'txq->burst_pkts'. Some of the queued packets will be lost, plus a
       memory leak.
           
Best regards, Ilya Maximets.

> 
>>
>>>>  2. 'xps_get_tx_qid()' called twice. First time on send and
>>>>     the second time on drain. This may lead to different
>>>>     returned 'tx_qid's and packets will stuck forever in
>>>>     tx buffer.
>>>
>>> You are right, and xps_get_tx_qid() can return different tx_qids.
>>> I was always testing XPS-disabled cases and completely overlooked this
>>> one. This could be a potential problem for us, and any suggestions would
>>> be very helpful.
>>>
>>>>
>>>>  3. 'txq_drain()' must take the 'tx_lock' for queue in case
>>>>     of dynamic tx queues.
>>>
>>> Agree, will handle this in next version.
>>>
>>>>
>>>>  4. Waiting for 1024 polling cycles of PMD thread may cause
>>>>     a huge latency if we have few packets per second on one
>>>>     port and intensive traffic on others.
>>>
>>> I agree with you. We discussed this here and thought invoking the drain
>>> logic once every 1024 cycles was an optimal solution. But if the
>>> community thinks otherwise, we can move this into the main 'for' loop so
>>> that it can be invoked more often, provided that 'DRAIN_TSC' cycles have
>>> elapsed.
>>>
>>>>
>>>>  5. This patch breaks the counting of 'tx_dropped' packets
>>>>     in netdev-dpdk.
>>>
>>> I presume you are referring to the below lines in the code.
>>>
>>> - dropped += netdev_dpdk_eth_tx_burst()
>>> + netdev_dpdk_eth_tx_queue()
>>>
>>> Will handle this in v2 of this patch.
>>>
>>>>
>>>>  6. Comments in netdev-provider.h should be fixed to reflect
>>>>     all the changes.
>>>
>>> Will make necessary changes in netdev-provider.h
>>>
>>>>
>>>>  7. At last, I agree with Aaron that explicitly allowing only
>>>>     'dpdk' ports is not good style. Also, mentioning the name of
>>>>     an exact netdev inside the common code is bad style too.
>>>
>>> Will handle this appropriately.
>>>
>>> Regards,
>>> Bhanu Prakash.
>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>>> On 16.12.2016 22:24, Aaron Conole wrote:
>>>>> Hi Bhanu,
>>>>>
>>>>> Bhanuprakash Bodireddy <[email protected]> writes:
>>>>>
>>>>>> In exact match cache processing, on an EMC hit, packets are queued
>>>>>> into batches matching the flow. Thereafter, packets are processed
>>>>>> in batches for faster packet processing. This is particularly
>>>>>> inefficient if there are fewer packets in a batch, as
>>>>>> rte_eth_tx_burst() incurs an expensive MMIO write.
>>>>>>
>>>>>> This commit adds back the intermediate queue implementation. Packets
>>>>>> are queued and burst when the packet count >= NETDEV_MAX_BURST.
>>>>>> Also, the drain logic is refactored to handle fewer packets in the
>>>>>> tx queues. Testing shows significant performance gains with queueing.
>>>>>>
>>>>>> Fixes: b59cc14e032d ("netdev-dpdk: Use instant sending instead of
>>>>>> queueing of packets.")
>>>>>> Signed-off-by: Bhanuprakash Bodireddy
>>>>>> <[email protected]>
>>>>>> Signed-off-by: Antonio Fischetti <[email protected]>
>>>>>> Co-authored-by: Antonio Fischetti <[email protected]>
>>>>>> Signed-off-by: Markus Magnusson
>> <[email protected]>
>>>>>> Co-authored-by: Markus Magnusson
>> <[email protected]>
>>>>>> ---
>>>>>
>>>>> I've Cc'd Ilya just in hopes that the patch gets a better review
>>>>> than I could give.  As a general comment, I like the direction -
>>>>> batched operations are usually a better way of going.
>>>>>
>>>>> Just minor below.
>>>>>
>>>>> ... snip ...
>>>>>>
>>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>>>> index 3509493..65dff83 100644
>>>>>> --- a/lib/dpif-netdev.c
>>>>>> +++ b/lib/dpif-netdev.c
>>>>>> @@ -622,6 +622,9 @@ static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
>>>>>>  static inline bool emc_entry_alive(struct emc_entry *ce);
>>>>>>  static void emc_clear_entry(struct emc_entry *ce);
>>>>>>
>>>>>> +static struct tx_port *pmd_send_port_cache_lookup
>>>>>> +    (const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
>>>>>> +
>>>>>>  static void
>>>>>>  emc_cache_init(struct emc_cache *flow_cache)
>>>>>>  {
>>>>>> @@ -2877,6 +2880,31 @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,
>>>>>>  }
>>>>>>
>>>>>>  static void
>>>>>> +dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
>>>>>> +                         struct dp_netdev_port *port,
>>>>>> +                         uint64_t now)
>>>>>> +{
>>>>>> +    int tx_qid;
>>>>>> +
>>>>>> +    if (!strcmp(port->type, "dpdk")) {
>>>>>
>>>>> Any reason to restrict this only to dpdk ports?  It looks like
>>>>> you've added a new netdev operation, so why not just call the
>>>>> netdev_txq_drain unconditionally?
>>>>>
>>>>> Also, bit of a nit, but tx_qid can be reduced in scope down to the
>>>>> if block below.
>>>>>
>>>>>> +        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
>>>>>> +                 u32_to_odp(port->port_no));
>>>>>> +
>>>>>> +        if (OVS_LIKELY(tx)) {
>>>>>> +            bool dynamic_txqs = tx->port->dynamic_txqs;
>>>>>> +
>>>>>> +            if (dynamic_txqs) {
>>>>>> +                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
>>>>>> +            } else {
>>>>>> +                tx_qid = pmd->static_tx_qid;
>>>>>> +            }
>>>>>> +
>>>>>> +            netdev_txq_drain(port->netdev, tx_qid);
>>>>>> +        }
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
> 
> 
> 
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
