On 11/01/2019 16:02, Ian Stokes wrote:
> On 1/10/2019 4:58 PM, Tiago Lam wrote:
>> TCP Segmentation Offload (TSO) is a feature which enables the TCP/IP
>> network stack to delegate segmentation of a TCP segment to the hardware
>> NIC, thus saving compute resources. This may improve performance
>> significantly for TCP workload in virtualized environments.
>>
>> While a previous commit already added the necesary logic to netdev-dpdk
>> to deal with packets marked for TSO, this set of changes enables TSO by
>> default when using multi-segment mbufs.
>>
>> Thus, to enable TSO on the physical DPDK interfaces, only the following
>> command needs to be issued before starting OvS:
>>      ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
> 
> A general question (and not a blocker by any means), is it more 
> intuitive to change above to be 'other_config:dpdk-tso=true' or 
> something similar?
> 
> I'm thinking from a user perspective, they would be familiar with the 
> concept of TSO already but not necessarily multi segments. It seems to 
> be a technical per-requisite for DPDK (and OVS DPDK) under the hood.
> 
> But again multi segments will be needed for other offloads besides TSO 
> so maybe it makes more sense to stick with the argument you have already.

That was my first thought, and the  main reason why I went with
`dpdk-multi-seg-mbufs`. But I'm open to suggestions, as mentioned in the
first RFC of the series. If people think it's better to go with
`dpdk-tso` for now, I'd gladly change.

Thanks for the comments. I've replied to some in-line, but assume I'll
deal with ones I haven't comment as well.

> 
>>
>> Co-authored-by: Mark Kavanagh <[email protected]>
>>
>> Signed-off-by: Mark Kavanagh <[email protected]>
>> Signed-off-by: Tiago Lam <[email protected]>
>> ---
>>   Documentation/automake.mk           |   1 +
>>   Documentation/topics/dpdk/index.rst |   1 +
>>   Documentation/topics/dpdk/tso.rst   | 111 
>> ++++++++++++++++++++++++++++++++++++
>>   NEWS                                |   1 +
>>   lib/netdev-dpdk.c                   |  63 +++++++++++++++++---
>>   5 files changed, 168 insertions(+), 9 deletions(-)
>>   create mode 100644 Documentation/topics/dpdk/tso.rst
>>
>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>> index 082438e..a20deb8 100644
>> --- a/Documentation/automake.mk
>> +++ b/Documentation/automake.mk
>> @@ -39,6 +39,7 @@ DOC_SOURCE = \
>>      Documentation/topics/dpdk/index.rst \
>>      Documentation/topics/dpdk/bridge.rst \
>>      Documentation/topics/dpdk/jumbo-frames.rst \
>> +    Documentation/topics/dpdk/tso.rst \
>>      Documentation/topics/dpdk/memory.rst \
>>      Documentation/topics/dpdk/pdump.rst \
>>      Documentation/topics/dpdk/phy.rst \
>> diff --git a/Documentation/topics/dpdk/index.rst 
>> b/Documentation/topics/dpdk/index.rst
>> index cf24a7b..eb2a04d 100644
>> --- a/Documentation/topics/dpdk/index.rst
>> +++ b/Documentation/topics/dpdk/index.rst
>> @@ -40,4 +40,5 @@ The DPDK Datapath
>>      /topics/dpdk/qos
>>      /topics/dpdk/pdump
>>      /topics/dpdk/jumbo-frames
>> +   /topics/dpdk/tso
>>      /topics/dpdk/memory
>> diff --git a/Documentation/topics/dpdk/tso.rst 
>> b/Documentation/topics/dpdk/tso.rst
>> new file mode 100644
>> index 0000000..503354f
>> --- /dev/null
>> +++ b/Documentation/topics/dpdk/tso.rst
>> @@ -0,0 +1,111 @@
>> +..
>> +      Copyright 2018, Red Hat, Inc.
> 
> Probably Intel in this case rather than RH.

:)

> 
>> +
>> +      Licensed under the Apache License, Version 2.0 (the "License"); you 
>> may
>> +      not use this file except in compliance with the License. You may 
>> obtain
>> +      a copy of the License at
>> +
>> +          http://www.apache.org/licenses/LICENSE-2.0
>> +
>> +      Unless required by applicable law or agreed to in writing, software
>> +      distributed under the License is distributed on an "AS IS" BASIS, 
>> WITHOUT
>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See 
>> the
>> +      License for the specific language governing permissions and 
>> limitations
>> +      under the License.
>> +
>> +      Convention for heading levels in Open vSwitch documentation:
>> +
>> +      =======  Heading 0 (reserved for the title in a document)
>> +      -------  Heading 1
>> +      ~~~~~~~  Heading 2
>> +      +++++++  Heading 3
>> +      '''''''  Heading 4
>> +
>> +      Avoid deeper levels because they do not render well.
>> +
>> +===
>> +TSO
>> +===
> You should probably add a notice that this feature is (experimental).
> 
>> +
>> +.. versionadded:: 2.11.0
> It's good practice to add the version added but traditionally we've done 
> this after the fact once the release is out. Other wise you can have 
> commits on master referring to a release that is not officially out yet.
> 
> This has come up before and in that case we waited for the release 
> before adding.
>> +
>> +TCP Segmentation Offload (TSO) is a mechanism which allows a TCP/IP stack to
>> +offload the TCP segmentation into hardware, thus saving the cycles that 
>> would
>> +be required to perform this same segmentation in Software.
> Minor: Is software capitalized for a reason above and throughout the 
> text below? I would have thought lower case would be fine.
> 
>> +
>> +TCP Segmentation Offload (TSO) enables a network stack to delegate 
>> segmentation
>> +of an oversized TCP segment to the underlying physical NIC. Offload of frame
>> +segmentation achieves computational savings in the core, freeing up CPU 
>> cycles
>> +for more useful work.
>> +
>> +A common use case for TSO is when using virtualization, where traffic that's
>> +coming in from a VM can offload the TCP segmentation, thus avoiding the
>> +fragmentation in Software. Additionally, if the traffic is headed to a VM
>> +within the same host further optimization can be expected. As the traffic 
>> never
>> +leaves the machine, no MTU needs to be accounted for, and thus no 
>> segmentation
>> +and checksum calculations are required, which saves yet more cycles. Only 
>> when
>> +the traffic actually leaves the host the segmentation needs to happen, in 
>> which
>> +case it will be performed by the egress NIC.
>> +
>> +When using TSO with DPDK, the implementation relies on the multi-segment 
>> mbufs
>> +feature, described in :doc:`/topics/dpdk/jumbo-frames`, where each mbuf
>> +contains ~2KiB of the entire packet's data and is linked to the next mbuf 
>> that
>> +contains the next portion of data.
>> +
>> +Enabling TSO
>> +~~~~~~~~~~~~
>> +Once multi-segment mbufs is enabled, TSO will be enabled by default, if 
>> there's
>> +support for it in the underlying physical NICs attached to OvS-DPDK.
>> +
> 
> Probably worth calling out above with an Important: tag.
> 
>> +When using :doc:`vHost User ports <vhost-user>`, TSO may be enabled in one 
>> of
>> +two ways, as follows.
>> +
>> +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
>> +connection is established, `TSO` is thus advertised to the guest as an
>> +available feature:
>> +
>> +  1. QEMU Command Line Parameter:
>> +     ```
>> +     sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
>> +     ...
>> +     -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
>> +     mrg_rxbuf=on,csum=on,gso=on,guest_csum=on,guest_tso4=on,\
>> +     guest_tso6=on,guest_ecn=on \
>> +     ...
>> +     ```
> 
> Your passing gso above as well as tso, is that required? I wasn't sure 
> myself.

Good point. I've cleaned this example to include the bare minimum; GSO
is not required here.

> 
>> +
>> +  2. ethtool
>> +
>> +Assuming that the guest's OS also supports `TSO`, ethtool can be used to 
>> enable
>> +same:
>> +
>> +    ```
>> +    ethtool -K eth0 sg on     # scatter-gather is a prerequisite for TSO
>> +    ethtool -K eth0 tso on
>> +    ethtool -k eth0
>> +    ```
>> +
>> +    <b>Note:</b> In both methods, `mergeable buffers` are required:
> Would call out above as important rather than note.
> 
>> +    ```
>> +    sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
>> +    ...
>> +    mrg_rxbuf=on,\
>> +    ...
>> +    ```
>> +
>> +To enable TSO in a guest, the underlying NIC must first support `TSO` - 
>> consult
>> +your controller's datasheet for compatibility. Secondly, the NIC must have 
>> an
>> +associated DPDK Poll Mode Driver (PMD) which supports `TSO`.
>> +
>> +~~~~~~~~~~~
>> +Limitations
>> +~~~~~~~~~~~
>> +The current OvS `TSO` implementation supports flat and VLAN networks only 
>> (i.e.
>> +no support for `TSO` over tunneled connection [VxLAN, GRE, IPinIP, etc.]).
>> +
>> +Also, as TSO is built on top of multi-segments mbufs, the constraints 
>> pointed
>> +out in :doc:`/topics/dpdk/jumbo-frames` also apply for TSO. Thus, some
>> +performance hits might be noticed when running specific functionalitry, like
> Minor typo in 'functionality' above.
> 
>> +the Userspace Connection tracker. And as mentioned in the same section, it 
>> is
>> +paramount that a packet's headers is contained within the first mbuf (~2KiB 
>> in
>> +size).
>> diff --git a/NEWS b/NEWS
>> index 98f5a9b..c7cfd17 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -23,6 +23,7 @@ Post-v2.10.0
>>        * Add option for simple round-robin based Rxq to PMD assignment.
>>          It can be set with pmd-rxq-assign.
>>        * Add support for DPDK 18.11
>> +     * Add support for TSO (between DPDK interfaces only).
> 
> can you add the experimental tag next to this (there's a few examples in 
> the news doc).
> 
>>      - Add 'symmetric_l3' hash function.
>>      - OVS now honors 'updelay' and 'downdelay' for bonds with LACP 
>> configured.
>>      - ovs-vswitchd:
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index ad7223a..7af37ee 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -374,7 +374,8 @@ struct ingress_policer {
>>   enum dpdk_hw_ol_features {
>>       NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
>>       NETDEV_RX_HW_CRC_STRIP = 1 << 1,
>> -    NETDEV_RX_HW_SCATTER = 1 << 2
>> +    NETDEV_RX_HW_SCATTER = 1 << 2,
>> +    NETDEV_TX_TSO_OFFLOAD = 1 << 3,
>>   };
>>   
>>   /*
>> @@ -1005,19 +1006,26 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, 
>> int n_rxq, int n_txq)
>>       /* Multi-segment-mbuf-specific setup. */
>>       if (dpdk_multi_segment_mbufs) {
>>           if (info.tx_offload_capa & DEV_TX_OFFLOAD_MULTI_SEGS) {
>> -            /* Enable multi-seg mbufs. DPDK PMDs typically attempt to use
>> -             * simple or vectorized transmit functions, neither of which are
>> -             * compatible with multi-segment mbufs. */
>> +            /* DPDK PMDs typically attempt to use simple or vectorized
>> +             * transmit functions, neither of which are compatible with
>> +             * multi-segment mbufs. Ensure that these are disabled when
>> +             * multi-segment mbufs are enabled.
>> +             */
> I would think this comment should be added when the conf.txmode.offloads 
> |= DEV_TX_OFFLOAD_MULTI_SEGS; cod below was introduced rather than at 
> this stage?
>>               conf.txmode.offloads |= DEV_TX_OFFLOAD_MULTI_SEGS;
>> -        } else {
>> -            VLOG_WARN("Interface %s doesn't support multi-segment mbufs",
>> -                      dev->up.name);
>> -            conf.txmode.offloads &= ~DEV_TX_OFFLOAD_MULTI_SEGS;
> Although it's removed here, I believe we discussed that rather than a 
> warning the device should fail to configure if MSEG is enabled but the 
> device doesn't support?

This seems to have been something I missed while rebasing TSO on top of
the latest multi-segs. I've fixed it now.

>> +        }
>> +
>> +        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
>> +            conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
>> +            conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
>> +            conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
>>           }
>>   
>>           txconf = info.default_txconf;
>> -        //txconf.txq_flags = ETH_TXQ_FLAGS_IGNORE;
> I guess this is left over from previous dev work you had, should be 
> removed from this patch for the next revision.
> 
>>           txconf.offloads = conf.txmode.offloads;
>> +    } else if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
>> +        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
>> +        VLOG_WARN("Failed to set Tx TSO offload in %s. Requires option "
>> +                  "`dpdk-multi-seg-mbufs` to be enabled.", dev->up.name);
>>       }
>>   
>>       conf.intr_conf.lsc = dev->lsc_interrupt_mode;
>> @@ -1134,6 +1142,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>       uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
>>                                        DEV_RX_OFFLOAD_TCP_CKSUM |
>>                                        DEV_RX_OFFLOAD_IPV4_CKSUM;
>> +    uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
>> +                                   DEV_TX_OFFLOAD_TCP_CKSUM |
>> +                                   DEV_TX_OFFLOAD_IPV4_CKSUM;
>>   
>>       rte_eth_dev_info_get(dev->port_id, &info);
>>   
>> @@ -1160,6 +1171,18 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>           dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
>>       }
>>   
>> +    if (dpdk_multi_segment_mbufs) {
>> +        if (info.tx_offload_capa & tx_tso_offload_capa) {
>> +            dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
>> +        } else {
>> +            dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
>> +            VLOG_WARN("Tx TSO offload is not supported on port "
>> +                      DPDK_PORT_ID_FMT, dev->port_id);
>> +        }
>> +    } else {
>> +        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
>> +    }
>> +
>>       n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
>>       n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
>>   
>> @@ -1684,6 +1707,11 @@ netdev_dpdk_get_config(const struct netdev *netdev, 
>> struct smap *args)
>>           } else {
>>               smap_add(args, "rx_csum_offload", "false");
>>           }
>> +        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
>> +            smap_add(args, "tx_tso_offload", "true");
>> +        } else {
>> +            smap_add(args, "tx_tso_offload", "false");
>> +        }
>>           smap_add(args, "lsc_interrupt_mode",
>>                    dev->lsc_interrupt_mode ? "true" : "false");
>>       }
>> @@ -2372,6 +2400,15 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk 
>> *dev, struct rte_mbuf **pkts,
>>       for (i = 0; i < pkt_cnt; i++) {
>>           pkt = pkts[i];
>>   
>> +        /* Drop TSO packet if there's no TSO support on egress port */
>> +        if ((pkt->ol_flags & PKT_TX_TCP_SEG) &&
>> +            !(dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD)) {
>> +            VLOG_WARN_RL(&rl, "%s: TSO is disabled on port, TSO packet 
>> dropped"
>> +                        "%" PRIu32 " ", dev->up.name, pkt->pkt_len);
>> +            rte_pktmbuf_free(pkt);
>> +            continue;
>> +        }
> Hmmm, have we changed the nature of the function now. where as before it 
> was just checking packet length now we inspect for offload type as well.
> 
> It could be better to rename the function 'netdev_dpdk_filter_packet' 
> and add an explanation in the comment preceding the function WRT what 
> filters are applied (length and TSO offload support).

Good point. I've renamed the function to the above and added a function
comment to make it clear.

> 
> Ian
>> +
>>           if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
>>               if (!(pkt->ol_flags & PKT_TX_TCP_SEG)) {
>>                   VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
>> @@ -4245,6 +4282,14 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>>           dev->tx_q[0].map = 0;
>>       }
>>   
>> +    if (dpdk_multi_segment_mbufs) {
>> +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
>> +
>> +        VLOG_DBG("%s: TSO enabled on vhost port", dev->up.name);
>> +    } else {
>> +        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
>> +    }
>> +
>>       netdev_dpdk_remap_txqs(dev);
>>   
>>       err = netdev_dpdk_mempool_configure(dev);
>>
> 
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to