TCP Segmentation Offload (TSO) is a feature which enables the TCP/IP network
stack to delegate segmentation of a TCP segment to the hardware NIC, thus
saving compute resources. This may improve performance significantly for TCP
workloads in virtualized environments.

While a previous commit already added the necessary logic to netdev-dpdk to
deal with packets marked for TSO, this set of changes enables TSO by default
when using multi-segment mbufs.

Thus, to enable TSO on the physical DPDK interfaces, only the following
command needs to be issued before starting OvS:

    ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Co-authored-by: Mark Kavanagh <mark.b.kavan...@intel.com>
Signed-off-by: Mark Kavanagh <mark.b.kavan...@intel.com>
Signed-off-by: Tiago Lam <tiago....@intel.com>
---
 Documentation/automake.mk           |  1 +
 Documentation/topics/dpdk/index.rst |  1 +
 Documentation/topics/dpdk/tso.rst   | 99 +++++++++++++++++++++++++++++++++++++
 NEWS                                |  1 +
 lib/netdev-dpdk.c                   | 70 ++++++++++++++++++++++++--
 5 files changed, 167 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/topics/dpdk/tso.rst

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 082438e..a20deb8 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -39,6 +39,7 @@ DOC_SOURCE = \
 	Documentation/topics/dpdk/index.rst \
 	Documentation/topics/dpdk/bridge.rst \
 	Documentation/topics/dpdk/jumbo-frames.rst \
+	Documentation/topics/dpdk/tso.rst \
 	Documentation/topics/dpdk/memory.rst \
 	Documentation/topics/dpdk/pdump.rst \
 	Documentation/topics/dpdk/phy.rst \
diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
index cf24a7b..eb2a04d 100644
--- a/Documentation/topics/dpdk/index.rst
+++ b/Documentation/topics/dpdk/index.rst
@@ -40,4 +40,5 @@ The DPDK Datapath
     /topics/dpdk/qos
     /topics/dpdk/pdump
     /topics/dpdk/jumbo-frames
+    /topics/dpdk/tso
     /topics/dpdk/memory
diff --git a/Documentation/topics/dpdk/tso.rst b/Documentation/topics/dpdk/tso.rst
new file mode 100644
index 0000000..14f8c39
--- /dev/null
+++ b/Documentation/topics/dpdk/tso.rst
@@ -0,0 +1,99 @@
+..
+      Copyright 2018, Red Hat, Inc.
+
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+===
+TSO
+===
+
+**Note:** This feature is considered experimental.
+
+TCP Segmentation Offload (TSO) is a mechanism which allows a TCP/IP stack to
+delegate segmentation of an oversized TCP segment to the underlying physical
+NIC, thus saving the cycles that would be required to perform this same
+segmentation in software and freeing up those CPU cycles for more useful work.
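+
+To put the saving in perspective (an illustrative back-of-the-envelope
+figure, not a measurement): segmenting a 64 KiB TCP payload at the common
+1460-byte MSS produces roughly 45 MSS-sized frames, each of which needs its
+own headers and checksums computed in software, whereas with TSO the stack
+hands the NIC a single oversized frame and the NIC replicates the headers
+itself.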
+
+A common use case for TSO is virtualization, where traffic coming in from a
+VM can have its TCP segmentation offloaded, avoiding segmentation in
+software. Additionally, if the traffic is headed to a VM within the same
+host, further optimization can be expected. As the traffic never leaves the
+machine, no MTU needs to be accounted for, and thus no segmentation and
+checksum calculations are required, which saves yet more cycles. Only when
+the traffic actually leaves the host does segmentation need to happen, in
+which case it is performed by the egress NIC.
+
+When using TSO with DPDK, the implementation relies on the multi-segment
+mbufs feature, described in :doc:`/topics/dpdk/jumbo-frames`, where each mbuf
+contains ~2KiB of the entire packet's data and is linked to the next mbuf
+that contains the next portion of data.
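+
+As an illustration of this chained layout (a minimal sketch, not code from
+the OvS tree; ``process_bytes()`` is a hypothetical consumer), a packet's
+segments can be traversed with the standard mbuf API::
+
+    #include <rte_mbuf.h>
+
+    /* Hypothetical consumer of a run of packet bytes. */
+    void process_bytes(const char *data, uint16_t len);
+
+    /* Visit each segment of a (possibly multi-segment) packet in turn. */
+    static void
+    walk_mbuf_chain(const struct rte_mbuf *pkt)
+    {
+        const struct rte_mbuf *seg;
+
+        for (seg = pkt; seg != NULL; seg = seg->next) {
+            /* 'data_len' is the number of bytes in this segment only;
+             * 'pkt_len' (on the first mbuf) covers the whole chain. */
+            process_bytes(rte_pktmbuf_mtod(seg, const char *), seg->data_len);
+        }
+    }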
+
+Enabling TSO
+~~~~~~~~~~~~
+
+.. important::
+
+    Once multi-segment mbufs are enabled, TSO is enabled by default, provided
+    there is support for it in the underlying physical NICs attached to
+    OvS-DPDK.
+
+When using :doc:`vHost User ports <vhost-user>`, TSO may be enabled in one of
+two ways, as follows.
+
+`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
+connection is established, `TSO` is thus advertised to the guest as an
+available feature:
+
+1. QEMU Command Line Parameter::
+
+    $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
+      ...
+      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
+      csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
+      ...
+
+2. Ethtool. Assuming the guest's OS also supports `TSO`, ethtool can be used
+   to enable it::
+
+    $ ethtool -K eth0 sg on    # scatter-gather is a prerequisite for TSO
+    $ ethtool -K eth0 tso on
+    $ ethtool -k eth0
+
+To enable TSO in a guest, the underlying NIC must first support `TSO` -
+consult your controller's datasheet for compatibility. Secondly, the NIC
+must have an associated DPDK Poll Mode Driver (PMD) which supports `TSO`.
+
+Limitations
+~~~~~~~~~~~
+
+The current OvS `TSO` implementation supports flat and VLAN networks only
+(i.e. no support for `TSO` over tunneled connections [VxLAN, GRE, IPinIP,
+etc.]).
+
+Also, as TSO is built on top of multi-segment mbufs, the constraints pointed
+out in :doc:`/topics/dpdk/jumbo-frames` also apply to TSO. Thus, some
+performance hits might be noticed when running specific functionality, like
+the Userspace Connection tracker. As mentioned in the same section, it is
+paramount that a packet's headers are contained within the first mbuf
+(~2KiB in size).
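+
+A minimal sketch of the kind of check this constraint implies (illustrative
+only, not code from the OvS tree; computing ``hdrs_len`` from the actual
+L2/L3/L4 headers is left to the caller)::
+
+    #include <stdbool.h>
+    #include <rte_mbuf.h>
+
+    /* True if the first 'hdrs_len' bytes of the packet (its headers) sit
+     * entirely within the first segment of the mbuf chain. */
+    static bool
+    headers_in_first_mbuf(const struct rte_mbuf *pkt, uint32_t hdrs_len)
+    {
+        return hdrs_len <= pkt->data_len;
+    }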
diff --git a/NEWS b/NEWS
index 98f5a9b..dc07b5a 100644
--- a/NEWS
+++ b/NEWS
@@ -23,6 +23,7 @@ Post-v2.10.0
      * Add option for simple round-robin based Rxq to PMD assignment. It
        can be set with pmd-rxq-assign.
      * Add support for DPDK 18.11
+     * Add support for TSO (experimental, between DPDK interfaces only).
    - Add 'symmetric_l3' hash function.
    - OVS now honors 'updelay' and 'downdelay' for bonds with LACP configured.
    - ovs-vswitchd:
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index b30d791..5a855fc 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -374,7 +374,8 @@ struct ingress_policer {
 enum dpdk_hw_ol_features {
     NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
     NETDEV_RX_HW_CRC_STRIP = 1 << 1,
-    NETDEV_RX_HW_SCATTER = 1 << 2
+    NETDEV_RX_HW_SCATTER = 1 << 2,
+    NETDEV_TX_TSO_OFFLOAD = 1 << 3,
 };
 
 /*
@@ -1019,8 +1020,18 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
             return -ENOTSUP;
         }
 
+        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+            conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
+            conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
+            conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
+        }
+
         txconf = info.default_txconf;
         txconf.offloads = conf.txmode.offloads;
+    } else if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
+        VLOG_WARN("Failed to set Tx TSO offload in %s. Requires option "
+                  "`dpdk-multi-seg-mbufs` to be enabled.", dev->up.name);
     }
 
     conf.intr_conf.lsc = dev->lsc_interrupt_mode;
@@ -1137,6 +1148,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
     uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
                                      DEV_RX_OFFLOAD_TCP_CKSUM |
                                      DEV_RX_OFFLOAD_IPV4_CKSUM;
+    uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
+                                   DEV_TX_OFFLOAD_TCP_CKSUM |
+                                   DEV_TX_OFFLOAD_IPV4_CKSUM;
 
     rte_eth_dev_info_get(dev->port_id, &info);
 
@@ -1163,6 +1177,18 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
         dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
     }
 
+    if (dpdk_multi_segment_mbufs) {
+        if (info.tx_offload_capa & tx_tso_offload_capa) {
+            dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
+        } else {
+            dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
+            VLOG_WARN("Tx TSO offload is not supported on port "
+                      DPDK_PORT_ID_FMT, dev->port_id);
+        }
+    } else {
+        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
+    }
+
     n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
     n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
 
@@ -1687,6 +1713,11 @@ netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
         } else {
             smap_add(args, "rx_csum_offload", "false");
         }
+        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+            smap_add(args, "tx_tso_offload", "true");
+        } else {
+            smap_add(args, "tx_tso_offload", "false");
+        }
         smap_add(args, "lsc_interrupt_mode",
                  dev->lsc_interrupt_mode ? "true" : "false");
     }
@@ -2363,9 +2394,21 @@ netdev_dpdk_qos_run(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
     return cnt;
 }
 
+/* Filters DPDK packets by the following criteria:
+ * - A packet is marked for TSO but the egress dev doesn't
+ *   support TSO;
+ * - A packet's pkt_len is bigger than the pre-defined
+ *   max_packet_len, and the packet isn't marked for TSO.
+ *
+ * If either of the above cases applies, the packet is freed
+ * from 'pkts'. Otherwise the packet is kept in 'pkts'
+ * untouched.
+ *
+ * Returns the number of unfiltered packets left in 'pkts'.
+ */
 static int
-netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
-                              int pkt_cnt)
+netdev_dpdk_filter_packet(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
+                          int pkt_cnt)
 {
     int i = 0;
     int cnt = 0;
@@ -2375,6 +2418,15 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
     for (i = 0; i < pkt_cnt; i++) {
         pkt = pkts[i];
 
+        /* Drop TSO packet if there's no TSO support on the egress port. */
+        if ((pkt->ol_flags & PKT_TX_TCP_SEG) &&
+            !(dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD)) {
+            VLOG_WARN_RL(&rl, "%s: TSO is disabled on port, TSO packet of "
+                         "size %" PRIu32 " dropped", dev->up.name, pkt->pkt_len);
+            rte_pktmbuf_free(pkt);
+            continue;
+        }
+
         if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
             if (!(pkt->ol_flags & PKT_TX_TCP_SEG)) {
                 VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
@@ -2445,7 +2497,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
 
     rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
 
-    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
+    cnt = netdev_dpdk_filter_packet(dev, cur_pkts, cnt);
     /* Check has QoS has been configured for the netdev */
     cnt = netdev_dpdk_qos_run(dev, cur_pkts, cnt, true);
     dropped = total_pkts - cnt;
@@ -2656,7 +2708,7 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
     int batch_cnt = dp_packet_batch_size(batch);
     struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
 
-    tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
+    tx_cnt = netdev_dpdk_filter_packet(dev, pkts, batch_cnt);
     tx_cnt = netdev_dpdk_qos_run(dev, pkts, tx_cnt, true);
     dropped = batch_cnt - tx_cnt;
 
@@ -4249,6 +4301,14 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
         dev->tx_q[0].map = 0;
     }
 
+    if (dpdk_multi_segment_mbufs) {
+        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
+
+        VLOG_DBG("%s: TSO enabled on vhost port", dev->up.name);
+    } else {
+        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
+    }
+
     netdev_dpdk_remap_txqs(dev);
 
     err = netdev_dpdk_mempool_configure(dev);
-- 
2.7.4

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev