Signed-off-by: Flavio Leitner <[email protected]>
---
Documentation/automake.mk | 1 +
Documentation/topics/dpdk/index.rst | 1 +
Documentation/topics/dpdk/tso.rst | 96 +++++++++
NEWS | 1 +
lib/automake.mk | 2 +
lib/conntrack.c | 29 ++-
lib/dp-packet.h | 152 +++++++++++++-
lib/ipf.c | 32 +--
lib/netdev-dpdk.c | 312 ++++++++++++++++++++++++----
lib/netdev-linux-private.h | 4 +
lib/netdev-linux.c | 296 +++++++++++++++++++++++---
lib/netdev-provider.h | 10 +
lib/netdev.c | 66 +++++-
lib/tso.c | 54 +++++
lib/tso.h | 23 ++
vswitchd/bridge.c | 2 +
vswitchd/vswitch.xml | 12 ++
17 files changed, 1002 insertions(+), 91 deletions(-)
create mode 100644 Documentation/topics/dpdk/tso.rst
create mode 100644 lib/tso.c
create mode 100644 lib/tso.h
Changelog:
- v3
* Improved the documentation.
* Updated copyright year to 2020.
* TSO offload log messages now include the netdev's name.
* Added period at the end of all code comments.
* Warn and drop TSO packets that require encapsulation.
* Fixed travis issue with restricted virtio types.
* Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
which caused packet corruption.
* Fixed netdev_dpdk_prep_hwol_packet() to set PKT_TX_IP_CKSUM only
for IPv4 packets instead of unconditionally.
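
Below is a condensed, illustrative sketch of how the new 'tso-support' knob
reaches a netdev implementation; the function name is made up for
illustration, the real changes are in netdev-linux.c, netdev-dpdk.c and
netdev.c below:

    #include <config.h>
    #include "netdev-provider.h"
    #include "tso.h"

    /* Sketch only: mirrors what netdev_linux_common_construct() and
     * netdev_dpdk_vhost_client_reconfigure() do in this patch. */
    static void
    example_advertise_tx_offloads(struct netdev *netdev)
    {
        /* Advertise Tx offloads only when the global 'tso-support' knob is
         * set.  netdev_send_prepare_batch() later drops any packet whose
         * requested offloads are not covered by these flags. */
        if (tso_enabled()) {
            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO
                                | NETDEV_TX_OFFLOAD_TCP_CKSUM
                                | NETDEV_TX_OFFLOAD_IPV4_CKSUM;
        }
    }
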
diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index f2ca17bad..284327edd 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -35,6 +35,7 @@ DOC_SOURCE = \
Documentation/topics/dpdk/index.rst \
Documentation/topics/dpdk/bridge.rst \
Documentation/topics/dpdk/jumbo-frames.rst \
+ Documentation/topics/dpdk/tso.rst \
Documentation/topics/dpdk/memory.rst \
Documentation/topics/dpdk/pdump.rst \
Documentation/topics/dpdk/phy.rst \
diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
index f2862ea70..400d56051 100644
--- a/Documentation/topics/dpdk/index.rst
+++ b/Documentation/topics/dpdk/index.rst
@@ -40,4 +40,5 @@ DPDK Support
/topics/dpdk/qos
/topics/dpdk/pdump
/topics/dpdk/jumbo-frames
+ /topics/dpdk/tso
/topics/dpdk/memory
diff --git a/Documentation/topics/dpdk/tso.rst b/Documentation/topics/dpdk/tso.rst
new file mode 100644
index 000000000..189c86480
--- /dev/null
+++ b/Documentation/topics/dpdk/tso.rst
@@ -0,0 +1,96 @@
+..
+ Copyright 2020, Red Hat, Inc.
+
+ Licensed under the Apache License, Version 2.0 (the "License"); you may
+ not use this file except in compliance with the License. You may obtain
+ a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ License for the specific language governing permissions and limitations
+ under the License.
+
+ Convention for heading levels in Open vSwitch documentation:
+
+ ======= Heading 0 (reserved for the title in a document)
+ ------- Heading 1
+ ~~~~~~~ Heading 2
+ +++++++ Heading 3
+ ''''''' Heading 4
+
+ Avoid deeper levels because they do not render well.
+
+========================
+Userspace Datapath - TSO
+========================
+
+**Note:** This feature is considered experimental.
+
+TCP Segmentation Offload (TSO) enables a network stack to delegate segmentation
+of an oversized TCP segment to the underlying physical NIC. Offload of frame
+segmentation achieves computational savings in the core, freeing up CPU cycles
+for more useful work.
+
+A common use case for TSO is when using virtualization, where traffic coming
+in from a VM can have its TCP segmentation offloaded, thus avoiding the
+segmentation in software. Additionally, if the traffic is headed to a VM
+within the same host, further optimization can be expected. As the traffic
+never leaves the machine, no MTU needs to be accounted for, and thus no
+segmentation and checksum calculations are required, which saves yet more
+cycles. Only when the traffic actually leaves the host does segmentation need
+to happen, in which case it is performed by the egress NIC. This means the NIC
+must support `TSO`; consult your controller's datasheet for compatibility. The
+NIC must also have an associated DPDK Poll Mode Driver (PMD) which supports
+`TSO`. For a list of features per PMD, refer to the `DPDK documentation`__.
+
+__ https://doc.dpdk.org/guides/nics/overview.html
+
+Enabling TSO
+~~~~~~~~~~~~
+
+TSO support may be enabled via a global config value ``tso-support``. Setting
+this to ``true`` enables TSO support for all ports::
+
+ $ ovs-vsctl set Open_vSwitch . other_config:tso-support=true
+
+The default value is ``false``.
+
+Changing ``tso-support`` requires restarting the daemon.
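+
+The current value can be read back, e.g. to confirm the setting before
+restarting the daemon::
+
+    $ ovs-vsctl get Open_vSwitch . other_config:tso-support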
+
+When using :doc:`vHost User ports <vhost-user>`, TSO may be enabled as follows.
+
+`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
+connection is established, `TSO` is thus advertised to the guest as an
+available feature:
+
+1. QEMU Command Line Parameter::
+
+ $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
+ ...
+ -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
+ csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
+ ...
+
+2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can be
+used to enable it::
+
+ $ ethtool -K eth0 sg on # scatter-gather is a prerequisite for TSO
+ $ ethtool -K eth0 tso on
+ $ ethtool -k eth0
+
+Limitations
+~~~~~~~~~~~
+
+The current OvS userspace `TSO` implementation supports flat and VLAN networks
+only (i.e. there is no support for `TSO` over tunneled connections such as
+VxLAN, GRE, IPinIP, etc.).
+
+There is no software implementation of TSO, so all ports attached to the
+datapath must support TSO; otherwise, packets using that feature will be
+dropped on ports without TSO support. That also means guests using vhost-user
+in client mode will receive TSO packets regardless of whether TSO is enabled
+or disabled within the guest.
diff --git a/NEWS b/NEWS
index 965facaf8..306c0493d 100644
--- a/NEWS
+++ b/NEWS
@@ -26,6 +26,7 @@ Post-v2.12.0
* DPDK ring ports (dpdkr) are deprecated and will be removed in next
releases.
* Add support for DPDK 19.11.
+ * Add experimental support for TSO.
- RSTP:
* The rstp_statistics column in Port table will only be updated every
stats-update-interval configured in Open_vSwtich table.
diff --git a/lib/automake.mk b/lib/automake.mk
index ebf714501..94a1b4459 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -304,6 +304,8 @@ lib_libopenvswitch_la_SOURCES = \
lib/tnl-neigh-cache.h \
lib/tnl-ports.c \
lib/tnl-ports.h \
+ lib/tso.c \
+ lib/tso.h \
lib/netdev-native-tnl.c \
lib/netdev-native-tnl.h \
lib/token-bucket.c \
diff --git a/lib/conntrack.c b/lib/conntrack.c
index b80080e72..679054b98 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -2022,7 +2022,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
if (hwol_bad_l3_csum) {
ok = false;
} else {
- bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt);
+ bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt)
+ || dp_packet_hwol_tx_ip_checksum(pkt);
/* Validate the checksum only when hwol is not supported. */
ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), NULL,
!hwol_good_l3_csum);
@@ -2036,7 +2037,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
if (ok) {
bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt);
if (!hwol_bad_l4_csum) {
- bool hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt);
+ bool hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt)
+ || dp_packet_hwol_tx_l4_checksum(pkt);
/* Validate the checksum only when hwol is not supported. */
if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
&ctx->icmp_related, l3, !hwol_good_l4_csum,
@@ -3237,8 +3239,11 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
}
if (seq_skew) {
ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
- l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
- l3_hdr->ip_tot_len, htons(ip_len));
+ if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
+ l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
+ l3_hdr->ip_tot_len,
+ htons(ip_len));
+ }
l3_hdr->ip_tot_len = htons(ip_len);
}
}
@@ -3256,13 +3261,15 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
}
th->tcp_csum = 0;
- if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
- th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
- dp_packet_l4_size(pkt));
- } else {
- uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
- th->tcp_csum = csum_finish(
- csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
+ if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
+ if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
+ th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
+ dp_packet_l4_size(pkt));
+ } else {
+ uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
+ th->tcp_csum = csum_finish(
+ csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
+ }
}
if (seq_skew) {
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 133942155..d10a0416e 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -114,6 +114,8 @@ static inline void dp_packet_set_size(struct dp_packet *, uint32_t);
static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
+void dp_packet_prepend_vnet_hdr(struct dp_packet *, int mtu);
+
void *dp_packet_resize_l2(struct dp_packet *, int increment);
void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
static inline void *dp_packet_eth(const struct dp_packet *);
@@ -456,7 +458,7 @@ dp_packet_init_specific(struct dp_packet *p)
{
/* This initialization is needed for packets that do not come from DPDK
* interfaces, when vswitchd is built with --with-dpdk. */
- p->mbuf.tx_offload = p->mbuf.packet_type = 0;
+ p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
p->mbuf.nb_segs = 1;
p->mbuf.next = NULL;
}
@@ -519,6 +521,80 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
b->mbuf.buf_len = s;
}
+static inline bool
+dp_packet_hwol_is_tso(const struct dp_packet *b)
+{
+ return (b->mbuf.ol_flags & (PKT_TX_TCP_SEG | PKT_TX_L4_MASK))
+ ? true
+ : false;
+}
+
+static inline bool
+dp_packet_hwol_is_ipv4(const struct dp_packet *b)
+{
+ return b->mbuf.ol_flags & PKT_TX_IPV4 ? true : false;
+}
+
+static inline uint64_t
+dp_packet_hwol_l4_mask(const struct dp_packet *b)
+{
+ return b->mbuf.ol_flags & PKT_TX_L4_MASK;
+}
+
+static inline bool
+dp_packet_hwol_l4_is_tcp(const struct dp_packet *b)
+{
+ return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM
+ ? true
+ : false;
+}
+
+static inline bool
+dp_packet_hwol_l4_is_udp(struct dp_packet *b)
+{
+ return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM
+ ? true
+ : false;
+}
+
+static inline bool
+dp_packet_hwol_l4_is_sctp(struct dp_packet *b)
+{
+ return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_SCTP_CKSUM
+ ? true
+ : false;
+}
+
+static inline void
+dp_packet_hwol_set_tx_ipv4(struct dp_packet *b) {
+ b->mbuf.ol_flags |= PKT_TX_IPV4;
+}
+
+static inline void
+dp_packet_hwol_set_tx_ipv6(struct dp_packet *b) {
+ b->mbuf.ol_flags |= PKT_TX_IPV6;
+}
+
+static inline void
+dp_packet_hwol_set_csum_tcp(struct dp_packet *b) {
+ b->mbuf.ol_flags |= PKT_TX_TCP_CKSUM;
+}
+
+static inline void
+dp_packet_hwol_set_csum_udp(struct dp_packet *b) {
+ b->mbuf.ol_flags |= PKT_TX_UDP_CKSUM;
+}
+
+static inline void
+dp_packet_hwol_set_csum_sctp(struct dp_packet *b) {
+ b->mbuf.ol_flags |= PKT_TX_SCTP_CKSUM;
+}
+
+static inline void
+dp_packet_hwol_set_tcp_seg(struct dp_packet *b) {
+ b->mbuf.ol_flags |= PKT_TX_TCP_SEG;
+}
+
/* Returns the RSS hash of the packet 'p'. Note that the returned value is
* correct only if 'dp_packet_rss_valid(p)' returns true */
static inline uint32_t
@@ -648,6 +724,66 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
b->allocated_ = s;
}
+static inline bool
+dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED)
+{
+ return false;
+}
+
+static inline bool
+dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED)
+{
+ return false;
+}
+
+static inline uint64_t
+dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED)
+{
+ return 0;
+}
+
+static inline bool
+dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED)
+{
+ return false;
+}
+
+static inline bool
+dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED)
+{
+ return false;
+}
+
+static inline bool
+dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED)
+{
+ return false;
+}
+
+static inline void
+dp_packet_hwol_set_tx_ipv4(struct dp_packet *b OVS_UNUSED) {
+}
+
+static inline void
+dp_packet_hwol_set_tx_ipv6(struct dp_packet *b OVS_UNUSED) {
+}
+
+static inline void
+dp_packet_hwol_set_csum_tcp(struct dp_packet *b OVS_UNUSED) {
+}
+
+static inline void
+dp_packet_hwol_set_csum_udp(struct dp_packet *b OVS_UNUSED) {
+}
+
+static inline void
+dp_packet_hwol_set_csum_sctp(struct dp_packet *b OVS_UNUSED) {
+}
+
+static inline void
+dp_packet_hwol_set_tcp_seg(struct dp_packet *b OVS_UNUSED) {
+}
+
/* Returns the RSS hash of the packet 'p'. Note that the returned value is
* correct only if 'dp_packet_rss_valid(p)' returns true */
static inline uint32_t
@@ -939,6 +1075,20 @@ dp_packet_batch_reset_cutlen(struct dp_packet_batch *batch)
}
}
+static inline bool
+dp_packet_hwol_tx_ip_checksum(const struct dp_packet *p)
+{
+
+ return dp_packet_hwol_l4_mask(p) ? true : false;
+}
+
+static inline bool
+dp_packet_hwol_tx_l4_checksum(const struct dp_packet *p)
+{
+
+ return dp_packet_hwol_l4_mask(p) ? true : false;
+}
+
#ifdef __cplusplus
}
#endif
diff --git a/lib/ipf.c b/lib/ipf.c
index 45c489122..0f43593a2 100644
--- a/lib/ipf.c
+++ b/lib/ipf.c
@@ -433,9 +433,11 @@ ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
len += rest_len;
l3 = dp_packet_l3(pkt);
ovs_be16 new_ip_frag_off = l3->ip_frag_off & ~htons(IP_MORE_FRAGMENTS);
- l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
- new_ip_frag_off);
- l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
+ if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
+ l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
+ new_ip_frag_off);
+ l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
+ }
l3->ip_tot_len = htons(len);
l3->ip_frag_off = new_ip_frag_off;
dp_packet_set_l2_pad_size(pkt, 0);
@@ -606,6 +608,7 @@ ipf_is_valid_v4_frag(struct ipf *ipf, struct dp_packet *pkt)
}
if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
+ && !dp_packet_hwol_tx_ip_checksum(pkt)
&& csum(l3, ip_hdr_len) != 0)) {
goto invalid_pkt;
}
@@ -1181,16 +1184,21 @@ ipf_post_execute_reass_pkts(struct ipf *ipf,
} else {
struct ip_header *l3_frag = dp_packet_l3(frag_0->pkt);
struct ip_header *l3_reass = dp_packet_l3(pkt);
- ovs_be32 reass_ip = get_16aligned_be32(&l3_reass->ip_src);
- ovs_be32 frag_ip = get_16aligned_be32(&l3_frag->ip_src);
- l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
- frag_ip, reass_ip);
- l3_frag->ip_src = l3_reass->ip_src;
+ if (!dp_packet_hwol_tx_ip_checksum(frag_0->pkt)) {
+ ovs_be32 reass_ip =
+ get_16aligned_be32(&l3_reass->ip_src);
+ ovs_be32 frag_ip =
+ get_16aligned_be32(&l3_frag->ip_src);
+
+ l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
+ frag_ip, reass_ip);
+ reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
+ frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
+ l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
+ frag_ip, reass_ip);
+ }
- reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
- frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
- l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
- frag_ip, reass_ip);
+ l3_frag->ip_src = l3_reass->ip_src;
l3_frag->ip_dst = l3_reass->ip_dst;
}
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 5e09786ac..2de60aa3f 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -64,6 +64,7 @@
#include "smap.h"
#include "sset.h"
#include "timeval.h"
+#include "tso.h"
#include "unaligned.h"
#include "unixctl.h"
#include "util.h"
@@ -360,7 +361,8 @@ struct ingress_policer {
enum dpdk_hw_ol_features {
NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
NETDEV_RX_HW_CRC_STRIP = 1 << 1,
- NETDEV_RX_HW_SCATTER = 1 << 2
+ NETDEV_RX_HW_SCATTER = 1 << 2,
+ NETDEV_TX_TSO_OFFLOAD = 1 << 3,
};
/*
@@ -942,6 +944,12 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC;
}
+ if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+ conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
+ conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
+ conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
+ }
+
/* Limit configured rss hash functions to only those supported
* by the eth device. */
conf.rx_adv_conf.rss_conf.rss_hf &= info.flow_type_rss_offloads;
@@ -1043,6 +1051,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
DEV_RX_OFFLOAD_TCP_CKSUM |
DEV_RX_OFFLOAD_IPV4_CKSUM;
+ uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
+ DEV_TX_OFFLOAD_TCP_CKSUM |
+ DEV_TX_OFFLOAD_IPV4_CKSUM;
rte_eth_dev_info_get(dev->port_id, &info);
@@ -1069,6 +1080,14 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
}
+ if (info.tx_offload_capa & tx_tso_offload_capa) {
+ dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
+ } else {
+ dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
+ VLOG_WARN("Tx TSO offload is not supported on %s port "
+ DPDK_PORT_ID_FMT, netdev_get_name(&dev->up), dev->port_id);
+ }
+
n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
@@ -1319,14 +1338,16 @@ netdev_dpdk_vhost_construct(struct netdev *netdev)
goto out;
}
- err = rte_vhost_driver_disable_features(dev->vhost_id,
- 1ULL << VIRTIO_NET_F_HOST_TSO4
- | 1ULL << VIRTIO_NET_F_HOST_TSO6
- | 1ULL << VIRTIO_NET_F_CSUM);
- if (err) {
- VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
- "port: %s\n", name);
- goto out;
+ if (!tso_enabled()) {
+ err = rte_vhost_driver_disable_features(dev->vhost_id,
+ 1ULL << VIRTIO_NET_F_HOST_TSO4
+ | 1ULL << VIRTIO_NET_F_HOST_TSO6
+ | 1ULL << VIRTIO_NET_F_CSUM);
+ if (err) {
+ VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
+ "port: %s\n", name);
+ goto out;
+ }
}
err = rte_vhost_driver_start(dev->vhost_id);
@@ -1661,6 +1682,11 @@ netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
} else {
smap_add(args, "rx_csum_offload", "false");
}
+ if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+ smap_add(args, "tx_tso_offload", "true");
+ } else {
+ smap_add(args, "tx_tso_offload", "false");
+ }
smap_add(args, "lsc_interrupt_mode",
dev->lsc_interrupt_mode ? "true" : "false");
}
@@ -2088,6 +2114,67 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
rte_free(rx);
}
+/* Prepare the packet for HWOL.
+ * Returns 'true' if the packet is OK to continue. */
+static bool
+netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
+{
+ struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf);
+
+ if (mbuf->ol_flags & PKT_TX_L4_MASK) {
+ mbuf->l2_len = (char *)dp_packet_l3(pkt) - (char *)dp_packet_eth(pkt);
+ mbuf->l3_len = (char *)dp_packet_l4(pkt) - (char *)dp_packet_l3(pkt);
+ mbuf->outer_l2_len = 0;
+ mbuf->outer_l3_len = 0;
+ }
+
+ if (mbuf->ol_flags & PKT_TX_TCP_SEG) {
+ struct tcp_header *th = dp_packet_l4(pkt);
+
+ if (!th) {
+ VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
+ " pkt len: %"PRIu32"", dev->up.name, mbuf->pkt_len);
+ return false;
+ }
+
+ mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
+ mbuf->ol_flags |= PKT_TX_TCP_CKSUM;
+ mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
+
+ if (mbuf->ol_flags & PKT_TX_IPV4) {
+ mbuf->ol_flags |= PKT_TX_IP_CKSUM;
+ }
+ }
+ return true;
+}
+
+/* Prepare a batch for HWOL.
+ * Return the number of good packets in the batch. */
+static int
+netdev_dpdk_prep_hwol_batch(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
+ int pkt_cnt)
+{
+ int i = 0;
+ int cnt = 0;
+ struct rte_mbuf *pkt;
+
+ /* Prepare and filter bad HWOL packets. */
+ for (i = 0; i < pkt_cnt; i++) {
+ pkt = pkts[i];
+ if (!netdev_dpdk_prep_hwol_packet(dev, pkt)) {
+ rte_pktmbuf_free(pkt);
+ continue;
+ }
+
+ if (OVS_UNLIKELY(i != cnt)) {
+ pkts[cnt] = pkt;
+ }
+ cnt++;
+ }
+
+ return cnt;
+}
+
/* Tries to transmit 'pkts' to txq 'qid' of device 'dev'. Takes ownership of
* 'pkts', even in case of failure.
*
@@ -2097,11 +2184,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
struct rte_mbuf **pkts, int cnt)
{
uint32_t nb_tx = 0;
+ uint16_t nb_tx_prep = cnt;
+
+ if (tso_enabled()) {
+ nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
+ if (nb_tx_prep != cnt) {
+ VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
+ "Only %u/%u are valid: %s", dev->up.name, nb_tx_prep,
+ cnt, rte_strerror(rte_errno));
+ }
+ }
- while (nb_tx != cnt) {
+ while (nb_tx != nb_tx_prep) {
uint32_t ret;
- ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx);
+ ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
+ nb_tx_prep - nb_tx);
if (!ret) {
break;
}
@@ -2386,11 +2484,14 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
int cnt = 0;
struct rte_mbuf *pkt;
+    /* Filter oversized packets, unless they are marked for TSO. */
for (i = 0; i < pkt_cnt; i++) {
pkt = pkts[i];
- if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
- VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " max_packet_len %d",
- dev->up.name, pkt->pkt_len, dev->max_packet_len);
+ if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
+ && !(pkt->ol_flags & PKT_TX_TCP_SEG))) {
+ VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
+ "max_packet_len %d", dev->up.name, pkt->pkt_len,
+ dev->max_packet_len);
rte_pktmbuf_free(pkt);
continue;
}
@@ -2442,7 +2543,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
struct netdev_dpdk_sw_stats sw_stats_add;
unsigned int n_packets_to_free = cnt;
- unsigned int total_packets = cnt;
+ unsigned int total_packets;
int i, retries = 0;
int max_retries = VHOST_ENQ_RETRY_MIN;
int vid = netdev_dpdk_get_vid(dev);
@@ -2462,7 +2563,8 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
}
- cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
+ total_packets = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
+ cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, total_packets);
sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt;
/* Check has QoS has been configured for the netdev */
@@ -2511,6 +2613,121 @@ out:
}
}
+static void
+netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
+{
+ rte_free(opaque);
+}
+
+static struct rte_mbuf *
+dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
+{
+ uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len;
+ struct rte_mbuf_ext_shared_info *shinfo = NULL;
+ uint16_t buf_len;
+ void *buf;
+
+ if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo)) {
+ shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
+ } else {
+ total_len += sizeof(*shinfo) + sizeof(uintptr_t);
+ total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
+ }
+
+ if (unlikely(total_len > UINT16_MAX)) {
+ VLOG_ERR("Can't copy packet: too big %u", total_len);
+ return NULL;
+ }
+
+ buf_len = total_len;
+ buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
+ if (unlikely(buf == NULL)) {
+ VLOG_ERR("Failed to allocate memory using rte_malloc: %u", buf_len);
+ return NULL;
+ }
+
+ /* Initialize shinfo. */
+ if (shinfo) {
+ shinfo->free_cb = netdev_dpdk_extbuf_free;
+ shinfo->fcb_opaque = buf;
+ rte_mbuf_ext_refcnt_set(shinfo, 1);
+ } else {
+ shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
+ netdev_dpdk_extbuf_free,
+ buf);
+ if (unlikely(shinfo == NULL)) {
+ rte_free(buf);
+ VLOG_ERR("Failed to initialize shared info for mbuf while "
+ "attempting to attach an external buffer.");
+ return NULL;
+ }
+ }
+
+ rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,
+ shinfo);
+ rte_pktmbuf_reset_headroom(pkt);
+
+ return pkt;
+}
+
+static struct rte_mbuf *
+dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
+{
+ struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
+
+ if (OVS_UNLIKELY(!pkt)) {
+ return NULL;
+ }
+
+ dp_packet_init_specific((struct dp_packet *)pkt);
+ if (rte_pktmbuf_tailroom(pkt) >= data_len) {
+ return pkt;
+ }
+
+ if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
+ return pkt;
+ }
+
+ rte_pktmbuf_free(pkt);
+
+ return NULL;
+}
+
+static struct dp_packet *
+dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig)
+{
+ struct rte_mbuf *mbuf_dest;
+ struct dp_packet *pkt_dest;
+ uint32_t pkt_len;
+
+ pkt_len = dp_packet_size(pkt_orig);
+ mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len);
+ if (OVS_UNLIKELY(mbuf_dest == NULL)) {
+ return NULL;
+ }
+
+ pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
+ memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len);
+ dp_packet_set_size(pkt_dest, pkt_len);
+
+ mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
+ mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
+ mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
+ ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
+
+ memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
+ sizeof(struct dp_packet) - offsetof(struct dp_packet, l2_pad_size));
+
+ if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
+ mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
+ - (char *)dp_packet_eth(pkt_dest);
+ mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
+ - (char *) dp_packet_l3(pkt_dest);
+ }
+
+ return pkt_dest;
+}
+
/* Tx function. Transmit packets indefinitely */
static void
dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
@@ -2524,7 +2741,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
#endif
struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
- struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
+ struct dp_packet *pkts[PKT_ARRAY_SIZE];
struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
uint32_t cnt = batch_cnt;
uint32_t dropped = 0;
@@ -2545,34 +2762,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
struct dp_packet *packet = batch->packets[i];
uint32_t size = dp_packet_size(packet);
- if (OVS_UNLIKELY(size > dev->max_packet_len)) {
- VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
- size, dev->max_packet_len);
-
+ if (size > dev->max_packet_len
+ && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
+ VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
+ dev->max_packet_len);
mtu_drops++;
continue;
}
- pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
+ pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
if (OVS_UNLIKELY(!pkts[txcnt])) {
dropped = cnt - i;
break;
}
- /* We have to do a copy for now */
- memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
- dp_packet_data(packet), size);
- dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
-
txcnt++;
}
if (OVS_LIKELY(txcnt)) {
if (dev->type == DPDK_DEV_VHOST) {
- __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet **) pkts,
- txcnt);
+ __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
} else {
- tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, txcnt);
+ tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
+ (struct rte_mbuf **)pkts,
+ txcnt);
}
}
@@ -2630,6 +2843,7 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
int batch_cnt = dp_packet_batch_size(batch);
struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
+ batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt);
tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
mtu_drops = batch_cnt - tx_cnt;
qos_drops = tx_cnt;
@@ -4345,6 +4559,12 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
rte_free(dev->tx_q);
err = dpdk_eth_dev_init(dev);
+ if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+ netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
+ netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
+ netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
+ }
+
dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
if (!dev->tx_q) {
err = ENOMEM;
@@ -4374,6 +4594,11 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
dev->tx_q[0].map = 0;
}
+ if (tso_enabled()) {
+ dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
+ VLOG_DBG("%s: TSO enabled on vhost port", netdev_get_name(&dev->up));
+ }
+
netdev_dpdk_remap_txqs(dev);
err = netdev_dpdk_mempool_configure(dev);
@@ -4446,6 +4671,11 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
}
+ /* Enable External Buffers if TCP Segmentation Offload is enabled. */
+ if (tso_enabled()) {
+ vhost_flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
+ }
+
err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
if (err) {
VLOG_ERR("vhost-user device setup failure for device %s\n",
@@ -4470,14 +4700,20 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
goto unlock;
}
- err = rte_vhost_driver_disable_features(dev->vhost_id,
- 1ULL << VIRTIO_NET_F_HOST_TSO4
- | 1ULL << VIRTIO_NET_F_HOST_TSO6
- | 1ULL << VIRTIO_NET_F_CSUM);
- if (err) {
- VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
- "client port: %s\n", dev->up.name);
- goto unlock;
+ if (tso_enabled()) {
+ netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
+ netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
+ netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
+ } else {
+ err = rte_vhost_driver_disable_features(dev->vhost_id,
+ 1ULL << VIRTIO_NET_F_HOST_TSO4
+ | 1ULL << VIRTIO_NET_F_HOST_TSO6
+ | 1ULL << VIRTIO_NET_F_CSUM);
+ if (err) {
+ VLOG_ERR("rte_vhost_driver_disable_features failed for "
+ "vhost user client port: %s\n", dev->up.name);
+ goto unlock;
+ }
}
err = rte_vhost_driver_start(dev->vhost_id);
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
index f08159aa7..102548db7 100644
--- a/lib/netdev-linux-private.h
+++ b/lib/netdev-linux-private.h
@@ -37,10 +37,14 @@
struct netdev;
+#define LINUX_RXQ_TSO_MAX_LEN 65536
+
struct netdev_rxq_linux {
struct netdev_rxq up;
bool is_tap;
int fd;
+ char *bufaux; /* Extra buffer to recv TSO pkt. */
+ int bufaux_len; /* Extra buffer length. */
};
int netdev_linux_construct(struct netdev *);
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 8a62f9d74..604cb6913 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -29,16 +29,18 @@
#include <linux/filter.h>
#include <linux/gen_stats.h>
#include <linux/if_ether.h>
+#include <linux/if_packet.h>
#include <linux/if_tun.h>
#include <linux/types.h>
#include <linux/ethtool.h>
#include <linux/mii.h>
#include <linux/rtnetlink.h>
#include <linux/sockios.h>
+#include <linux/virtio_net.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
+#include <sys/uio.h>
#include <sys/utsname.h>
-#include <netpacket/packet.h>
#include <net/if.h>
#include <net/if_arp.h>
#include <net/route.h>
@@ -72,6 +74,7 @@
#include "socket-util.h"
#include "sset.h"
#include "tc.h"
+#include "tso.h"
#include "timer.h"
#include "unaligned.h"
#include "openvswitch/vlog.h"
@@ -501,6 +504,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
* changes in the device miimon status, so we can use atomic_count. */
static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
+static int netdev_linux_parse_vnet_hdr(struct dp_packet *b);
+static void netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu);
static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
int cmd, const char *cmd_name);
static int get_flags(const struct netdev *, unsigned int *flags);
@@ -902,6 +907,13 @@ netdev_linux_common_construct(struct netdev *netdev_)
/* The device could be in the same network namespace or in another one. */
netnsid_unset(&netdev->netnsid);
ovs_mutex_init(&netdev->mutex);
+
+ if (tso_enabled()) {
+ netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
+ netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
+ netdev_->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
+ }
+
return 0;
}
@@ -961,6 +973,10 @@ netdev_linux_construct_tap(struct netdev *netdev_)
/* Create tap device. */
get_flags(&netdev->up, &netdev->ifi_flags);
ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+ if (tso_enabled()) {
+ ifr.ifr_flags |= IFF_VNET_HDR;
+ }
+
ovs_strzcpy(ifr.ifr_name, name, sizeof ifr.ifr_name);
if (ioctl(netdev->tap_fd, TUNSETIFF, &ifr) == -1) {
VLOG_WARN("%s: creating tap device failed: %s", name,
@@ -1024,6 +1040,13 @@ static struct netdev_rxq *
netdev_linux_rxq_alloc(void)
{
struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
+ if (tso_enabled()) {
+ rx->bufaux = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
+ if (rx->bufaux) {
+ rx->bufaux_len = LINUX_RXQ_TSO_MAX_LEN;
+ }
+ }
+
return &rx->up;
}
@@ -1069,6 +1092,17 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
goto error;
}
+ if (tso_enabled()) {
+ error = setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
+ sizeof val);
+ if (error) {
+ error = errno;
+            VLOG_ERR("%s: failed to enable vnet hdr in rxq raw socket: %s",
+ netdev_get_name(netdev_), ovs_strerror(errno));
+ goto error;
+ }
+ }
+
/* Set non-blocking mode. */
error = set_nonblocking(rx->fd);
if (error) {
@@ -1123,6 +1157,8 @@ netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
if (!rx->is_tap) {
close(rx->fd);
}
+
+ free(rx->bufaux);
}
static void
@@ -1152,11 +1188,13 @@ auxdata_has_vlan_tci(const struct tpacket_auxdata *aux)
}
static int
-netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
+netdev_linux_rxq_recv_sock(int fd, char *bufaux, int bufaux_len,
+ struct dp_packet *buffer)
{
- size_t size;
+ size_t std_len;
+ size_t total_len;
ssize_t retval;
- struct iovec iov;
+ struct iovec iov[2];
struct cmsghdr *cmsg;
union {
struct cmsghdr cmsg;
@@ -1166,14 +1204,17 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
/* Reserve headroom for a single VLAN tag */
dp_packet_reserve(buffer, VLAN_HEADER_LEN);
- size = dp_packet_tailroom(buffer);
+ std_len = dp_packet_tailroom(buffer);
+ total_len = std_len + bufaux_len;
- iov.iov_base = dp_packet_data(buffer);
- iov.iov_len = size;
+ iov[0].iov_base = dp_packet_data(buffer);
+ iov[0].iov_len = std_len;
+ iov[1].iov_base = bufaux;
+ iov[1].iov_len = bufaux_len;
msgh.msg_name = NULL;
msgh.msg_namelen = 0;
- msgh.msg_iov = &iov;
- msgh.msg_iovlen = 1;
+ msgh.msg_iov = iov;
+ msgh.msg_iovlen = 2;
msgh.msg_control = &cmsg_buffer;
msgh.msg_controllen = sizeof cmsg_buffer;
msgh.msg_flags = 0;
@@ -1184,11 +1225,26 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
if (retval < 0) {
return errno;
- } else if (retval > size) {
+ } else if (retval > total_len) {
return EMSGSIZE;
}
- dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
+ if (retval > std_len) {
+ /* Build a single linear TSO packet. */
+ size_t extra_len = retval - std_len;
+
+ dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
+ dp_packet_prealloc_tailroom(buffer, extra_len);
+ memcpy(dp_packet_tail(buffer), bufaux, extra_len);
+ dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
+ } else {
+ dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
+ }
+
+ if (tso_enabled() && netdev_linux_parse_vnet_hdr(buffer)) {
+ VLOG_WARN_RL(&rl, "Invalid virtio net header");
+ return EINVAL;
+ }
for (cmsg = CMSG_FIRSTHDR(&msgh); cmsg; cmsg = CMSG_NXTHDR(&msgh, cmsg)) {
const struct tpacket_auxdata *aux;
@@ -1221,20 +1277,44 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
}
static int
-netdev_linux_rxq_recv_tap(int fd, struct dp_packet *buffer)
+netdev_linux_rxq_recv_tap(int fd, char *bufaux, int bufaux_len,
+ struct dp_packet *buffer)
{
ssize_t retval;
- size_t size = dp_packet_tailroom(buffer);
+ size_t std_len;
+ struct iovec iov[2];
+
+ std_len = dp_packet_tailroom(buffer);
+ iov[0].iov_base = dp_packet_data(buffer);
+ iov[0].iov_len = std_len;
+ iov[1].iov_base = bufaux;
+ iov[1].iov_len = bufaux_len;
do {
- retval = read(fd, dp_packet_data(buffer), size);
+ retval = readv(fd, iov, 2);
} while (retval < 0 && errno == EINTR);
if (retval < 0) {
return errno;
}
- dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
+ if (retval > std_len) {
+ /* Build a single linear TSO packet. */
+ size_t extra_len = retval - std_len;
+
+ dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
+ dp_packet_prealloc_tailroom(buffer, extra_len);
+ memcpy(dp_packet_tail(buffer), bufaux, extra_len);
+ dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
+ } else {
+ dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
+ }
+
+ if (tso_enabled() && netdev_linux_parse_vnet_hdr(buffer)) {
+ VLOG_WARN_RL(&rl, "Invalid virtio net header");
+ return EINVAL;
+ }
+
return 0;
}
@@ -1245,6 +1325,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
struct netdev *netdev = rx->up.netdev;
struct dp_packet *buffer;
+ size_t buffer_len;
ssize_t retval;
int mtu;
@@ -1252,12 +1333,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
mtu = ETH_PAYLOAD_MAX;
}
+ buffer_len = VLAN_ETH_HEADER_LEN + mtu;
+ if (tso_enabled()) {
+ buffer_len += sizeof(struct virtio_net_hdr);
+ }
+
/* Assume Ethernet port. No need to set packet_type. */
- buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
- DP_NETDEV_HEADROOM);
+ buffer = dp_packet_new_with_headroom(buffer_len, DP_NETDEV_HEADROOM);
retval = (rx->is_tap
- ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
- : netdev_linux_rxq_recv_sock(rx->fd, buffer));
+ ? netdev_linux_rxq_recv_tap(rx->fd, rx->bufaux, rx->bufaux_len,
+ buffer)
+ : netdev_linux_rxq_recv_sock(rx->fd, rx->bufaux, rx->bufaux_len,
+ buffer));
if (retval) {
if (retval != EAGAIN && retval != EMSGSIZE) {
@@ -1302,7 +1389,7 @@ netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
}
static int
-netdev_linux_sock_batch_send(int sock, int ifindex,
+netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
struct dp_packet_batch *batch)
{
const size_t size = dp_packet_batch_size(batch);
@@ -1316,6 +1403,10 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
struct dp_packet *packet;
DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+ if (tso) {
+ netdev_linux_prepend_vnet_hdr(packet, mtu);
+ }
+
iov[i].iov_base = dp_packet_data(packet);
iov[i].iov_len = dp_packet_size(packet);
mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll,
@@ -1348,7 +1439,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
* on other interface types because we attach a socket filter to the rx
* socket. */
static int
-netdev_linux_tap_batch_send(struct netdev *netdev_,
+netdev_linux_tap_batch_send(struct netdev *netdev_, bool tso, int mtu,
struct dp_packet_batch *batch)
{
struct netdev_linux *netdev = netdev_linux_cast(netdev_);
@@ -1365,10 +1456,15 @@ netdev_linux_tap_batch_send(struct netdev *netdev_,
}
DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
- size_t size = dp_packet_size(packet);
+ size_t size;
ssize_t retval;
int error;
+ if (tso) {
+ netdev_linux_prepend_vnet_hdr(packet, mtu);
+ }
+
+ size = dp_packet_size(packet);
do {
retval = write(netdev->tap_fd, dp_packet_data(packet), size);
error = retval < 0 ? errno : 0;
@@ -1403,9 +1499,15 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
struct dp_packet_batch *batch,
bool concurrent_txq OVS_UNUSED)
{
+ bool tso = tso_enabled();
+ int mtu = ETH_PAYLOAD_MAX;
int error = 0;
int sock = 0;
+ if (tso) {
+ netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu);
+ }
+
if (!is_tap_netdev(netdev_)) {
if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
error = EOPNOTSUPP;
@@ -1424,9 +1526,9 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
goto free_batch;
}
- error = netdev_linux_sock_batch_send(sock, ifindex, batch);
+ error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
} else {
- error = netdev_linux_tap_batch_send(netdev_, batch);
+ error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
}
if (error) {
if (error == ENOBUFS) {
@@ -6173,6 +6275,19 @@ af_packet_sock(void)
close(sock);
sock = -error;
}
+
+ if (tso_enabled()) {
+ int val = 1;
+ error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val,
+ sizeof val);
+ if (error) {
+ error = errno;
+ VLOG_ERR("failed to enable vnet hdr in raw socket: %s",
+ ovs_strerror(errno));
+ close(sock);
+ sock = -error;
+ }
+ }
} else {
sock = -errno;
VLOG_ERR("failed to create packet socket: %s",
@@ -6183,3 +6298,136 @@ af_packet_sock(void)
return sock;
}
+
+static int
+netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto)
+{
+ struct eth_header *eth_hdr;
+ ovs_be16 eth_type;
+ int l2_len;
+
+ eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN);
+ if (!eth_hdr) {
+ return -EINVAL;
+ }
+
+ l2_len = ETH_HEADER_LEN;
+ eth_type = eth_hdr->eth_type;
+ if (eth_type_vlan(eth_type)) {
+ struct vlan_header *vlan = dp_packet_at(b, l2_len, VLAN_HEADER_LEN);
+
+ if (!vlan) {
+ return -EINVAL;
+ }
+
+ eth_type = vlan->vlan_next_type;
+ l2_len += VLAN_HEADER_LEN;
+ }
+
+ if (eth_type == htons(ETH_TYPE_IP)) {
+ struct ip_header *ip_hdr = dp_packet_at(b, l2_len, IP_HEADER_LEN);
+
+ if (!ip_hdr) {
+ return -EINVAL;
+ }
+
+ *l4proto = ip_hdr->ip_proto;
+ dp_packet_hwol_set_tx_ipv4(b);
+ } else if (eth_type == htons(ETH_TYPE_IPV6)) {
+ struct ovs_16aligned_ip6_hdr *nh6;
+
+ nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN);
+ if (!nh6) {
+ return -EINVAL;
+ }
+
+ *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt;
+ dp_packet_hwol_set_tx_ipv6(b);
+ }
+
+ return 0;
+}
+
+static int
+netdev_linux_parse_vnet_hdr(struct dp_packet *b)
+{
+ struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet);
+ uint16_t l4proto = 0;
+
+ if (OVS_UNLIKELY(!vnet)) {
+ return -EINVAL;
+ }
+
+ if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) {
+ return 0;
+ }
+
+ if (netdev_linux_parse_l2(b, &l4proto)) {
+ return -EINVAL;
+ }
+
+ if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+ if (l4proto == IPPROTO_TCP) {
+ dp_packet_hwol_set_csum_tcp(b);
+ } else if (l4proto == IPPROTO_UDP) {
+ dp_packet_hwol_set_csum_udp(b);
+ } else if (l4proto == IPPROTO_SCTP) {
+ dp_packet_hwol_set_csum_sctp(b);
+ }
+ }
+
+ if (l4proto && vnet->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+ uint8_t allowed_mask = VIRTIO_NET_HDR_GSO_TCPV4
+ | VIRTIO_NET_HDR_GSO_TCPV6
+ | VIRTIO_NET_HDR_GSO_UDP;
+ uint8_t type = vnet->gso_type & allowed_mask;
+
+ if (type == VIRTIO_NET_HDR_GSO_TCPV4
+ || type == VIRTIO_NET_HDR_GSO_TCPV6) {
+ dp_packet_hwol_set_tcp_seg(b);
+ }
+ }
+
+ return 0;
+}
+
+static void
+netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
+{
+ struct virtio_net_hdr *vnet = dp_packet_push_zeros(b, sizeof *vnet);
+
+ if ((dp_packet_size(b) > mtu) && dp_packet_hwol_is_tso(b)) {
+ uint16_t hdr_len = ((char *)dp_packet_l4(b) - (char *)dp_packet_eth(b))
+ + TCP_HEADER_LEN;
+
+ vnet->hdr_len = (OVS_FORCE __virtio16)hdr_len;
+ vnet->gso_size = (OVS_FORCE __virtio16)(mtu - hdr_len);
+ if (dp_packet_hwol_is_ipv4(b)) {
+ vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+ } else {
+ vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+ }
+
+ } else {
+ vnet->flags = VIRTIO_NET_HDR_GSO_NONE;
+ }
+
+ if (dp_packet_hwol_l4_mask(b)) {
+ vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+ vnet->csum_start = (OVS_FORCE __virtio16)((char *)dp_packet_l4(b)
+ - (char *)dp_packet_eth(b));
+
+ if (dp_packet_hwol_l4_is_tcp(b)) {
+ vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
+ struct tcp_header, tcp_csum);
+ } else if (dp_packet_hwol_l4_is_udp(b)) {
+ vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
+ struct udp_header, udp_csum);
+ } else if (dp_packet_hwol_l4_is_sctp(b)) {
+ vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
+ struct sctp_header, sctp_csum);
+ } else {
+ VLOG_WARN_RL(&rl, "Unsupported L4 protocol");
+ }
+ }
+}
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index f109c4e66..87c375b47 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -37,6 +37,12 @@ extern "C" {
struct netdev_tnl_build_header_params;
#define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC
+enum netdev_ol_flags {
+ NETDEV_TX_OFFLOAD_IPV4_CKSUM = 1 << 0,
+ NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
+ NETDEV_TX_OFFLOAD_TCP_TSO = 1 << 2,
+};
+
/* A network device (e.g. an Ethernet device).
*
* Network device implementations may read these members but should not modify
@@ -51,6 +57,10 @@ struct netdev {
* opening this device, and therefore got assigned to the "system" class
*/
bool auto_classified;
+    /* This bitmask of the offloading features enabled/supported by the
+     * netdev. */
+ uint64_t ol_flags;
+
/* If this is 'true', the user explicitly specified an MTU for this
* netdev. Otherwise, Open vSwitch is allowed to override it. */
bool mtu_user_config;
diff --git a/lib/netdev.c b/lib/netdev.c
index 405c98c68..998525875 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -782,6 +782,52 @@ netdev_get_pt_mode(const struct netdev *netdev)
: NETDEV_PT_LEGACY_L2);
}
+/* Check if a 'packet' is compatible with 'netdev_flags'.
+ * If a packet is incompatible, return 'false' with the 'errormsg'
+ * pointing to a reason. */
+static bool
+netdev_send_prepare_packet(const uint64_t netdev_flags,
+ struct dp_packet *packet, char **errormsg)
+{
+ if (dp_packet_hwol_is_tso(packet)
+ && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
+ /* Fall back to GSO in software. */
+ *errormsg = "No TSO support";
+ return false;
+ }
+
+ if (dp_packet_hwol_l4_mask(packet)
+ && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) {
+ /* Fall back to L4 csum in software. */
+ *errormsg = "No L4 checksum support";
+ return false;
+ }
+
+ return true;
+}
+
+/* Check if each packet in 'batch' is compatible with 'netdev' features,
+ * otherwise either fall back to software implementation or drop it. */
+static void
+netdev_send_prepare_batch(const struct netdev *netdev,
+ struct dp_packet_batch *batch)
+{
+ struct dp_packet *packet;
+ size_t i, size = dp_packet_batch_size(batch);
+
+ DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
+ char *errormsg = NULL;
+
+ if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
+ dp_packet_batch_refill(batch, packet, i);
+ } else {
+            VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
+                         netdev_get_name(netdev),
+                         errormsg ? errormsg : "Unsupported feature");
+ }
+ }
+}
+
/* Sends 'batch' on 'netdev'. Returns 0 if successful (for every packet),
* otherwise a positive errno value. Returns EAGAIN without blocking if
* at least one the packets cannot be queued immediately. Returns EMSGSIZE
@@ -811,8 +857,10 @@ int
netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
bool concurrent_txq)
{
- int error = netdev->netdev_class->send(netdev, qid, batch,
- concurrent_txq);
+ int error;
+
+ netdev_send_prepare_batch(netdev, batch);
+ error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
if (!error) {
COVERAGE_INC(netdev_sent);
}
@@ -878,9 +926,17 @@ netdev_push_header(const struct netdev *netdev,
const struct ovs_action_push_tnl *data)
{
struct dp_packet *packet;
- DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
- netdev->netdev_class->push_header(netdev, packet, data);
- pkt_metadata_init(&packet->md, data->out_port);
+ size_t i, size = dp_packet_batch_size(batch);
+
+ DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
+ if (!dp_packet_hwol_is_tso(packet)) {
+ netdev->netdev_class->push_header(netdev, packet, data);
+ pkt_metadata_init(&packet->md, data->out_port);
+ dp_packet_batch_refill(batch, packet, i);
+ } else {
+ VLOG_WARN_RL(&rl, "%s: Tunneling of TSO packet is not supported: "
+ "packet dropped", netdev_get_name(netdev));
+ }
}
return 0;
diff --git a/lib/tso.c b/lib/tso.c
new file mode 100644
index 000000000..9dc15e146
--- /dev/null
+++ b/lib/tso.c
@@ -0,0 +1,54 @@
+/*
+ * Copyright (c) 2020 Red Hat, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+
+#include "smap.h"
+#include "ovs-thread.h"
+#include "openvswitch/vlog.h"
+#include "dpdk.h"
+#include "tso.h"
+#include "vswitch-idl.h"
+
+VLOG_DEFINE_THIS_MODULE(tso);
+
+static bool tso_support_enabled = false;
+
+void
+tso_init(const struct smap *ovs_other_config)
+{
+ if (smap_get_bool(ovs_other_config, "tso-support", false)) {
+ static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+
+ if (ovsthread_once_start(&once)) {
+ if (dpdk_available()) {
+ VLOG_INFO("TCP Segmentation Offloading (TSO) support enabled");
+ tso_support_enabled = true;
+ } else {
+ VLOG_ERR("TCP Segmentation Offloading (TSO) is unsupported "
+ "without enabling DPDK");
+ tso_support_enabled = false;
+ }
+ ovsthread_once_done(&once);
+ }
+ }
+}
+
+bool
+tso_enabled(void)
+{
+ return tso_support_enabled;
+}
diff --git a/lib/tso.h b/lib/tso.h
new file mode 100644
index 000000000..6594496ac
--- /dev/null
+++ b/lib/tso.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright (c) 2020 Red Hat Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef TSO_H
+#define TSO_H 1
+
+void tso_init(const struct smap *ovs_other_config);
+bool tso_enabled(void);
+
+#endif /* tso.h */
diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
index 86c7b10a9..6d73922f6 100644
--- a/vswitchd/bridge.c
+++ b/vswitchd/bridge.c
@@ -65,6 +65,7 @@
#include "system-stats.h"
#include "timeval.h"
#include "tnl-ports.h"
+#include "tso.h"
#include "util.h"
#include "unixctl.h"
#include "lib/vswitch-idl.h"
@@ -3285,6 +3286,7 @@ bridge_run(void)
if (cfg) {
netdev_set_flow_api_enabled(&cfg->other_config);
dpdk_init(&cfg->other_config);
+ tso_init(&cfg->other_config);
}
/* Initialize the ofproto library. This only needs to run once, but
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 0ec726c39..354dcabfa 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -690,6 +690,18 @@
once in few hours or a day or a week.
</p>
</column>
+ <column name="other_config" key="tso-support"
+ type='{"type": "boolean"}'>
+ <p>
+ Set this value to <code>true</code> to enable support for TSO (TCP
+ Segmentation Offloading). When TSO is enabled, vhost-user client
+ interfaces can transmit packets up to 64KB.
+ </p>
+ <p>
+ The default value is <code>false</code>. Changing this value requires
+ restarting the daemon.
+ </p>
+ </column>
</group>
<group title="Status">
<column name="next_cfg">