On 16.01.2020 18:00, Flavio Leitner wrote: > Abbreviated as TSO, TCP Segmentation Offload is a feature which enables > the network stack to delegate the TCP segmentation to the NIC reducing > the per packet CPU overhead. > > A guest using vhostuser interface with TSO enabled can send TCP packets > much bigger than the MTU, which saves CPU cycles normally used to break > the packets down to MTU size and to calculate checksums. > > It also saves CPU cycles used to parse multiple packets/headers during > the packet processing inside virtual switch. > > If the destination of the packet is another guest in the same host, then > the same big packet can be sent through a vhostuser interface skipping > the segmentation completely. However, if the destination is not local, > the NIC hardware is instructed to do the TCP segmentation and checksum > calculation. > > It is recommended to check if NIC hardware supports TSO before enabling > the feature, which is off by default. For additional information please > check the tso.rst document. > > Signed-off-by: Flavio Leitner <[email protected]>
I still haven't checked the computation of offsets and the device configuration. A few comments inline.

In general, I'd like Ben to take a look at this patch, especially at the extensive netdev-linux code changes.

Also, I suggest renaming the patch to: 'userspace: Add TCP Segmentation Offload support.'

Best regards, Ilya Maximets.

> ---
> Documentation/automake.mk | 1 +
> Documentation/topics/index.rst | 1 +
> Documentation/topics/userspace-tso.rst | 98 +++++++
> NEWS | 1 +
> lib/automake.mk | 2 +
> lib/conntrack.c | 29 +-
> lib/dp-packet.h | 186 +++++++++++-
> lib/ipf.c | 32 +-
> lib/netdev-dpdk.c | 349 +++++++++++++++++++---
> lib/netdev-linux-private.h | 5 +
> lib/netdev-linux.c | 386 ++++++++++++++++++++++---
> lib/netdev-provider.h | 9 +
> lib/netdev.c | 78 ++++-
> lib/userspace-tso.c | 48 +++
> lib/userspace-tso.h | 23 ++
> vswitchd/bridge.c | 2 +
> vswitchd/vswitch.xml | 17 ++
> 17 files changed, 1143 insertions(+), 124 deletions(-)
> create mode 100644 Documentation/topics/userspace-tso.rst
> create mode 100644 lib/userspace-tso.c
> create mode 100644 lib/userspace-tso.h
>
> Changelog:
> - v4
> * rebased on top of master (recvmmsg)
> * fixed URL in doc to point to 19.11
> * renamed tso to userspace-tso
> * renamed the option to userspace-tso-enable
> * removed prototype that left over from v2
> * fixed function style declaration
> * renamed dp_packet_hwol_tx_ip_checksum to dp_packet_hwol_tx_ipv4_checksum
> * dp_packet_hwol_tx_ipv4_checksum now checks for PKT_TX_IPV4.
> * account for drops while preping the batch for TX.
> * don't prep the batch for TX if TSO is disabled.
> * simplified setsockopt error checking
> * fixed af_packet_sock error checking to not call setsockopt on
> closed sockets.
> * fixed ol_flags comment.
> * used VLOG_ERR_BUF() to pass error messages.
> * fixed packet leak at netdev_send_prepare_batch()
> * added a coverage counter to account drops while preparing a batch
> at netdev.c
> * fixed netdev_send() to not call ->send() if the batch is empty.
> * fixed packet leak at netdev_push_header and account for the drops.
> * removed DPDK requirement to enable userspace TSO support.
> * fixed parameter documentation in vswitch.xml.
> * renamed tso.rst to userspace-tso.rst and moved to topics/
> * added comments documeting the functions in dp-packet.h
> * fixed dp_packet_hwol_is_tso to check only PKT_TX_TCP_SEG
>
> - v3
> * Improved the documentation.
> * Updated copyright year to 2020.
> * TSO offloaded msg now includes the netdev's name.
> * Added period at the end of all code comments.
> * Warn and drop encapsulation of TSO packets.
> * Fixed travis issue with restricted virtio types.
> * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
> which caused packet corruption.
> * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
> PKT_TX_IP_CKSUM only for IPv4 packets.
> > > diff --git a/Documentation/automake.mk b/Documentation/automake.mk > index f2ca17bad..22976a3cd 100644 > --- a/Documentation/automake.mk > +++ b/Documentation/automake.mk > @@ -57,6 +57,7 @@ DOC_SOURCE = \ > Documentation/topics/ovsdb-replication.rst \ > Documentation/topics/porting.rst \ > Documentation/topics/tracing.rst \ > + Documentation/topics/userspace-tso.rst \ > Documentation/topics/windows.rst \ > Documentation/howto/index.rst \ > Documentation/howto/dpdk.rst \ > diff --git a/Documentation/topics/index.rst b/Documentation/topics/index.rst > index 34c4b10e0..08af3a24d 100644 > --- a/Documentation/topics/index.rst > +++ b/Documentation/topics/index.rst > @@ -50,5 +50,6 @@ OVS > language-bindings > testing > tracing > + userspace-tso > idl-compound-indexes > ovs-extensions > diff --git a/Documentation/topics/userspace-tso.rst > b/Documentation/topics/userspace-tso.rst > new file mode 100644 > index 000000000..893c64839 > --- /dev/null > +++ b/Documentation/topics/userspace-tso.rst > @@ -0,0 +1,98 @@ > +.. > + Copyright 2020, Red Hat, Inc. > + > + Licensed under the Apache License, Version 2.0 (the "License"); you may > + not use this file except in compliance with the License. You may obtain > + a copy of the License at > + > + http://www.apache.org/licenses/LICENSE-2.0 > + > + Unless required by applicable law or agreed to in writing, software > + distributed under the License is distributed on an "AS IS" BASIS, > WITHOUT > + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See > the > + License for the specific language governing permissions and limitations > + under the License. > + > + Convention for heading levels in Open vSwitch documentation: > + > + ======= Heading 0 (reserved for the title in a document) > + ------- Heading 1 > + ~~~~~~~ Heading 2 > + +++++++ Heading 3 > + ''''''' Heading 4 > + > + Avoid deeper levels because they do not render well. > + > +======================== > +Userspace Datapath - TSO > +======================== > + > +**Note:** This feature is considered experimental. > + > +TCP Segmentation Offload (TSO) enables a network stack to delegate > segmentation > +of an oversized TCP segment to the underlying physical NIC. Offload of frame > +segmentation achieves computational savings in the core, freeing up CPU > cycles > +for more useful work. > + > +A common use case for TSO is when using virtualization, where traffic that's > +coming in from a VM can offload the TCP segmentation, thus avoiding the > +fragmentation in software. Additionally, if the traffic is headed to a VM > +within the same host further optimization can be expected. As the traffic > never > +leaves the machine, no MTU needs to be accounted for, and thus no > segmentation > +and checksum calculations are required, which saves yet more cycles. Only > when > +the traffic actually leaves the host the segmentation needs to happen, in > which > +case it will be performed by the egress NIC. Consult your controller's > +datasheet for compatibility. Secondly, the NIC must have an associated DPDK > +Poll Mode Driver (PMD) which supports `TSO`. For a list of features per PMD, > +refer to the `DPDK documentation`__. > + > +__ https://doc.dpdk.org/guides-19.11/nics/overview.html > + > +Enabling TSO > +~~~~~~~~~~~~ > + > +The TSO support may be enabled via a global config value > +``userspace-tso-enable``. Setting this to ``true`` enables TSO support for > +all ports. > + > + $ ovs-vsctl set Open_vSwitch . 
other_config:userspace-tso-enable=true > + > +The default value is ``false``. > + > +Changing ``userspace-tso-enable`` requires restarting the daemon. > + > +When using :doc:`vHost User ports <dpdk/vhost-user>`, TSO may be enabled > +as follows. > + > +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest > +connection is established, `TSO` is thus advertised to the guest as an > +available feature: > + > +QEMU Command Line Parameter:: > + > + $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \ > + ... > + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\ > + csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\ > + ... > + > +2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can be > +used to enable same:: > + > + $ ethtool -K eth0 sg on # scatter-gather is a prerequisite for TSO > + $ ethtool -K eth0 tso on > + $ ethtool -k eth0 > + > +~~~~~~~~~~~ > +Limitations > +~~~~~~~~~~~ > + > +The current OvS userspace `TSO` implementation supports flat and VLAN > networks > +only (i.e. no support for `TSO` over tunneled connection [VxLAN, GRE, IPinIP, > +etc.]). > + > +There is no software implementation of TSO, so all ports attached to the > +datapath must support TSO or packets using that feature will be dropped > +on ports without TSO support. That also means guests using vhost-user > +in client mode will receive TSO packet regardless of TSO being enabled > +or disabled within the guest. > diff --git a/NEWS b/NEWS > index e8d662a0c..586d81173 100644 > --- a/NEWS > +++ b/NEWS > @@ -26,6 +26,7 @@ Post-v2.12.0 > * DPDK ring ports (dpdkr) are deprecated and will be removed in next > releases. > * Add support for DPDK 19.11. > + * Add experimental support for TSO. > - RSTP: > * The rstp_statistics column in Port table will only be updated every > stats-update-interval configured in Open_vSwtich table. > diff --git a/lib/automake.mk b/lib/automake.mk > index ebf714501..b80de9fc4 100644 > --- a/lib/automake.mk > +++ b/lib/automake.mk > @@ -304,6 +304,8 @@ lib_libopenvswitch_la_SOURCES = \ > lib/tnl-neigh-cache.h \ > lib/tnl-ports.c \ > lib/tnl-ports.h \ > + lib/userspace-tso.c \ > + lib/userspace-tso.h \ Alphabetical order is broken. I guess, could be fixed while applying? > lib/netdev-native-tnl.c \ > lib/netdev-native-tnl.h \ > lib/token-bucket.c \ > diff --git a/lib/conntrack.c b/lib/conntrack.c > index b80080e72..742d2ad4f 100644 > --- a/lib/conntrack.c > +++ b/lib/conntrack.c > @@ -2022,7 +2022,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet > *pkt, ovs_be16 dl_type, > if (hwol_bad_l3_csum) { > ok = false; > } else { > - bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt); > + bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt) > + || dp_packet_hwol_tx_ipv4_checksum(pkt); > /* Validate the checksum only when hwol is not supported. */ > ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), NULL, > !hwol_good_l3_csum); > @@ -2036,7 +2037,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet > *pkt, ovs_be16 dl_type, > if (ok) { > bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt); > if (!hwol_bad_l4_csum) { > - bool hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt); > + bool hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt) > + || dp_packet_hwol_tx_l4_checksum(pkt); > /* Validate the checksum only when hwol is not supported. 
*/
> if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
> &ctx->icmp_related, l3, !hwol_good_l4_csum,
> @@ -3237,8 +3239,11 @@ handle_ftp_ctl(struct conntrack *ct, const struct
> conn_lookup_ctx *ctx,
> }
> if (seq_skew) {
> ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
> - l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> - l3_hdr->ip_tot_len, htons(ip_len));
> + if (!dp_packet_hwol_tx_ipv4_checksum(pkt)) {
> + l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> + l3_hdr->ip_tot_len,
> + htons(ip_len));
> + }
> l3_hdr->ip_tot_len = htons(ip_len);
> }
> }
> @@ -3256,13 +3261,15 @@ handle_ftp_ctl(struct conntrack *ct, const struct
> conn_lookup_ctx *ctx,
> }
>
> th->tcp_csum = 0;
> - if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> - th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> - dp_packet_l4_size(pkt));
> - } else {
> - uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> - th->tcp_csum = csum_finish(
> - csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> + if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
> + if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> + th->tcp_csum = packet_csum_upperlayer6(nh6, th,
> ctx->key.nw_proto,
> + dp_packet_l4_size(pkt));
> + } else {
> + uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> + th->tcp_csum = csum_finish(
> + csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> + }
> }
>
> if (seq_skew) {
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index 133942155..3e995f505 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -456,7 +456,7 @@ dp_packet_init_specific(struct dp_packet *p)
> {
> /* This initialization is needed for packets that do not come from DPDK
> * interfaces, when vswitchd is built with --with-dpdk. */
> - p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> + p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;

I'm not very comfortable with this change since dp_packet_init_specific() is always called after dp_packet_reset_offloads(), which should clear ol_flags. However, since we're not clearing non-offloading flags now, this is needed.

I'd consider it a bug in the current code that we're not clearing memory layout related flags in mbuf.ol_flags for packets that come from non-DPDK ports. This needs to be fixed, maybe by a separate patch for previous OVS branches.

> p->mbuf.nb_segs = 1;
> p->mbuf.next = NULL;
> }
> @@ -519,6 +519,96 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
> b->mbuf.buf_len = s;
> }
>
> +/* Return true if packet 'b' offloads TCP segmentation. */

/* Returns 'true' if packet 'b' is marked for TCP segmentation offloading. */ ?

> +static inline bool
> +dp_packet_hwol_is_tso(const struct dp_packet *b)
> +{
> + return !!(b->mbuf.ol_flags & PKT_TX_TCP_SEG);
> +}
> +
> +/* Return true if packet 'b' is IPv4. The flag is required when
> + * offload is requested. */

/* Returns 'true' if packet 'b' is marked for IPv4 checksum offloading. */ ?

> +static inline bool
> +dp_packet_hwol_is_ipv4(const struct dp_packet *b)
> +{
> + return !!(b->mbuf.ol_flags & PKT_TX_IPV4);
> +}
> +
> +/* Return the L4 cksum offload bitmask. */
> +static inline uint64_t
> +dp_packet_hwol_l4_mask(const struct dp_packet *b)
> +{
> + return b->mbuf.ol_flags & PKT_TX_L4_MASK;
> +}
> +
> +/* Return true if the packet 'b' offloads TCP checksum calculation. */

/* Returns 'true' if packet 'b' is marked for TCP checksum offloading. */ ?
> +static inline bool > +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b) > +{ > + return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM; > +} > + > +/* Return true if the packet 'b' offloads UDP checksum calculation. */ /* Returns 'true' if packet 'b' is marked for UDP checksum offloading. */ ? > +static inline bool > +dp_packet_hwol_l4_is_udp(struct dp_packet *b) > +{ > + return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM; > +} > + > +/* Return true if the packet 'b' offloads SCTP checksum calculation. */ /* Returns 'true' if packet 'b' is marked for SCTP checksum offloading. */ ? > +static inline bool > +dp_packet_hwol_l4_is_sctp(struct dp_packet *b) > +{ > + return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_SCTP_CKSUM; > +} > + > +/* Flag the packet 'b' as IPv4 necessary when offload is used. */ /* Mark packet 'b' for IPv4 checksum offloading. */ ? > +static inline void > +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b) > +{ > + b->mbuf.ol_flags |= PKT_TX_IPV4; > +} > + > +/* Flag the packet 'b' as IPv6 necessary when offload is used. */ /* Mark packet 'b' for IPv6 checksum offloading. */ ? > +static inline void > +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b) > +{ > + b->mbuf.ol_flags |= PKT_TX_IPV6; > +} > + > +/* Request TCP checksum offload for packet 'b'. It implies that > + * either the packet 'b' is flagged as IPv4 or IPv6. */ /* Mark packet 'b' for TCP checksum offloading. It implies that either * the packet 'b' is marked for IPv4 or IPv6 checksum offloading. */ ? > +static inline void > +dp_packet_hwol_set_csum_tcp(struct dp_packet *b) > +{ > + b->mbuf.ol_flags |= PKT_TX_TCP_CKSUM; > +} > + > +/* Request UDP checksum offload for packet 'b'. It implies that > + * either the packet 'b' is flagged as IPv4 or IPv6. */ /* Mark packet 'b' for UDP checksum offloading. It implies that either * the packet 'b' is marked for IPv4 or IPv6 checksum offloading. */ ? > +static inline void > +dp_packet_hwol_set_csum_udp(struct dp_packet *b) > +{ > + b->mbuf.ol_flags |= PKT_TX_UDP_CKSUM; > +} > + > +/* Request SCTP checksum offload for packet 'b'. It implies that > + * either the packet 'b' is flagged as IPv4 or IPv6. */ /* Mark packet 'b' for SCTP checksum offloading. It implies that either * the packet 'b' is marked for IPv4 or IPv6 checksum offloading. */ ? > +static inline void > +dp_packet_hwol_set_csum_sctp(struct dp_packet *b) > +{ > + b->mbuf.ol_flags |= PKT_TX_SCTP_CKSUM; > +} > + > +/* Request TCP segmentation offload for packet 'b'. It implies that > + * either the packet 'b' is flagged as IPv4 or IPv6 and also implies > + * that TCP checksum offload is flagged. */ /* Mark packet 'b' for TCP segmentation offloading. It implies that * either the packet 'b' is marked for IPv4 or IPv6 checksum offloading * and also for TCP checksum offloading. */ ? > +static inline void > +dp_packet_hwol_set_tcp_seg(struct dp_packet *b) > +{ > + b->mbuf.ol_flags |= PKT_TX_TCP_SEG; > +} > + > /* Returns the RSS hash of the packet 'p'. Note that the returned value is > * correct only if 'dp_packet_rss_valid(p)' returns true */ > static inline uint32_t > @@ -648,6 +738,84 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s) > b->allocated_ = s; > } > > +/* There are no implementation when not DPDK enabled datapath. */ > +static inline bool > +dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED) > +{ > + return false; > +} > + > +/* There are no implementation when not DPDK enabled datapath. 
*/
> +static inline bool
> +dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED)
> +{
> + return false;
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline uint64_t
> +dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED)
> +{
> + return 0;
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline bool
> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED)
> +{
> + return false;
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline bool
> +dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED)
> +{
> + return false;
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline bool
> +dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED)
> +{
> + return false;
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline void
> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline void
> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline void
> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline void
> +dp_packet_hwol_set_csum_udp(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline void
> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline void
> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> /* Returns the RSS hash of the packet 'p'. Note that the returned value is
> * correct only if 'dp_packet_rss_valid(p)' returns true */
> static inline uint32_t
> @@ -939,6 +1107,22 @@ dp_packet_batch_reset_cutlen(struct dp_packet_batch
> *batch)
> }
> }
>
> +/* Return true if the packet 'b' requested IPv4 checksum offload. */
> +static inline bool
> +dp_packet_hwol_tx_ipv4_checksum(const struct dp_packet *b)

Why do we need this function? It just calls another publicly available function without any additional processing.

> +{
> +
> + return !!dp_packet_hwol_is_ipv4(b);

'dp_packet_hwol_is_ipv4()' returns a boolean, so there is no need for '!!'.

> +}
> +
> +/* Return true if the packet 'b' requested L4 checksum offload. */
> +static inline bool
> +dp_packet_hwol_tx_l4_checksum(const struct dp_packet *b)
> +{
> +

Redundant empty line.
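If the wrappers are kept anyway, they could be reduced to something like this (just an untested sketch to illustrate the two points above):

    static inline bool
    dp_packet_hwol_tx_ipv4_checksum(const struct dp_packet *b)
    {
        return dp_packet_hwol_is_ipv4(b);
    }

    static inline bool
    dp_packet_hwol_tx_l4_checksum(const struct dp_packet *b)
    {
        return dp_packet_hwol_l4_mask(b) != 0;
    }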
> + return !!dp_packet_hwol_l4_mask(b); > +} > + > #ifdef __cplusplus > } > #endif > diff --git a/lib/ipf.c b/lib/ipf.c > index 45c489122..14df04374 100644 > --- a/lib/ipf.c > +++ b/lib/ipf.c > @@ -433,9 +433,11 @@ ipf_reassemble_v4_frags(struct ipf_list *ipf_list) > len += rest_len; > l3 = dp_packet_l3(pkt); > ovs_be16 new_ip_frag_off = l3->ip_frag_off & ~htons(IP_MORE_FRAGMENTS); > - l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off, > - new_ip_frag_off); > - l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len)); > + if (!dp_packet_hwol_tx_ipv4_checksum(pkt)) { > + l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off, > + new_ip_frag_off); > + l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len)); > + } > l3->ip_tot_len = htons(len); > l3->ip_frag_off = new_ip_frag_off; > dp_packet_set_l2_pad_size(pkt, 0); > @@ -606,6 +608,7 @@ ipf_is_valid_v4_frag(struct ipf *ipf, struct dp_packet > *pkt) > } > > if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt) > + && !dp_packet_hwol_tx_ipv4_checksum(pkt) > && csum(l3, ip_hdr_len) != 0)) { > goto invalid_pkt; > } > @@ -1181,16 +1184,21 @@ ipf_post_execute_reass_pkts(struct ipf *ipf, > } else { > struct ip_header *l3_frag = dp_packet_l3(frag_0->pkt); > struct ip_header *l3_reass = dp_packet_l3(pkt); > - ovs_be32 reass_ip = > get_16aligned_be32(&l3_reass->ip_src); > - ovs_be32 frag_ip = get_16aligned_be32(&l3_frag->ip_src); > - l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum, > - frag_ip, reass_ip); > - l3_frag->ip_src = l3_reass->ip_src; > + if (!dp_packet_hwol_tx_ipv4_checksum(frag_0->pkt)) { > + ovs_be32 reass_ip = > + get_16aligned_be32(&l3_reass->ip_src); > + ovs_be32 frag_ip = > + get_16aligned_be32(&l3_frag->ip_src); > + > + l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum, > + frag_ip, reass_ip); > + reass_ip = get_16aligned_be32(&l3_reass->ip_dst); > + frag_ip = get_16aligned_be32(&l3_frag->ip_dst); > + l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum, > + frag_ip, reass_ip); > + } > > - reass_ip = get_16aligned_be32(&l3_reass->ip_dst); > - frag_ip = get_16aligned_be32(&l3_frag->ip_dst); > - l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum, > - frag_ip, reass_ip); > + l3_frag->ip_src = l3_reass->ip_src; > l3_frag->ip_dst = l3_reass->ip_dst; > } > > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c > index d1469f6f2..48fd6c184 100644 > --- a/lib/netdev-dpdk.c > +++ b/lib/netdev-dpdk.c > @@ -70,6 +70,7 @@ > #include "smap.h" > #include "sset.h" > #include "timeval.h" > +#include "userspace-tso.h" Slightly out of alphabetical order. > #include "unaligned.h" > #include "unixctl.h" > #include "util.h" > @@ -201,6 +202,8 @@ struct netdev_dpdk_sw_stats { > uint64_t tx_qos_drops; > /* Packet drops in ingress policer processing. */ > uint64_t rx_qos_drops; > + /* Packet drops in HWOL processing */ Period at the end of comment. 
> + uint64_t tx_invalid_hwol_drops; > }; > > enum { DPDK_RING_SIZE = 256 }; > @@ -410,7 +413,8 @@ struct ingress_policer { > enum dpdk_hw_ol_features { > NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0, > NETDEV_RX_HW_CRC_STRIP = 1 << 1, > - NETDEV_RX_HW_SCATTER = 1 << 2 > + NETDEV_RX_HW_SCATTER = 1 << 2, > + NETDEV_TX_TSO_OFFLOAD = 1 << 3, > }; > > /* > @@ -992,6 +996,12 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int > n_rxq, int n_txq) > conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC; > } > > + if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) { > + conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO; > + conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM; > + conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM; > + } > + > /* Limit configured rss hash functions to only those supported > * by the eth device. */ > conf.rx_adv_conf.rss_conf.rss_hf &= info.flow_type_rss_offloads; > @@ -1093,6 +1103,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) > uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM | > DEV_RX_OFFLOAD_TCP_CKSUM | > DEV_RX_OFFLOAD_IPV4_CKSUM; > + uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO | > + DEV_TX_OFFLOAD_TCP_CKSUM | > + DEV_TX_OFFLOAD_IPV4_CKSUM; > > rte_eth_dev_info_get(dev->port_id, &info); > > @@ -1119,6 +1132,14 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) > dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER; > } > > + if (info.tx_offload_capa & tx_tso_offload_capa) { > + dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD; > + } else { > + dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD; > + VLOG_WARN("Tx TSO offload is not supported on %s port " > + DPDK_PORT_ID_FMT, netdev_get_name(&dev->up), dev->port_id); > + } > + > n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq); > n_txq = MIN(info.max_tx_queues, dev->up.n_txq); > > @@ -1369,14 +1390,16 @@ netdev_dpdk_vhost_construct(struct netdev *netdev) > goto out; > } > > - err = rte_vhost_driver_disable_features(dev->vhost_id, > - 1ULL << VIRTIO_NET_F_HOST_TSO4 > - | 1ULL << VIRTIO_NET_F_HOST_TSO6 > - | 1ULL << VIRTIO_NET_F_CSUM); > - if (err) { > - VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user " > - "port: %s\n", name); > - goto out; > + if (!userspace_tso_enabled()) { > + err = rte_vhost_driver_disable_features(dev->vhost_id, > + 1ULL << VIRTIO_NET_F_HOST_TSO4 > + | 1ULL << VIRTIO_NET_F_HOST_TSO6 > + | 1ULL << VIRTIO_NET_F_CSUM); > + if (err) { > + VLOG_ERR("rte_vhost_driver_disable_features failed for vhost > user " > + "port: %s\n", name); > + goto out; > + } > } > > err = rte_vhost_driver_start(dev->vhost_id); > @@ -1711,6 +1734,11 @@ netdev_dpdk_get_config(const struct netdev *netdev, > struct smap *args) > } else { > smap_add(args, "rx_csum_offload", "false"); > } > + if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) { > + smap_add(args, "tx_tso_offload", "true"); > + } else { > + smap_add(args, "tx_tso_offload", "false"); > + } > smap_add(args, "lsc_interrupt_mode", > dev->lsc_interrupt_mode ? "true" : "false"); > } > @@ -2138,6 +2166,67 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq) > rte_free(rx); > } > > +/* Prepare the packet for HWOL. > + * Return True if the packet is OK to continue. 
*/ > +static bool > +netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf) > +{ > + struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf); > + > + if (mbuf->ol_flags & PKT_TX_L4_MASK) { > + mbuf->l2_len = (char *)dp_packet_l3(pkt) - (char > *)dp_packet_eth(pkt); > + mbuf->l3_len = (char *)dp_packet_l4(pkt) - (char *)dp_packet_l3(pkt); > + mbuf->outer_l2_len = 0; > + mbuf->outer_l3_len = 0; > + } > + > + if (mbuf->ol_flags & PKT_TX_TCP_SEG) { > + struct tcp_header *th = dp_packet_l4(pkt); > + > + if (!th) { > + VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header" > + " pkt len: %"PRIu32"", dev->up.name, mbuf->pkt_len); > + return false; > + } > + > + mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4; > + mbuf->ol_flags |= PKT_TX_TCP_CKSUM; > + mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len; > + > + if (mbuf->ol_flags & PKT_TX_IPV4) { > + mbuf->ol_flags |= PKT_TX_IP_CKSUM; > + } > + } > + return true; > +} > + > +/* Prepare a batch for HWOL. > + * Return the number of good packets in the batch. */ > +static int > +netdev_dpdk_prep_hwol_batch(struct netdev_dpdk *dev, struct rte_mbuf **pkts, > + int pkt_cnt) > +{ > + int i = 0; > + int cnt = 0; > + struct rte_mbuf *pkt; > + > + /* Prepare and filter bad HWOL packets. */ > + for (i = 0; i < pkt_cnt; i++) { > + pkt = pkts[i]; > + if (!netdev_dpdk_prep_hwol_packet(dev, pkt)) { > + rte_pktmbuf_free(pkt); > + continue; > + } > + > + if (OVS_UNLIKELY(i != cnt)) { > + pkts[cnt] = pkt; > + } > + cnt++; > + } > + > + return cnt; > +} > + > /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'. Takes ownership of > * 'pkts', even in case of failure. > * > @@ -2147,11 +2236,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int > qid, > struct rte_mbuf **pkts, int cnt) > { > uint32_t nb_tx = 0; > + uint16_t nb_tx_prep = cnt; > + > + if (userspace_tso_enabled()) { > + nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt); > + if (nb_tx_prep != cnt) { > + VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. " > + "Only %u/%u are valid: %s", dev->up.name, > nb_tx_prep, > + cnt, rte_strerror(rte_errno)); > + } > + } > > - while (nb_tx != cnt) { > + while (nb_tx != nb_tx_prep) { > uint32_t ret; > > - ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx); > + ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, > + nb_tx_prep - nb_tx); > if (!ret) { > break; > } > @@ -2437,11 +2537,14 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk > *dev, struct rte_mbuf **pkts, > int cnt = 0; > struct rte_mbuf *pkt; > > + /* Filter oversized packets, unless are marked for TSO. 
*/ > for (i = 0; i < pkt_cnt; i++) { > pkt = pkts[i]; > - if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) { > - VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " max_packet_len > %d", > - dev->up.name, pkt->pkt_len, dev->max_packet_len); > + if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len) > + && !(pkt->ol_flags & PKT_TX_TCP_SEG))) { > + VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " " > + "max_packet_len %d", dev->up.name, pkt->pkt_len, > + dev->max_packet_len); > rte_pktmbuf_free(pkt); > continue; > } > @@ -2463,7 +2566,8 @@ netdev_dpdk_vhost_update_tx_counters(struct netdev_dpdk > *dev, > { > int dropped = sw_stats_add->tx_mtu_exceeded_drops + > sw_stats_add->tx_qos_drops + > - sw_stats_add->tx_failure_drops; > + sw_stats_add->tx_failure_drops + > + sw_stats_add->tx_invalid_hwol_drops; > struct netdev_stats *stats = &dev->stats; > int sent = attempted - dropped; > int i; > @@ -2482,6 +2586,7 @@ netdev_dpdk_vhost_update_tx_counters(struct netdev_dpdk > *dev, > sw_stats->tx_failure_drops += sw_stats_add->tx_failure_drops; > sw_stats->tx_mtu_exceeded_drops += > sw_stats_add->tx_mtu_exceeded_drops; > sw_stats->tx_qos_drops += sw_stats_add->tx_qos_drops; > + sw_stats->tx_invalid_hwol_drops += > sw_stats_add->tx_invalid_hwol_drops; > } > } > > @@ -2513,8 +2618,15 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int > qid, > rte_spinlock_lock(&dev->tx_q[qid].tx_lock); > } > > + sw_stats_add.tx_invalid_hwol_drops = cnt; > + if (userspace_tso_enabled()) { > + cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt); > + } > + > + sw_stats_add.tx_invalid_hwol_drops -= cnt; > + sw_stats_add.tx_mtu_exceeded_drops = cnt; > cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt); > - sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt; > + sw_stats_add.tx_mtu_exceeded_drops -= cnt; > > /* Check has QoS has been configured for the netdev */ > sw_stats_add.tx_qos_drops = cnt; > @@ -2562,6 +2674,121 @@ out: > } > } > > +static void > +netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque) > +{ > + rte_free(opaque); > +} > + > +static struct rte_mbuf * > +dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len) > +{ > + uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len; > + struct rte_mbuf_ext_shared_info *shinfo = NULL; > + uint16_t buf_len; > + void *buf; > + > + if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo)) { Please, don't parenthesize argument of sizeof if it's a variable. > + shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *); > + } else { > + total_len += sizeof(*shinfo) + sizeof(uintptr_t); Ditto. > + total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t)); > + } > + > + if (unlikely(total_len > UINT16_MAX)) { > + VLOG_ERR("Can't copy packet: too big %u", total_len); > + return NULL; > + } > + > + buf_len = total_len; > + buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE); > + if (unlikely(buf == NULL)) { OVS_UNLIKELY > + VLOG_ERR("Failed to allocate memory using rte_malloc: %u", buf_len); > + return NULL; > + } > + > + /* Initialize shinfo. 
*/
> + if (shinfo) {
> + shinfo->free_cb = netdev_dpdk_extbuf_free;
> + shinfo->fcb_opaque = buf;
> + rte_mbuf_ext_refcnt_set(shinfo, 1);
> + } else {
> + shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> + netdev_dpdk_extbuf_free,
> + buf);
> + if (unlikely(shinfo == NULL)) {

OVS_UNLIKELY

> + rte_free(buf);
> + VLOG_ERR("Failed to initialize shared info for mbuf while "
> + "attempting to attach an external buffer.");
> + return NULL;
> + }
> + }
> +
> + rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,
> + shinfo);
> + rte_pktmbuf_reset_headroom(pkt);
> +
> + return pkt;
> +}
> +
> +static struct rte_mbuf *
> +dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
> +{
> + struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> +
> + if (OVS_UNLIKELY(!pkt)) {
> + return NULL;
> + }
> +
> + dp_packet_init_specific((struct dp_packet *)pkt);

Why is this needed? rte_pktmbuf_alloc() always resets the mbuf, clearing all the required fields.

> + if (rte_pktmbuf_tailroom(pkt) >= data_len) {
> + return pkt;
> + }
> +
> + if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
> + return pkt;
> + }
> +
> + rte_pktmbuf_free(pkt);
> +
> + return NULL;
> +}
> +
> +static struct dp_packet *
> +dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet
> *pkt_orig)
> +{
> + struct rte_mbuf *mbuf_dest;
> + struct dp_packet *pkt_dest;
> + uint32_t pkt_len;
> +
> + pkt_len = dp_packet_size(pkt_orig);
> + mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len);
> + if (OVS_UNLIKELY(mbuf_dest == NULL)) {
> + return NULL;
> + }
> +
> + pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
> + memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len);
> + dp_packet_set_size(pkt_dest, pkt_len);
> +
> + mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
> + mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
> + mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
> + ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
> +
> + memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
> + sizeof(struct dp_packet) - offsetof(struct dp_packet,
> l2_pad_size));

The above looks like code duplication with dp_packet_clone_with_headroom(). It might be better to strip the common parts of dp_packet_clone_with_headroom() into a separate function and call it here. However, this might be done as a separate patch later.

> +
> + if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
> + mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
> + - (char *)dp_packet_eth(pkt_dest);
> + mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
> + - (char *) dp_packet_l3(pkt_dest);
> + }
> +
> + return pkt_dest;
> +}
> +
> /* Tx function.
Transmit packets indefinitely */ > static void > dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch > *batch) > @@ -2575,7 +2802,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct > dp_packet_batch *batch) > enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST }; > #endif > struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > - struct rte_mbuf *pkts[PKT_ARRAY_SIZE]; > + struct dp_packet *pkts[PKT_ARRAY_SIZE]; > struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats; > uint32_t cnt = batch_cnt; > uint32_t dropped = 0; > @@ -2596,34 +2823,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, > struct dp_packet_batch *batch) > struct dp_packet *packet = batch->packets[i]; > uint32_t size = dp_packet_size(packet); > > - if (OVS_UNLIKELY(size > dev->max_packet_len)) { > - VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", > - size, dev->max_packet_len); > - > + if (size > dev->max_packet_len > + && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) { > + VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size, > + dev->max_packet_len); > mtu_drops++; > continue; > } > > - pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp); > + pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet); > if (OVS_UNLIKELY(!pkts[txcnt])) { > dropped = cnt - i; > break; > } > > - /* We have to do a copy for now */ > - memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *), > - dp_packet_data(packet), size); > - dp_packet_set_size((struct dp_packet *)pkts[txcnt], size); > - > txcnt++; > } > > if (OVS_LIKELY(txcnt)) { > if (dev->type == DPDK_DEV_VHOST) { > - __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet **) pkts, > - txcnt); > + __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt); > } else { > - tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, txcnt); > + tx_failure += netdev_dpdk_eth_tx_burst(dev, qid, > + (struct rte_mbuf **)pkts, > + txcnt); > } > } > > @@ -2676,26 +2899,33 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid, > dp_packet_delete_batch(batch, true); > } else { > struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats; > - int tx_cnt, dropped; > - int tx_failure, mtu_drops, qos_drops; > + int dropped; > + int tx_failure, mtu_drops, qos_drops, hwol_drops; > int batch_cnt = dp_packet_batch_size(batch); > struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets; > > - tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt); > - mtu_drops = batch_cnt - tx_cnt; > - qos_drops = tx_cnt; > - tx_cnt = netdev_dpdk_qos_run(dev, pkts, tx_cnt, true); > - qos_drops -= tx_cnt; > + hwol_drops = batch_cnt; > + if (userspace_tso_enabled()) { > + batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt); > + } > + hwol_drops -= batch_cnt; > + mtu_drops = batch_cnt; > + batch_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt); > + mtu_drops -= batch_cnt; > + qos_drops = batch_cnt; > + batch_cnt = netdev_dpdk_qos_run(dev, pkts, batch_cnt, true); > + qos_drops -= batch_cnt; > > - tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, tx_cnt); > + tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, batch_cnt); > > - dropped = tx_failure + mtu_drops + qos_drops; > + dropped = tx_failure + mtu_drops + qos_drops + hwol_drops; > if (OVS_UNLIKELY(dropped)) { > rte_spinlock_lock(&dev->stats_lock); > dev->stats.tx_dropped += dropped; > sw_stats->tx_failure_drops += tx_failure; > sw_stats->tx_mtu_exceeded_drops += mtu_drops; > sw_stats->tx_qos_drops += qos_drops; > + sw_stats->tx_invalid_hwol_drops += hwol_drops; > rte_spinlock_unlock(&dev->stats_lock); > } > } > @@ -3011,7 
+3241,8 @@ netdev_dpdk_get_sw_custom_stats(const struct netdev > *netdev, > SW_CSTAT(tx_failure_drops) \ > SW_CSTAT(tx_mtu_exceeded_drops) \ > SW_CSTAT(tx_qos_drops) \ > - SW_CSTAT(rx_qos_drops) > + SW_CSTAT(rx_qos_drops) \ > + SW_CSTAT(tx_invalid_hwol_drops) > > #define SW_CSTAT(NAME) + 1 > custom_stats->size = SW_CSTATS; > @@ -4874,6 +5105,12 @@ netdev_dpdk_reconfigure(struct netdev *netdev) > > rte_free(dev->tx_q); > err = dpdk_eth_dev_init(dev); > + if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) { > + netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO; > + netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM; > + netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM; > + } > + > dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq); > if (!dev->tx_q) { > err = ENOMEM; > @@ -4903,6 +5140,11 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev) > dev->tx_q[0].map = 0; > } > > + if (userspace_tso_enabled()) { > + dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD; > + VLOG_DBG("%s: TSO enabled on vhost port", netdev_get_name(&dev->up)); > + } > + > netdev_dpdk_remap_txqs(dev); > > err = netdev_dpdk_mempool_configure(dev); > @@ -4975,6 +5217,11 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev > *netdev) > vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY; > } > > + /* Enable External Buffers if TCP Segmentation Offload is enabled. */ > + if (userspace_tso_enabled()) { > + vhost_flags |= RTE_VHOST_USER_EXTBUF_SUPPORT; > + } > + > err = rte_vhost_driver_register(dev->vhost_id, vhost_flags); > if (err) { > VLOG_ERR("vhost-user device setup failure for device %s\n", > @@ -4999,14 +5246,20 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev > *netdev) > goto unlock; > } > > - err = rte_vhost_driver_disable_features(dev->vhost_id, > - 1ULL << VIRTIO_NET_F_HOST_TSO4 > - | 1ULL << VIRTIO_NET_F_HOST_TSO6 > - | 1ULL << VIRTIO_NET_F_CSUM); > - if (err) { > - VLOG_ERR("rte_vhost_driver_disable_features failed for vhost > user " > - "client port: %s\n", dev->up.name); > - goto unlock; > + if (userspace_tso_enabled()) { > + netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO; > + netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM; > + netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM; > + } else { > + err = rte_vhost_driver_disable_features(dev->vhost_id, > + 1ULL << VIRTIO_NET_F_HOST_TSO4 > + | 1ULL << VIRTIO_NET_F_HOST_TSO6 > + | 1ULL << VIRTIO_NET_F_CSUM); > + if (err) { > + VLOG_ERR("rte_vhost_driver_disable_features failed for " > + "vhost user client port: %s\n", dev->up.name); > + goto unlock; > + } > } > > err = rte_vhost_driver_start(dev->vhost_id); > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h > index f08159aa7..9dbc67658 100644 > --- a/lib/netdev-linux-private.h > +++ b/lib/netdev-linux-private.h > @@ -27,6 +27,7 @@ > #include <stdint.h> > #include <stdbool.h> > > +#include "dp-packet.h" > #include "netdev-afxdp.h" > #include "netdev-afxdp-pool.h" > #include "netdev-provider.h" > @@ -37,10 +38,13 @@ > > struct netdev; > > +#define LINUX_RXQ_TSO_MAX_LEN 65536 > + > struct netdev_rxq_linux { > struct netdev_rxq up; > bool is_tap; > int fd; > + char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. > */ > }; > > int netdev_linux_construct(struct netdev *); > @@ -92,6 +96,7 @@ struct netdev_linux { > int tap_fd; > bool present; /* If the device is present in the namespace > */ > uint64_t tx_dropped; /* tap device can drop if the iface is down > */ > + uint64_t rx_dropped; /* Packets dropped while recv from kernel. */ > > /* LAG information. 
*/ > bool is_lag_master; /* True if the netdev is a LAG master. */ > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c > index 41d1e9273..c308abf54 100644 > --- a/lib/netdev-linux.c > +++ b/lib/netdev-linux.c > @@ -29,16 +29,18 @@ > #include <linux/filter.h> > #include <linux/gen_stats.h> > #include <linux/if_ether.h> > +#include <linux/if_packet.h> > #include <linux/if_tun.h> > #include <linux/types.h> > #include <linux/ethtool.h> > #include <linux/mii.h> > #include <linux/rtnetlink.h> > #include <linux/sockios.h> > +#include <linux/virtio_net.h> > #include <sys/ioctl.h> > #include <sys/socket.h> > +#include <sys/uio.h> > #include <sys/utsname.h> > -#include <netpacket/packet.h> > #include <net/if.h> > #include <net/if_arp.h> > #include <net/route.h> > @@ -72,6 +74,7 @@ > #include "socket-util.h" > #include "sset.h" > #include "tc.h" > +#include "userspace-tso.h" Alphabetical order. > #include "timer.h" > #include "unaligned.h" > #include "openvswitch/vlog.h" > @@ -237,6 +240,16 @@ enum { > VALID_DRVINFO = 1 << 6, > VALID_FEATURES = 1 << 7, > }; > + > +/* Use one for the packet buffer and another for the aux buffer to receive > + * TSO packets. */ > +#define IOV_STD_SIZE 1 > +#define IOV_TSO_SIZE 2 > + > +enum { > + IOV_PACKET = 0, > + IOV_AUXBUF = 1, > +}; > > struct linux_lag_slave { > uint32_t block_id; > @@ -501,6 +514,8 @@ static struct vlog_rate_limit rl = > VLOG_RATE_LIMIT_INIT(5, 20); > * changes in the device miimon status, so we can use atomic_count. */ > static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0); > > +static int netdev_linux_parse_vnet_hdr(struct dp_packet *b); > +static void netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu); > static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *, > int cmd, const char *cmd_name); > static int get_flags(const struct netdev *, unsigned int *flags); > @@ -902,6 +917,13 @@ netdev_linux_common_construct(struct netdev *netdev_) > /* The device could be in the same network namespace or in another one. > */ > netnsid_unset(&netdev->netnsid); > ovs_mutex_init(&netdev->mutex); > + > + if (userspace_tso_enabled()) { > + netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO; > + netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM; > + netdev_->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM; > + } > + > return 0; > } > > @@ -961,6 +983,10 @@ netdev_linux_construct_tap(struct netdev *netdev_) > /* Create tap device. */ > get_flags(&netdev->up, &netdev->ifi_flags); > ifr.ifr_flags = IFF_TAP | IFF_NO_PI; > + if (userspace_tso_enabled()) { > + ifr.ifr_flags |= IFF_VNET_HDR; > + } > + > ovs_strzcpy(ifr.ifr_name, name, sizeof ifr.ifr_name); > if (ioctl(netdev->tap_fd, TUNSETIFF, &ifr) == -1) { > VLOG_WARN("%s: creating tap device failed: %s", name, > @@ -1024,6 +1050,15 @@ static struct netdev_rxq * > netdev_linux_rxq_alloc(void) > { > struct netdev_rxq_linux *rx = xzalloc(sizeof *rx); > + if (userspace_tso_enabled()) { > + int i; > + > + /* Allocate auxiliay buffers to receive TSO packets */ Period at the end of comment. 
> + for (i = 0; i < NETDEV_MAX_BURST; i++) { > + rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN); > + } > + } > + > return &rx->up; > } > > @@ -1069,6 +1104,15 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_) > goto error; > } > > + if (userspace_tso_enabled() > + && setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val, > + sizeof val)) { > + error = errno; > + VLOG_ERR("%s: failed to enable vnet hdr in txq raw socket: %s", > + netdev_get_name(netdev_), ovs_strerror(errno)); > + goto error; > + } > + > /* Set non-blocking mode. */ > error = set_nonblocking(rx->fd); > if (error) { > @@ -1119,10 +1163,15 @@ static void > netdev_linux_rxq_destruct(struct netdev_rxq *rxq_) > { > struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_); > + int i; > > if (!rx->is_tap) { > close(rx->fd); > } > + > + for (i = 0; i < NETDEV_MAX_BURST; i++) { > + free(rx->aux_bufs[i]); > + } > } > > static void > @@ -1159,12 +1208,14 @@ auxdata_has_vlan_tci(const struct tpacket_auxdata > *aux) > * It also used recvmmsg to reduce multiple syscalls overhead; > */ > static int > -netdev_linux_batch_rxq_recv_sock(int fd, int mtu, > +netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu, > struct dp_packet_batch *batch) > { > - size_t size; > + int iovlen; > + size_t std_len; > ssize_t retval; > - struct iovec iovs[NETDEV_MAX_BURST]; > + int virtio_net_hdr_size; > + struct iovec iovs[NETDEV_MAX_BURST][IOV_TSO_SIZE]; > struct cmsghdr *cmsg; > union { > struct cmsghdr cmsg; > @@ -1174,41 +1225,87 @@ netdev_linux_batch_rxq_recv_sock(int fd, int mtu, > struct dp_packet *buffers[NETDEV_MAX_BURST]; > int i; > > + if (userspace_tso_enabled()) { > + /* Use the buffer from the allocated packet below to receive MTU > + * sized packets and an aux_buf for extra TSO data. */ > + iovlen = IOV_TSO_SIZE; > + virtio_net_hdr_size = sizeof(struct virtio_net_hdr); > + } else { > + /* Use only the buffer from the allocated packet. 
*/ > + iovlen = IOV_STD_SIZE; > + virtio_net_hdr_size = 0; > + } > + > + std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size; > for (i = 0; i < NETDEV_MAX_BURST; i++) { > - buffers[i] = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu, > - DP_NETDEV_HEADROOM); > - /* Reserve headroom for a single VLAN tag */ > - dp_packet_reserve(buffers[i], VLAN_HEADER_LEN); > - size = dp_packet_tailroom(buffers[i]); > - iovs[i].iov_base = dp_packet_data(buffers[i]); > - iovs[i].iov_len = size; > + buffers[i] = dp_packet_new_with_headroom(std_len, > DP_NETDEV_HEADROOM); > + iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]); > + iovs[i][IOV_PACKET].iov_len = std_len; > + iovs[i][IOV_AUXBUF].iov_base = rx->aux_bufs[i]; > + iovs[i][IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN; > mmsgs[i].msg_hdr.msg_name = NULL; > mmsgs[i].msg_hdr.msg_namelen = 0; > - mmsgs[i].msg_hdr.msg_iov = &iovs[i]; > - mmsgs[i].msg_hdr.msg_iovlen = 1; > + mmsgs[i].msg_hdr.msg_iov = iovs[i]; > + mmsgs[i].msg_hdr.msg_iovlen = iovlen; > mmsgs[i].msg_hdr.msg_control = &cmsg_buffers[i]; > mmsgs[i].msg_hdr.msg_controllen = sizeof cmsg_buffers[i]; > mmsgs[i].msg_hdr.msg_flags = 0; > } > > do { > - retval = recvmmsg(fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, NULL); > + retval = recvmmsg(rx->fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, NULL); > } while (retval < 0 && errno == EINTR); > > if (retval < 0) { > - /* Save -errno to retval temporarily */ > - retval = -errno; > - i = 0; > - goto free_buffers; > + retval = errno; > + for (i = 0; i < NETDEV_MAX_BURST; i++) { > + dp_packet_delete(buffers[i]); > + } > + > + return retval; > } > > for (i = 0; i < retval; i++) { > if (mmsgs[i].msg_len < ETH_HEADER_LEN) { > - break; > + struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up); > + struct netdev_linux *netdev = netdev_linux_cast(netdev_); > + > + dp_packet_delete(buffers[i]); > + netdev->rx_dropped += 1; > + VLOG_WARN_RL(&rl, "%s: Dropped packet: less than ether hdr size", > + netdev_get_name(netdev_)); > + continue; > + } > + > + if (mmsgs[i].msg_len > std_len) { > + /* Build a single linear TSO packet by expanding the current > packet > + * to append the data received in the aux_buf. */ > + size_t extra_len = mmsgs[i].msg_len - std_len; > + > + dp_packet_set_size(buffers[i], dp_packet_size(buffers[i]) > + + std_len); > + dp_packet_prealloc_tailroom(buffers[i], extra_len); > + memcpy(dp_packet_tail(buffers[i]), rx->aux_bufs[i], extra_len); > + dp_packet_set_size(buffers[i], dp_packet_size(buffers[i]) > + + extra_len); > + } else { > + dp_packet_set_size(buffers[i], dp_packet_size(buffers[i]) > + + mmsgs[i].msg_len); > } > > - dp_packet_set_size(buffers[i], > - dp_packet_size(buffers[i]) + mmsgs[i].msg_len); > + if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffers[i])) { > + struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up); > + struct netdev_linux *netdev = netdev_linux_cast(netdev_); > + > + /* Unexpected error situation: the virtio header is not present > + * or corrupted. Drop the packet but continue in case next ones > + * are correct. 
*/ > + dp_packet_delete(buffers[i]); > + netdev->rx_dropped += 1; > + VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net > header", > + netdev_get_name(netdev_)); > + continue; > + } > > for (cmsg = CMSG_FIRSTHDR(&mmsgs[i].msg_hdr); cmsg; > cmsg = CMSG_NXTHDR(&mmsgs[i].msg_hdr, cmsg)) { > @@ -1238,22 +1335,11 @@ netdev_linux_batch_rxq_recv_sock(int fd, int mtu, > dp_packet_batch_add(batch, buffers[i]); > } > > -free_buffers: > - /* Free unused buffers, including buffers whose size is less than > - * ETH_HEADER_LEN. > - * > - * Note: i has been set correctly by the above for loop, so don't > - * try to re-initialize it. > - */ > + /* Delete unused buffers */ Period at the end of comment. > for (; i < NETDEV_MAX_BURST; i++) { > dp_packet_delete(buffers[i]); > } > > - /* netdev_linux_rxq_recv needs it to return 0 or positive errno */ > - if (retval < 0) { > - return -retval; > - }> - > return 0; > } > > @@ -1263,20 +1349,40 @@ free_buffers: > * packets are added into *batch. The return value is 0 or errno. > */ > static int > -netdev_linux_batch_rxq_recv_tap(int fd, int mtu, struct dp_packet_batch > *batch) > +netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu, > + struct dp_packet_batch *batch) > { > struct dp_packet *buffer; > + int virtio_net_hdr_size; > ssize_t retval; > - size_t size; > + size_t std_len; > + int iovlen; > int i; > > + if (userspace_tso_enabled()) { > + /* Use the buffer from the allocated packet below to receive MTU > + * sized packets and an aux_buf for extra TSO data. */ > + iovlen = IOV_TSO_SIZE; > + virtio_net_hdr_size = sizeof(struct virtio_net_hdr); > + } else { > + /* Use only the buffer from the allocated packet. */ > + iovlen = IOV_STD_SIZE; > + virtio_net_hdr_size = 0; > + } > + > + std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size; > for (i = 0; i < NETDEV_MAX_BURST; i++) { > + struct iovec iov[IOV_TSO_SIZE]; > + > /* Assume Ethernet port. No need to set packet_type. */ > - buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu, > - DP_NETDEV_HEADROOM); > - size = dp_packet_tailroom(buffer); > + buffer = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM); > + iov[IOV_PACKET].iov_base = dp_packet_data(buffer); > + iov[IOV_PACKET].iov_len = std_len; > + iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i]; > + iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN; > + > do { > - retval = read(fd, dp_packet_data(buffer), size); > + retval = readv(rx->fd, iov, iovlen); > } while (retval < 0 && errno == EINTR); > > if (retval < 0) { > @@ -1284,7 +1390,33 @@ netdev_linux_batch_rxq_recv_tap(int fd, int mtu, > struct dp_packet_batch *batch) > break; > } > > - dp_packet_set_size(buffer, dp_packet_size(buffer) + retval); > + if (retval > std_len) { > + /* Build a single linear TSO packet by expanding the current > packet > + * to append the data received in the aux_buf. */ > + size_t extra_len = retval - std_len; > + > + dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len); > + dp_packet_prealloc_tailroom(buffer, extra_len); > + memcpy(dp_packet_tail(buffer), rx->aux_bufs[i], extra_len); > + dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len); > + } else { > + dp_packet_set_size(buffer, dp_packet_size(buffer) + retval); > + } > + > + if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffer)) { > + struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up); > + struct netdev_linux *netdev = netdev_linux_cast(netdev_); > + > + /* Unexpected error situation: the virtio header is not present > + * or corrupted. 
Drop the packet but continue in case next ones > + * are correct. */ > + dp_packet_delete(buffer); > + netdev->rx_dropped += 1; > + VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net > header", > + netdev_get_name(netdev_)); > + continue; > + } > + > dp_packet_batch_add(batch, buffer); > } > > @@ -1310,8 +1442,8 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct > dp_packet_batch *batch, > > dp_packet_batch_init(batch); > retval = (rx->is_tap > - ? netdev_linux_batch_rxq_recv_tap(rx->fd, mtu, batch) > - : netdev_linux_batch_rxq_recv_sock(rx->fd, mtu, batch)); > + ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch) > + : netdev_linux_batch_rxq_recv_sock(rx, mtu, batch)); > > if (retval) { > if (retval != EAGAIN && retval != EMSGSIZE) { > @@ -1353,7 +1485,7 @@ netdev_linux_rxq_drain(struct netdev_rxq *rxq_) > } > > static int > -netdev_linux_sock_batch_send(int sock, int ifindex, > +netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu, > struct dp_packet_batch *batch) > { > const size_t size = dp_packet_batch_size(batch); > @@ -1367,6 +1499,10 @@ netdev_linux_sock_batch_send(int sock, int ifindex, > > struct dp_packet *packet; > DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { > + if (tso) { > + netdev_linux_prepend_vnet_hdr(packet, mtu); > + } > + > iov[i].iov_base = dp_packet_data(packet); > iov[i].iov_len = dp_packet_size(packet); > mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll, > @@ -1399,7 +1535,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex, > * on other interface types because we attach a socket filter to the rx > * socket. */ > static int > -netdev_linux_tap_batch_send(struct netdev *netdev_, > +netdev_linux_tap_batch_send(struct netdev *netdev_, bool tso, int mtu, > struct dp_packet_batch *batch) > { > struct netdev_linux *netdev = netdev_linux_cast(netdev_); > @@ -1416,10 +1552,15 @@ netdev_linux_tap_batch_send(struct netdev *netdev_, > } > > DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { > - size_t size = dp_packet_size(packet); > + size_t size; > ssize_t retval; > int error; > > + if (tso) { > + netdev_linux_prepend_vnet_hdr(packet, mtu); > + } > + > + size = dp_packet_size(packet); > do { > retval = write(netdev->tap_fd, dp_packet_data(packet), size); > error = retval < 0 ? errno : 0; > @@ -1454,9 +1595,15 @@ netdev_linux_send(struct netdev *netdev_, int qid > OVS_UNUSED, > struct dp_packet_batch *batch, > bool concurrent_txq OVS_UNUSED) > { > + bool tso = userspace_tso_enabled(); > + int mtu = ETH_PAYLOAD_MAX; > int error = 0; > int sock = 0; > > + if (tso) { > + netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu); netdev_linux_get_mtu__() could fail. This needs to be handled. 
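E.g. something along these lines (just an untested sketch; it assumes that falling back to the default ETH_PAYLOAD_MAX and warning is acceptable here):

    if (tso) {
        if (netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu)) {
            /* Could not get the MTU; keep the ETH_PAYLOAD_MAX default. */
            mtu = ETH_PAYLOAD_MAX;
            VLOG_WARN_RL(&rl, "%s: failed to get MTU, falling back to %d",
                         netdev_get_name(netdev_), mtu);
        }
    }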
> + } > + > if (!is_tap_netdev(netdev_)) { > if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) { > error = EOPNOTSUPP; > @@ -1475,9 +1622,9 @@ netdev_linux_send(struct netdev *netdev_, int qid > OVS_UNUSED, > goto free_batch; > } > > - error = netdev_linux_sock_batch_send(sock, ifindex, batch); > + error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch); > } else { > - error = netdev_linux_tap_batch_send(netdev_, batch); > + error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch); > } > if (error) { > if (error == ENOBUFS) { > @@ -2045,6 +2192,7 @@ netdev_tap_get_stats(const struct netdev *netdev_, > struct netdev_stats *stats) > stats->collisions += dev_stats.collisions; > } > stats->tx_dropped += netdev->tx_dropped; > + stats->rx_dropped += netdev->rx_dropped; > ovs_mutex_unlock(&netdev->mutex); > > return error; > @@ -6223,6 +6371,17 @@ af_packet_sock(void) > if (error) { > close(sock); > sock = -error; > + } else if (userspace_tso_enabled()) { > + int val = 1; > + error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val, > + sizeof val); > + if (error) { > + error = errno; > + VLOG_ERR("failed to enable vnet hdr in raw socket: %s", > + ovs_strerror(errno)); > + close(sock); > + sock = -error; > + } > } > } else { > sock = -errno; > @@ -6234,3 +6393,136 @@ af_packet_sock(void) > > return sock; > } > + > +static int > +netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto) I'm not generally a fan of parsing packets here on receive especially because we will parse it with miniflow_extract() later, but I'm not sure how to avoid that. > +{ > + struct eth_header *eth_hdr; > + ovs_be16 eth_type; > + int l2_len; > + > + eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN); > + if (!eth_hdr) { > + return -EINVAL; > + } > + > + l2_len = ETH_HEADER_LEN; > + eth_type = eth_hdr->eth_type; > + if (eth_type_vlan(eth_type)) { > + struct vlan_header *vlan = dp_packet_at(b, l2_len, VLAN_HEADER_LEN); > + > + if (!vlan) { > + return -EINVAL; > + } > + > + eth_type = vlan->vlan_next_type; > + l2_len += VLAN_HEADER_LEN; > + } > + > + if (eth_type == htons(ETH_TYPE_IP)) { > + struct ip_header *ip_hdr = dp_packet_at(b, l2_len, IP_HEADER_LEN); > + > + if (!ip_hdr) { > + return -EINVAL; > + } > + > + *l4proto = ip_hdr->ip_proto; > + dp_packet_hwol_set_tx_ipv4(b); > + } else if (eth_type == htons(ETH_TYPE_IPV6)) { > + struct ovs_16aligned_ip6_hdr *nh6; > + > + nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN); > + if (!nh6) { > + return -EINVAL; > + } > + > + *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt; > + dp_packet_hwol_set_tx_ipv6(b); > + } > + > + return 0; > +} > + > +static int > +netdev_linux_parse_vnet_hdr(struct dp_packet *b) > +{ > + struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet); > + uint16_t l4proto = 0; > + > + if (OVS_UNLIKELY(!vnet)) { > + return -EINVAL; > + } > + > + if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) { > + return 0; > + } > + > + if (netdev_linux_parse_l2(b, &l4proto)) { > + return -EINVAL; > + } > + > + if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) { > + if (l4proto == IPPROTO_TCP) { > + dp_packet_hwol_set_csum_tcp(b); > + } else if (l4proto == IPPROTO_UDP) { > + dp_packet_hwol_set_csum_udp(b); > + } else if (l4proto == IPPROTO_SCTP) { > + dp_packet_hwol_set_csum_sctp(b); > + } > + } > + > + if (l4proto && vnet->gso_type != VIRTIO_NET_HDR_GSO_NONE) { > + uint8_t allowed_mask = VIRTIO_NET_HDR_GSO_TCPV4 > + | VIRTIO_NET_HDR_GSO_TCPV6 > + | VIRTIO_NET_HDR_GSO_UDP; > + uint8_t type = vnet->gso_type & allowed_mask; > 
+ > + if (type == VIRTIO_NET_HDR_GSO_TCPV4 > + || type == VIRTIO_NET_HDR_GSO_TCPV6) { > + dp_packet_hwol_set_tcp_seg(b); > + } > + } > + > + return 0; > +} > + > +static void > +netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu) > +{ > + struct virtio_net_hdr *vnet = dp_packet_push_zeros(b, sizeof *vnet); > + > + if (dp_packet_hwol_is_tso(b)) { > + uint16_t hdr_len = ((char *)dp_packet_l4(b) - (char > *)dp_packet_eth(b)) > + + TCP_HEADER_LEN; > + > + vnet->hdr_len = (OVS_FORCE __virtio16)hdr_len; > + vnet->gso_size = (OVS_FORCE __virtio16)(mtu - hdr_len); > + if (dp_packet_hwol_is_ipv4(b)) { > + vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4; > + } else { > + vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV6; > + } > + > + } else { > + vnet->flags = VIRTIO_NET_HDR_GSO_NONE; > + } > + > + if (dp_packet_hwol_l4_mask(b)) { > + vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; > + vnet->csum_start = (OVS_FORCE __virtio16)((char *)dp_packet_l4(b) > + - (char > *)dp_packet_eth(b)); > + > + if (dp_packet_hwol_l4_is_tcp(b)) { > + vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof( > + struct tcp_header, tcp_csum); > + } else if (dp_packet_hwol_l4_is_udp(b)) { > + vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof( > + struct udp_header, udp_csum); > + } else if (dp_packet_hwol_l4_is_sctp(b)) { > + vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof( > + struct sctp_header, sctp_csum); > + } else { > + VLOG_WARN_RL(&rl, "Unsupported L4 protocol"); > + } > + } > +} > diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h > index f109c4e66..22f4cde33 100644 > --- a/lib/netdev-provider.h > +++ b/lib/netdev-provider.h > @@ -37,6 +37,12 @@ extern "C" { > struct netdev_tnl_build_header_params; > #define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC > > +enum netdev_ol_flags { > + NETDEV_TX_OFFLOAD_IPV4_CKSUM = 1 << 0, > + NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1, > + NETDEV_TX_OFFLOAD_TCP_TSO = 1 << 2, > +}; > + > /* A network device (e.g. an Ethernet device). > * > * Network device implementations may read these members but should not > modify > @@ -51,6 +57,9 @@ struct netdev { > * opening this device, and therefore got assigned to the "system" class > */ > bool auto_classified; > > + /* This bitmask of the offloading features enabled by the netdev. */ > + uint64_t ol_flags; > + > /* If this is 'true', the user explicitly specified an MTU for this > * netdev. Otherwise, Open vSwitch is allowed to override it. */ > bool mtu_user_config; > diff --git a/lib/netdev.c b/lib/netdev.c > index 405c98c68..f95b19af4 100644 > --- a/lib/netdev.c > +++ b/lib/netdev.c > @@ -66,6 +66,8 @@ COVERAGE_DEFINE(netdev_received); > COVERAGE_DEFINE(netdev_sent); > COVERAGE_DEFINE(netdev_add_router); > COVERAGE_DEFINE(netdev_get_stats); > +COVERAGE_DEFINE(netdev_send_prepare_drops); > +COVERAGE_DEFINE(netdev_push_header_drops); > > struct netdev_saved_flags { > struct netdev *netdev; > @@ -782,6 +784,54 @@ netdev_get_pt_mode(const struct netdev *netdev) > : NETDEV_PT_LEGACY_L2); > } > > +/* Check if a 'packet' is compatible with 'netdev_flags'. > + * If a packet is incompatible, return 'false' with the 'errormsg' > + * pointing to a reason. */ > +static bool > +netdev_send_prepare_packet(const uint64_t netdev_flags, > + struct dp_packet *packet, char **errormsg) > +{ > + if (dp_packet_hwol_is_tso(packet) > + && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) { > + /* Fall back to GSO in software. 
*/ > + VLOG_ERR_BUF(errormsg, "No TSO support"); > + return false; > + } > + > + if (dp_packet_hwol_l4_mask(packet) > + && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) { > + /* Fall back to L4 csum in software. */ > + VLOG_ERR_BUF(errormsg, "No L4 checksum support"); > + return false; > + } > + > + return true; > +} > + > +/* Check if each packet in 'batch' is compatible with 'netdev' features, > + * otherwise either fall back to software implementation or drop it. */ > +static void > +netdev_send_prepare_batch(const struct netdev *netdev, > + struct dp_packet_batch *batch) > +{ > + struct dp_packet *packet; > + size_t i, size = dp_packet_batch_size(batch); > + > + DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) { > + char *errormsg = NULL; > + > + if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) > { > + dp_packet_batch_refill(batch, packet, i); > + } else { > + dp_packet_delete(packet); > + COVERAGE_INC(netdev_send_prepare_drops); > + VLOG_WARN_RL(&rl, "%s: Packet dropped: %s", > + netdev_get_name(netdev), errormsg); > + free(errormsg); > + } > + } > +} > + > /* Sends 'batch' on 'netdev'. Returns 0 if successful (for every packet), > * otherwise a positive errno value. Returns EAGAIN without blocking if > * at least one the packets cannot be queued immediately. Returns EMSGSIZE > @@ -811,8 +861,14 @@ int > netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch, > bool concurrent_txq) > { > - int error = netdev->netdev_class->send(netdev, qid, batch, > - concurrent_txq); > + int error; > + > + netdev_send_prepare_batch(netdev, batch); > + if (OVS_UNLIKELY(dp_packet_batch_is_empty(batch))) { > + return 0; > + } > + > + error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq); > if (!error) { > COVERAGE_INC(netdev_sent); > } > @@ -878,9 +934,21 @@ netdev_push_header(const struct netdev *netdev, > const struct ovs_action_push_tnl *data) > { > struct dp_packet *packet; > - DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { > - netdev->netdev_class->push_header(netdev, packet, data); > - pkt_metadata_init(&packet->md, data->out_port); > + size_t i, size = dp_packet_batch_size(batch); > + > + DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) { > + if (OVS_UNLIKELY(dp_packet_hwol_is_tso(packet) > + || dp_packet_hwol_l4_mask(packet))) { > + COVERAGE_INC(netdev_push_header_drops); > + dp_packet_delete(packet); > + VLOG_WARN_RL(&rl, "%s: Tunneling packets with HW offload flags > is " > + "not supported: packet dropped", > + netdev_get_name(netdev)); > + } else { > + netdev->netdev_class->push_header(netdev, packet, data); > + pkt_metadata_init(&packet->md, data->out_port); > + dp_packet_batch_refill(batch, packet, i); > + } > } > > return 0; > diff --git a/lib/userspace-tso.c b/lib/userspace-tso.c > new file mode 100644 > index 000000000..f843c2a76 > --- /dev/null > +++ b/lib/userspace-tso.c > @@ -0,0 +1,48 @@ > +/* > + * Copyright (c) 2020 Red Hat, Inc. > + * > + * Licensed under the Apache License, Version 2.0 (the "License"); > + * you may not use this file except in compliance with the License. > + * You may obtain a copy of the License at: > + * > + * http://www.apache.org/licenses/LICENSE-2.0 > + * > + * Unless required by applicable law or agreed to in writing, software > + * distributed under the License is distributed on an "AS IS" BASIS, > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > + * See the License for the specific language governing permissions and > + * limitations under the License. 
> + */ > + > +#include <config.h> > + > +#include "smap.h" > +#include "ovs-thread.h" > +#include "openvswitch/vlog.h" > +#include "dpdk.h" > +#include "userspace-tso.h" > +#include "vswitch-idl.h" > + > +VLOG_DEFINE_THIS_MODULE(userspace_tso); > + > +static bool userspace_tso = false; > + > +void > +userspace_tso_init(const struct smap *ovs_other_config) > +{ > + if (smap_get_bool(ovs_other_config, "userspace-tso-enable", false)) { > + static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER; > + > + if (ovsthread_once_start(&once)) { > + VLOG_INFO("Userspace TCP Segmentation Offloading support > enabled"); > + userspace_tso = true; Since dp_packet functions has no implementation if OVS built without DPDK support, I think, we need to restrict enabling the functionality. #ifdef DPDK_NETDEV VLOG_INFO("Userspace TCP Segmentation Offloading support enabled"); userspace_tso = true; #else VLOG_WARN("Userspace TCP Segmentation Offloading can not be enabled" "since OVS built without DPDK support."); #endif > + ovsthread_once_done(&once); > + } > + } > +} > + > +bool > +userspace_tso_enabled(void) > +{ > + return userspace_tso; > +} > diff --git a/lib/userspace-tso.h b/lib/userspace-tso.h > new file mode 100644 > index 000000000..0758274c0 > --- /dev/null > +++ b/lib/userspace-tso.h > @@ -0,0 +1,23 @@ > +/* > + * Copyright (c) 2020 Red Hat Inc. > + * > + * Licensed under the Apache License, Version 2.0 (the "License"); > + * you may not use this file except in compliance with the License. > + * You may obtain a copy of the License at: > + * > + * http://www.apache.org/licenses/LICENSE-2.0 > + * > + * Unless required by applicable law or agreed to in writing, software > + * distributed under the License is distributed on an "AS IS" BASIS, > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > + * See the License for the specific language governing permissions and > + * limitations under the License. > + */ > + > +#ifndef USERSPACE_TSO_H > +#define USERSPACE_TSO_H 1 > + > +void userspace_tso_init(const struct smap *ovs_other_config); > +bool userspace_tso_enabled(void); > + > +#endif /* userspace-tso.h */ > diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c > index 86c7b10a9..e591c26a6 100644 > --- a/vswitchd/bridge.c > +++ b/vswitchd/bridge.c > @@ -65,6 +65,7 @@ > #include "system-stats.h" > #include "timeval.h" > #include "tnl-ports.h" > +#include "userspace-tso.h" > #include "util.h" > #include "unixctl.h" > #include "lib/vswitch-idl.h" > @@ -3285,6 +3286,7 @@ bridge_run(void) > if (cfg) { > netdev_set_flow_api_enabled(&cfg->other_config); > dpdk_init(&cfg->other_config); > + userspace_tso_init(&cfg->other_config); > } > > /* Initialize the ofproto library. This only needs to run once, but > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml > index c43cb1aa4..a9efe71a5 100644 > --- a/vswitchd/vswitch.xml > +++ b/vswitchd/vswitch.xml > @@ -690,6 +690,23 @@ > once in few hours or a day or a week. > </p> > </column> > + <column name="other_config" key="userspace-tso-enable" > + type='{"type": "boolean"}'> > + <p> > + Set this value to <code>true</code> to enable userspace support for > + TCP Segmentation Offloading (TSO). When it is enabled, the > interfaces > + can provide an oversized TCP segment to the datapath and the > datapath > + will offload the TCP segmentation and checksum calculation to the > + interfaces when necessary. > + </p> > + <p> > + The default value is <code>false</code>. Changing this value > requires > + restarting the daemon. 
Works only if OVS is built with DPDK support. > + </p> > + <p> > + The feature is considered experimental. > + </p> > + </column> > </group> > <group title="Status"> > <column name="next_cfg">
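One note for anyone not familiar with the kernel side of the af_packet_sock() change: once PACKET_VNET_HDR is enabled on the socket, every frame read from or written to it is prefixed with a struct virtio_net_hdr describing the offload state, which is what netdev_linux_parse_vnet_hdr() and netdev_linux_prepend_vnet_hdr() have to deal with. Below is a minimal standalone sketch of that kernel interface; it is my own example (needs CAP_NET_RAW, and the ETH_P_ALL protocol is only for the illustration), not code from this patch:

    /* Sketch: enable the virtio_net_hdr prefix on a raw AF_PACKET socket. */
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <linux/virtio_net.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* Raw L2 socket; ETH_P_ALL is just for this example. */
        int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (sock < 0) {
            perror("socket(AF_PACKET)");
            return 1;
        }

        /* Ask the kernel to prepend/accept a virtio_net_hdr on every packet. */
        int val = 1;
        if (setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val, sizeof val)) {
            perror("setsockopt(PACKET_VNET_HDR)");
            close(sock);
            return 1;
        }

        /* From here on, received data starts with the 10-byte header and
         * transmitted data must be prefixed with one. */
        printf("struct virtio_net_hdr is %zu bytes\n",
               sizeof(struct virtio_net_hdr));
        close(sock);
        return 0;
    }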

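Also, while reading netdev_linux_prepend_vnet_hdr() I wrote down the values it should produce for the simplest case, to convince myself about the offsets. Sharing it in case it helps; this is only a sketch of mine assuming an untagged IPv4/TCP packet with no options and the usual header sizes (tcp_header_sketch below just mirrors the tcp_header layout for offsetof), not something I verified against a real device:

    /* Sketch: expected virtio_net_hdr fields for an untagged IPv4 TCP
     * packet, MTU 1500, no IP or TCP options. */
    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Mirrors the layout of struct tcp_header; used only for offsetof. */
    struct tcp_header_sketch {
        unsigned short tcp_src, tcp_dst;
        unsigned int tcp_seq, tcp_ack;
        unsigned short tcp_ctl, tcp_winsz;
        unsigned short tcp_csum, tcp_urg;
    };

    int main(void)
    {
        const int eth_len = 14, ip_len = 20, tcp_len = 20, mtu = 1500;

        int csum_start = eth_len + ip_len;                               /* 34 */
        int csum_offset = offsetof(struct tcp_header_sketch, tcp_csum);  /* 16 */
        int hdr_len = csum_start + tcp_len;                              /* 54 */
        int gso_size = mtu - hdr_len;          /* 1446 with the patch's formula */

        assert(csum_offset == 16);
        printf("csum_start=%d csum_offset=%d hdr_len=%d gso_size=%d\n",
               csum_start, csum_offset, hdr_len, gso_size);
        return 0;
    }

If I read the formula right, gso_size comes out 14 bytes smaller than the usual 1460 MSS for a 1500 byte MTU because hdr_len includes the Ethernet header; I didn't check whether that is intentional or whether it matters in practice.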