Thanks all for review/testing, pushed to master. Regards Ian
-----Original Message----- From: dev <[email protected]> On Behalf Of Stokes, Ian Sent: Friday, January 17, 2020 10:56 PM To: Flavio Leitner <[email protected]>; [email protected] Cc: Ilya Maximets <[email protected]>; txfh2007 <[email protected]> Subject: Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support On 1/17/2020 9:54 PM, Stokes, Ian wrote: > > > On 1/17/2020 9:47 PM, Flavio Leitner wrote: >> Abbreviated as TSO, TCP Segmentation Offload is a feature which enables >> the network stack to delegate the TCP segmentation to the NIC, reducing >> the per-packet CPU overhead. >> >> A guest using a vhostuser interface with TSO enabled can send TCP packets >> much bigger than the MTU, which saves CPU cycles normally used to break >> the packets down to MTU size and to calculate checksums. >> >> It also saves CPU cycles used to parse multiple packets/headers during >> the packet processing inside the virtual switch. >> >> If the destination of the packet is another guest on the same host, then >> the same big packet can be sent through a vhostuser interface, skipping >> the segmentation completely. However, if the destination is not local, >> the NIC hardware is instructed to do the TCP segmentation and checksum >> calculation. >> >> It is recommended to check if the NIC hardware supports TSO before enabling >> the feature, which is off by default. For additional information please >> check the userspace-tso.rst document. >> >> Signed-off-by: Flavio Leitner <[email protected]> > > Fantastic work here Flavio, quick turnaround when needed. > > Acked Are there any objections to merging this? There's been nothing so far. If there are no further objections I will merge this at the end of the hour. BR Ian > > BR > Ian >> --- >> Documentation/automake.mk | 1 + >> Documentation/topics/index.rst | 1 + >> Documentation/topics/userspace-tso.rst | 98 +++++++ >> NEWS | 1 + >> lib/automake.mk | 2 + >> lib/conntrack.c | 29 +- >> lib/dp-packet.h | 176 ++++++++++- >> lib/ipf.c | 32 +- >> lib/netdev-dpdk.c | 348 +++++++++++++++++++--- >> lib/netdev-linux-private.h | 5 + >> lib/netdev-linux.c | 386 ++++++++++++++++++++++--- >> lib/netdev-provider.h | 9 + >> lib/netdev.c | 78 ++++- >> lib/userspace-tso.c | 53 ++++ >> lib/userspace-tso.h | 23 ++ >> vswitchd/bridge.c | 2 + >> vswitchd/vswitch.xml | 20 ++ >> 17 files changed, 1140 insertions(+), 124 deletions(-) >> create mode 100644 Documentation/topics/userspace-tso.rst >> create mode 100644 lib/userspace-tso.c >> create mode 100644 lib/userspace-tso.h >> >> Testing: >> - Travis, Cirrus, AppVeyor, testsuite passed OK. >> - noticed no changes since v4 with regard to performance. 
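As a quick functional check, the feature can be exercised end to end using the commands from the userspace-tso.rst document added by this patch (the interface name eth0 and the use of iperf3 between two vhost-user guests on the same host are illustrative assumptions, not requirements of the patch):

    # On the host; changing this option requires restarting the daemon.
    $ ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true

    # Inside each guest; scatter-gather is a prerequisite for TSO.
    $ ethtool -K eth0 sg on
    $ ethtool -K eth0 tso on
    $ ethtool -k eth0 | grep tcp-segmentation-offload

    # Any bulk TCP transfer between the two guests then exercises the path,
    # e.g. iperf3 -s in one guest and iperf3 -c <server address> in the
    # other, comparing CPU usage with the option set to false.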
>> >> Changelog: >> - v5 >> * rebased on top of master (NEWS conflict) >> * added missing periods at the end of comments >> * mention DPDK requirement at vswitch.xml >> * restricted tso feature to OvS built with dpdk >> * headers in alphabetical order >> * removed unneeded call to initialize pkt >> * used OVS_UNLIKELY instead of unlikely >> * removed parenthesis from sizeof() >> * removed blank line at dp_packet_hwol_tx_l4_checksum() >> * removed redundant dp_packet_hwol_tx_ipv4_checksum() >> * updated function comments as suggested >> >> - v4 >> * rebased on top of master (recvmmsg) >> * fixed URL in doc to point to 19.11 >> * renamed tso to userspace-tso >> * renamed the option to userspace-tso-enable >> * removed prototype left over from v2 >> * fixed function style declaration >> * renamed dp_packet_hwol_tx_ip_checksum to >> dp_packet_hwol_tx_ipv4_checksum >> * dp_packet_hwol_tx_ipv4_checksum now checks for PKT_TX_IPV4. >> * account for drops while prepping the batch for TX. >> * don't prep the batch for TX if TSO is disabled. >> * simplified setsockopt error checking >> * fixed af_packet_sock error checking to not call setsockopt on >> closed sockets. >> * fixed ol_flags comment. >> * used VLOG_ERR_BUF() to pass error messages. >> * fixed packet leak at netdev_send_prepare_batch() >> * added a coverage counter to account drops while preparing a batch >> at netdev.c >> * fixed netdev_send() to not call ->send() if the batch is empty. >> * fixed packet leak at netdev_push_header and account for the drops. >> * removed DPDK requirement to enable userspace TSO support. >> * fixed parameter documentation in vswitch.xml. >> * renamed tso.rst to userspace-tso.rst and moved to topics/ >> * added comments documenting the functions in dp-packet.h >> * fixed dp_packet_hwol_is_tso to check only PKT_TX_TCP_SEG >> >> - v3 >> * Improved the documentation. >> * Updated copyright year to 2020. >> * TSO offloaded msg now includes the netdev's name. >> * Added period at the end of all code comments. >> * Warn and drop encapsulation of TSO packets. >> * Fixed Travis issue with restricted virtio types. >> * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf() >> which caused packet corruption. >> * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set >> PKT_TX_IP_CKSUM only for IPv4 packets. >> >> diff --git a/Documentation/automake.mk b/Documentation/automake.mk >> index f2ca17bad..22976a3cd 100644 >> --- a/Documentation/automake.mk >> +++ b/Documentation/automake.mk >> @@ -57,6 +57,7 @@ DOC_SOURCE = \ >> Documentation/topics/ovsdb-replication.rst \ >> Documentation/topics/porting.rst \ >> Documentation/topics/tracing.rst \ >> + Documentation/topics/userspace-tso.rst \ >> Documentation/topics/windows.rst \ >> Documentation/howto/index.rst \ >> Documentation/howto/dpdk.rst \ >> diff --git a/Documentation/topics/index.rst >> b/Documentation/topics/index.rst >> index 34c4b10e0..08af3a24d 100644 >> --- a/Documentation/topics/index.rst >> +++ b/Documentation/topics/index.rst >> @@ -50,5 +50,6 @@ OVS >> language-bindings >> testing >> tracing >> + userspace-tso >> idl-compound-indexes >> ovs-extensions >> diff --git a/Documentation/topics/userspace-tso.rst >> b/Documentation/topics/userspace-tso.rst >> new file mode 100644 >> index 000000000..893c64839 >> --- /dev/null >> +++ b/Documentation/topics/userspace-tso.rst >> @@ -0,0 +1,98 @@ >> +.. >> + Copyright 2020, Red Hat, Inc. 
>> + >> + Licensed under the Apache License, Version 2.0 (the "License"); >> you may >> + not use this file except in compliance with the License. You >> may obtain >> + a copy of the License at >> + >> + http://www.apache.org/licenses/LICENSE-2.0 >> + >> + Unless required by applicable law or agreed to in writing, >> software >> + distributed under the License is distributed on an "AS IS" >> BASIS, WITHOUT >> + WARRANTIES OR CONDITIONS OF ANY KIND, either express or >> implied. See the >> + License for the specific language governing permissions and >> limitations >> + under the License. >> + >> + Convention for heading levels in Open vSwitch documentation: >> + >> + ======= Heading 0 (reserved for the title in a document) >> + ------- Heading 1 >> + ~~~~~~~ Heading 2 >> + +++++++ Heading 3 >> + ''''''' Heading 4 >> + >> + Avoid deeper levels because they do not render well. >> + >> +======================== >> +Userspace Datapath - TSO >> +======================== >> + >> +**Note:** This feature is considered experimental. >> + >> +TCP Segmentation Offload (TSO) enables a network stack to delegate >> segmentation >> +of an oversized TCP segment to the underlying physical NIC. Offload >> of frame >> +segmentation achieves computational savings in the core, freeing up >> CPU cycles >> +for more useful work. >> + >> +A common use case for TSO is when using virtualization, where traffic >> that's >> +coming in from a VM can offload the TCP segmentation, thus avoiding the >> +fragmentation in software. Additionally, if the traffic is headed to >> a VM >> +within the same host, further optimization can be expected. As the >> traffic never >> +leaves the machine, no MTU needs to be accounted for, and thus no >> segmentation >> +and checksum calculations are required, which saves yet more cycles. >> Only when >> +the traffic actually leaves the host does the segmentation need to >> happen, in which >> +case it will be performed by the egress NIC. First, consult your >> controller's >> +datasheet for compatibility. Second, the NIC must have an >> associated DPDK >> +Poll Mode Driver (PMD) which supports `TSO`. For a list of features >> per PMD, >> +refer to the `DPDK documentation`__. >> + >> +__ https://doc.dpdk.org/guides-19.11/nics/overview.html >> + >> +Enabling TSO >> +~~~~~~~~~~~~ >> + >> +TSO support may be enabled via a global config value >> +``userspace-tso-enable``. Setting this to ``true`` enables TSO >> support for >> +all ports. >> + >> + $ ovs-vsctl set Open_vSwitch . >> other_config:userspace-tso-enable=true >> + >> +The default value is ``false``. >> + >> +Changing ``userspace-tso-enable`` requires restarting the daemon. >> + >> +When using :doc:`vHost User ports <dpdk/vhost-user>`, TSO may be enabled >> +as follows. >> + >> +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest >> +connection is established, `TSO` is thus advertised to the guest as an >> +available feature: >> + >> +1. QEMU Command Line Parameter:: >> + >> + $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \ >> + ... >> + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\ >> + csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\ >> + ... >> + >> +2. Ethtool. 
Assuming that the guest's OS also supports `TSO`, ethtool >> can be >> +used to enable the same:: >> + >> + $ ethtool -K eth0 sg on # scatter-gather is a prerequisite >> for TSO >> + $ ethtool -K eth0 tso on >> + $ ethtool -k eth0 >> + >> +Limitations >> +~~~~~~~~~~~ >> + >> +The current OvS userspace `TSO` implementation supports flat and VLAN >> networks >> +only (i.e. no support for `TSO` over tunneled connections [VxLAN, GRE, >> IPinIP, >> +etc.]). >> + >> +There is no software implementation of TSO, so all ports attached to the >> +datapath must support TSO or packets using that feature will be dropped >> +on ports without TSO support. That also means guests using vhost-user >> +in client mode will receive TSO packets regardless of TSO being enabled >> +or disabled within the guest. >> diff --git a/NEWS b/NEWS >> index 579e91c89..c6d3b6053 100644 >> --- a/NEWS >> +++ b/NEWS >> @@ -30,6 +30,7 @@ Post-v2.12.0 >> * Add support for DPDK 19.11. >> * Add hardware offload support for output, drop, set of MAC, >> IPv4 and >> TCP/UDP ports actions (experimental). >> + * Add experimental support for TSO. >> - RSTP: >> * The rstp_statistics column in Port table will only be updated >> every >> stats-update-interval configured in Open_vSwitch table. >> diff --git a/lib/automake.mk b/lib/automake.mk >> index ebf714501..95925b57c 100644 >> --- a/lib/automake.mk >> +++ b/lib/automake.mk >> @@ -314,6 +314,8 @@ lib_libopenvswitch_la_SOURCES = \ >> lib/unicode.h \ >> lib/unixctl.c \ >> lib/unixctl.h \ >> + lib/userspace-tso.c \ >> + lib/userspace-tso.h \ >> lib/util.c \ >> lib/util.h \ >> lib/uuid.c \ >> diff --git a/lib/conntrack.c b/lib/conntrack.c >> index b80080e72..60222ca53 100644 >> --- a/lib/conntrack.c >> +++ b/lib/conntrack.c >> @@ -2022,7 +2022,8 @@ conn_key_extract(struct conntrack *ct, struct >> dp_packet *pkt, ovs_be16 dl_type, >> if (hwol_bad_l3_csum) { >> ok = false; >> } else { >> - bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt); >> + bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt) >> + || dp_packet_hwol_is_ipv4(pkt); >> /* Validate the checksum only when hwol is not >> supported. */ >> ok = extract_l3_ipv4(&ctx->key, l3, >> dp_packet_l3_size(pkt), NULL, >> !hwol_good_l3_csum); >> @@ -2036,7 +2037,8 @@ conn_key_extract(struct conntrack *ct, struct >> dp_packet *pkt, ovs_be16 dl_type, >> if (ok) { >> bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt); >> if (!hwol_bad_l4_csum) { >> - bool hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt); >> + bool hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt) >> + || >> dp_packet_hwol_tx_l4_checksum(pkt); >> /* Validate the checksum only when hwol is not >> supported. 
*/ >> if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt), >> &ctx->icmp_related, l3, !hwol_good_l4_csum, >> @@ -3237,8 +3239,11 @@ handle_ftp_ctl(struct conntrack *ct, const >> struct conn_lookup_ctx *ctx, >> } >> if (seq_skew) { >> ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew; >> - l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum, >> - l3_hdr->ip_tot_len, >> htons(ip_len)); >> + if (!dp_packet_hwol_is_ipv4(pkt)) { >> + l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum, >> + >> l3_hdr->ip_tot_len, >> + htons(ip_len)); >> + } >> l3_hdr->ip_tot_len = htons(ip_len); >> } >> } >> @@ -3256,13 +3261,15 @@ handle_ftp_ctl(struct conntrack *ct, const >> struct conn_lookup_ctx *ctx, >> } >> th->tcp_csum = 0; >> - if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) { >> - th->tcp_csum = packet_csum_upperlayer6(nh6, th, >> ctx->key.nw_proto, >> - dp_packet_l4_size(pkt)); >> - } else { >> - uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr); >> - th->tcp_csum = csum_finish( >> - csum_continue(tcp_csum, th, dp_packet_l4_size(pkt))); >> + if (!dp_packet_hwol_tx_l4_checksum(pkt)) { >> + if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) { >> + th->tcp_csum = packet_csum_upperlayer6(nh6, th, >> ctx->key.nw_proto, >> + dp_packet_l4_size(pkt)); >> + } else { >> + uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr); >> + th->tcp_csum = csum_finish( >> + csum_continue(tcp_csum, th, dp_packet_l4_size(pkt))); >> + } >> } >> if (seq_skew) { >> diff --git a/lib/dp-packet.h b/lib/dp-packet.h >> index 133942155..69ae5dfac 100644 >> --- a/lib/dp-packet.h >> +++ b/lib/dp-packet.h >> @@ -456,7 +456,7 @@ dp_packet_init_specific(struct dp_packet *p) >> { >> /* This initialization is needed for packets that do not come >> from DPDK >> * interfaces, when vswitchd is built with --with-dpdk. */ >> - p->mbuf.tx_offload = p->mbuf.packet_type = 0; >> + p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0; >> p->mbuf.nb_segs = 1; >> p->mbuf.next = NULL; >> } >> @@ -519,6 +519,95 @@ dp_packet_set_allocated(struct dp_packet *b, >> uint16_t s) >> b->mbuf.buf_len = s; >> } >> +/* Returns 'true' if packet 'b' is marked for TCP segmentation >> offloading. */ >> +static inline bool >> +dp_packet_hwol_is_tso(const struct dp_packet *b) >> +{ >> + return !!(b->mbuf.ol_flags & PKT_TX_TCP_SEG); >> +} >> + >> +/* Returns 'true' if packet 'b' is marked for IPv4 checksum >> offloading. */ >> +static inline bool >> +dp_packet_hwol_is_ipv4(const struct dp_packet *b) >> +{ >> + return !!(b->mbuf.ol_flags & PKT_TX_IPV4); >> +} >> + >> +/* Returns the L4 cksum offload bitmask. */ >> +static inline uint64_t >> +dp_packet_hwol_l4_mask(const struct dp_packet *b) >> +{ >> + return b->mbuf.ol_flags & PKT_TX_L4_MASK; >> +} >> + >> +/* Returns 'true' if packet 'b' is marked for TCP checksum >> offloading. */ >> +static inline bool >> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b) >> +{ >> + return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM; >> +} >> + >> +/* Returns 'true' if packet 'b' is marked for UDP checksum >> offloading. */ >> +static inline bool >> +dp_packet_hwol_l4_is_udp(struct dp_packet *b) >> +{ >> + return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM; >> +} >> + >> +/* Returns 'true' if packet 'b' is marked for SCTP checksum >> offloading. */ >> +static inline bool >> +dp_packet_hwol_l4_is_sctp(struct dp_packet *b) >> +{ >> + return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_SCTP_CKSUM; >> +} >> + >> +/* Mark packet 'b' for IPv4 checksum offloading. 
*/ >> +static inline void >> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b) >> +{ >> + b->mbuf.ol_flags |= PKT_TX_IPV4; >> +} >> + >> +/* Mark packet 'b' for IPv6 checksum offloading. */ >> +static inline void >> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b) >> +{ >> + b->mbuf.ol_flags |= PKT_TX_IPV6; >> +} >> + >> +/* Mark packet 'b' for TCP checksum offloading. It implies that >> + * the packet 'b' is also marked for either IPv4 or IPv6 checksum offloading. */ >> +static inline void >> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b) >> +{ >> + b->mbuf.ol_flags |= PKT_TX_TCP_CKSUM; >> +} >> + >> +/* Mark packet 'b' for UDP checksum offloading. It implies that >> + * the packet 'b' is also marked for either IPv4 or IPv6 checksum offloading. */ >> +static inline void >> +dp_packet_hwol_set_csum_udp(struct dp_packet *b) >> +{ >> + b->mbuf.ol_flags |= PKT_TX_UDP_CKSUM; >> +} >> + >> +/* Mark packet 'b' for SCTP checksum offloading. It implies that >> + * the packet 'b' is also marked for either IPv4 or IPv6 checksum offloading. */ >> +static inline void >> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b) >> +{ >> + b->mbuf.ol_flags |= PKT_TX_SCTP_CKSUM; >> +} >> + >> +/* Mark packet 'b' for TCP segmentation offloading. It implies that >> + * the packet 'b' is marked for either IPv4 or IPv6 checksum offloading, >> + * and also for TCP checksum offloading. */ >> +static inline void >> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b) >> +{ >> + b->mbuf.ol_flags |= PKT_TX_TCP_SEG; >> +} >> + >> /* Returns the RSS hash of the packet 'p'. Note that the returned >> value is >> * correct only if 'dp_packet_rss_valid(p)' returns true */ >> static inline uint32_t >> @@ -648,6 +737,84 @@ dp_packet_set_allocated(struct dp_packet *b, >> uint16_t s) >> b->allocated_ = s; >> } >> +/* There is no implementation when the datapath is not DPDK enabled. */ >> +static inline bool >> +dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED) >> +{ >> + return false; >> +} >> + >> +/* There is no implementation when the datapath is not DPDK enabled. */ >> +static inline bool >> +dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED) >> +{ >> + return false; >> +} >> + >> +/* There is no implementation when the datapath is not DPDK enabled. */ >> +static inline uint64_t >> +dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED) >> +{ >> + return 0; >> +} >> + >> +/* There is no implementation when the datapath is not DPDK enabled. */ >> +static inline bool >> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED) >> +{ >> + return false; >> +} >> + >> +/* There is no implementation when the datapath is not DPDK enabled. */ >> +static inline bool >> +dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED) >> +{ >> + return false; >> +} >> + >> +/* There is no implementation when the datapath is not DPDK enabled. */ >> +static inline bool >> +dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED) >> +{ >> + return false; >> +} >> + >> +/* There is no implementation when the datapath is not DPDK enabled. */ >> +static inline void >> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b OVS_UNUSED) >> +{ >> +} >> + >> +/* There is no implementation when the datapath is not DPDK enabled. */ >> +static inline void >> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b OVS_UNUSED) >> +{ >> +} >> + >> +/* There is no implementation when the datapath is not DPDK enabled. */ >> +static inline void >> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b OVS_UNUSED) >> +{ >> +} >> + >> +/* There is no implementation when the datapath is not DPDK enabled. */ >> +static inline void >> +dp_packet_hwol_set_csum_udp(struct dp_packet *b OVS_UNUSED) >> +{ >> +} >> + >> +/* There is no implementation when the datapath is not DPDK enabled. */ >> +static inline void >> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b OVS_UNUSED) >> +{ >> +} >> + >> +/* There is no implementation when the datapath is not DPDK enabled. */ >> +static inline void >> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b OVS_UNUSED) >> +{ >> +} >> + >> /* Returns the RSS hash of the packet 'p'. Note that the returned >> value is >> * correct only if 'dp_packet_rss_valid(p)' returns true */ >> static inline uint32_t >> @@ -939,6 +1106,13 @@ dp_packet_batch_reset_cutlen(struct >> dp_packet_batch *batch) >> } >> } >> +/* Return true if the packet 'b' requested L4 checksum offload. */ >> +static inline bool >> +dp_packet_hwol_tx_l4_checksum(const struct dp_packet *b) >> +{ >> + return !!dp_packet_hwol_l4_mask(b); >> +} >> + >> #ifdef __cplusplus >> } >> #endif >> diff --git a/lib/ipf.c b/lib/ipf.c >> index 45c489122..446e89d13 100644 >> --- a/lib/ipf.c >> +++ b/lib/ipf.c >> @@ -433,9 +433,11 @@ ipf_reassemble_v4_frags(struct ipf_list *ipf_list) >> len += rest_len; >> l3 = dp_packet_l3(pkt); >> ovs_be16 new_ip_frag_off = l3->ip_frag_off & >> ~htons(IP_MORE_FRAGMENTS); >> - l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off, >> - new_ip_frag_off); >> - l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, >> htons(len)); >> + if (!dp_packet_hwol_is_ipv4(pkt)) { >> + l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off, >> + new_ip_frag_off); >> + l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, >> htons(len)); >> + } >> l3->ip_tot_len = htons(len); >> l3->ip_frag_off = new_ip_frag_off; >> dp_packet_set_l2_pad_size(pkt, 0); >> @@ -606,6 +608,7 @@ ipf_is_valid_v4_frag(struct ipf *ipf, struct >> dp_packet *pkt) >> } >> if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt) >> + && !dp_packet_hwol_is_ipv4(pkt) >> && csum(l3, ip_hdr_len) != 0)) { >> goto invalid_pkt; >> } >> @@ -1181,16 +1184,21 @@ ipf_post_execute_reass_pkts(struct ipf *ipf, >> } else { >> struct ip_header *l3_frag = >> dp_packet_l3(frag_0->pkt); >> struct ip_header *l3_reass = dp_packet_l3(pkt); >> - ovs_be32 reass_ip = >> get_16aligned_be32(&l3_reass->ip_src); >> - ovs_be32 frag_ip = >> get_16aligned_be32(&l3_frag->ip_src); >> - l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum, >> - frag_ip, reass_ip); >> - l3_frag->ip_src = l3_reass->ip_src; >> + if (!dp_packet_hwol_is_ipv4(frag_0->pkt)) { >> + ovs_be32 reass_ip = >> + get_16aligned_be32(&l3_reass->ip_src); >> + ovs_be32 frag_ip = >> + get_16aligned_be32(&l3_frag->ip_src); >> + >> + l3_frag->ip_csum = >> recalc_csum32(l3_frag->ip_csum, >> + frag_ip, >> reass_ip); >> + reass_ip = >> get_16aligned_be32(&l3_reass->ip_dst); >> + frag_ip = get_16aligned_be32(&l3_frag->ip_dst); >> + l3_frag->ip_csum = >> recalc_csum32(l3_frag->ip_csum, >> + frag_ip, >> reass_ip); >> + } >> - reass_ip = get_16aligned_be32(&l3_reass->ip_dst); >> - frag_ip = get_16aligned_be32(&l3_frag->ip_dst); >> - l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum, >> - frag_ip, reass_ip); >> + l3_frag->ip_src = l3_reass->ip_src; >> l3_frag->ip_dst = l3_reass->ip_dst; >> } >> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c >> index d1469f6f2..b108cbd6b 100644 >> --- a/lib/netdev-dpdk.c >> +++ b/lib/netdev-dpdk.c >> @@ -72,6 +72,7 @@ >> #include "timeval.h" >> #include "unaligned.h" >> #include "unixctl.h" >> +#include "userspace-tso.h" >> #include "util.h" >> #include "uuid.h" >> @@ -201,6 +202,8 
@@ struct netdev_dpdk_sw_stats { >> uint64_t tx_qos_drops; >> /* Packet drops in ingress policer processing. */ >> uint64_t rx_qos_drops; >> + /* Packet drops in HWOL processing. */ >> + uint64_t tx_invalid_hwol_drops; >> }; >> enum { DPDK_RING_SIZE = 256 }; >> @@ -410,7 +413,8 @@ struct ingress_policer { >> enum dpdk_hw_ol_features { >> NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0, >> NETDEV_RX_HW_CRC_STRIP = 1 << 1, >> - NETDEV_RX_HW_SCATTER = 1 << 2 >> + NETDEV_RX_HW_SCATTER = 1 << 2, >> + NETDEV_TX_TSO_OFFLOAD = 1 << 3, >> }; >> /* >> @@ -992,6 +996,12 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, >> int n_rxq, int n_txq) >> conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC; >> } >> + if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) { >> + conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO; >> + conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM; >> + conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM; >> + } >> + >> /* Limit configured rss hash functions to only those supported >> * by the eth device. */ >> conf.rx_adv_conf.rss_conf.rss_hf &= info.flow_type_rss_offloads; >> @@ -1093,6 +1103,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) >> uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM | >> DEV_RX_OFFLOAD_TCP_CKSUM | >> DEV_RX_OFFLOAD_IPV4_CKSUM; >> + uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO | >> + DEV_TX_OFFLOAD_TCP_CKSUM | >> + DEV_TX_OFFLOAD_IPV4_CKSUM; >> rte_eth_dev_info_get(dev->port_id, &info); >> @@ -1119,6 +1132,14 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) >> dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER; >> } >> + if (info.tx_offload_capa & tx_tso_offload_capa) { >> + dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD; >> + } else { >> + dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD; >> + VLOG_WARN("Tx TSO offload is not supported on %s port " >> + DPDK_PORT_ID_FMT, netdev_get_name(&dev->up), >> dev->port_id); >> + } >> + >> n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq); >> n_txq = MIN(info.max_tx_queues, dev->up.n_txq); >> @@ -1369,14 +1390,16 @@ netdev_dpdk_vhost_construct(struct netdev >> *netdev) >> goto out; >> } >> - err = rte_vhost_driver_disable_features(dev->vhost_id, >> - 1ULL << VIRTIO_NET_F_HOST_TSO4 >> - | 1ULL << VIRTIO_NET_F_HOST_TSO6 >> - | 1ULL << VIRTIO_NET_F_CSUM); >> - if (err) { >> - VLOG_ERR("rte_vhost_driver_disable_features failed for vhost >> user " >> - "port: %s\n", name); >> - goto out; >> + if (!userspace_tso_enabled()) { >> + err = rte_vhost_driver_disable_features(dev->vhost_id, >> + 1ULL << VIRTIO_NET_F_HOST_TSO4 >> + | 1ULL << VIRTIO_NET_F_HOST_TSO6 >> + | 1ULL << VIRTIO_NET_F_CSUM); >> + if (err) { >> + VLOG_ERR("rte_vhost_driver_disable_features failed for >> vhost user " >> + "port: %s\n", name); >> + goto out; >> + } >> } >> err = rte_vhost_driver_start(dev->vhost_id); >> @@ -1711,6 +1734,11 @@ netdev_dpdk_get_config(const struct netdev >> *netdev, struct smap *args) >> } else { >> smap_add(args, "rx_csum_offload", "false"); >> } >> + if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) { >> + smap_add(args, "tx_tso_offload", "true"); >> + } else { >> + smap_add(args, "tx_tso_offload", "false"); >> + } >> smap_add(args, "lsc_interrupt_mode", >> dev->lsc_interrupt_mode ? "true" : "false"); >> } >> @@ -2138,6 +2166,67 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq) >> rte_free(rx); >> } >> +/* Prepare the packet for HWOL. >> + * Return True if the packet is OK to continue. 
*/ >> +static bool >> +netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf >> *mbuf) >> +{ >> + struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf); >> + >> + if (mbuf->ol_flags & PKT_TX_L4_MASK) { >> + mbuf->l2_len = (char *)dp_packet_l3(pkt) - (char >> *)dp_packet_eth(pkt); >> + mbuf->l3_len = (char *)dp_packet_l4(pkt) - (char >> *)dp_packet_l3(pkt); >> + mbuf->outer_l2_len = 0; >> + mbuf->outer_l3_len = 0; >> + } >> + >> + if (mbuf->ol_flags & PKT_TX_TCP_SEG) { >> + struct tcp_header *th = dp_packet_l4(pkt); >> + >> + if (!th) { >> + VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header" >> + " pkt len: %"PRIu32"", dev->up.name, >> mbuf->pkt_len); >> + return false; >> + } >> + >> + mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4; >> + mbuf->ol_flags |= PKT_TX_TCP_CKSUM; >> + mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len; >> + >> + if (mbuf->ol_flags & PKT_TX_IPV4) { >> + mbuf->ol_flags |= PKT_TX_IP_CKSUM; >> + } >> + } >> + return true; >> +} >> + >> +/* Prepare a batch for HWOL. >> + * Return the number of good packets in the batch. */ >> +static int >> +netdev_dpdk_prep_hwol_batch(struct netdev_dpdk *dev, struct rte_mbuf >> **pkts, >> + int pkt_cnt) >> +{ >> + int i = 0; >> + int cnt = 0; >> + struct rte_mbuf *pkt; >> + >> + /* Prepare and filter bad HWOL packets. */ >> + for (i = 0; i < pkt_cnt; i++) { >> + pkt = pkts[i]; >> + if (!netdev_dpdk_prep_hwol_packet(dev, pkt)) { >> + rte_pktmbuf_free(pkt); >> + continue; >> + } >> + >> + if (OVS_UNLIKELY(i != cnt)) { >> + pkts[cnt] = pkt; >> + } >> + cnt++; >> + } >> + >> + return cnt; >> +} >> + >> /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'. Takes >> ownership of >> * 'pkts', even in case of failure. >> * >> @@ -2147,11 +2236,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk >> *dev, int qid, >> struct rte_mbuf **pkts, int cnt) >> { >> uint32_t nb_tx = 0; >> + uint16_t nb_tx_prep = cnt; >> + >> + if (userspace_tso_enabled()) { >> + nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt); >> + if (nb_tx_prep != cnt) { >> + VLOG_WARN_RL(&rl, "%s: Output batch contains invalid >> packets. " >> + "Only %u/%u are valid: %s", dev->up.name, >> nb_tx_prep, >> + cnt, rte_strerror(rte_errno)); >> + } >> + } >> - while (nb_tx != cnt) { >> + while (nb_tx != nb_tx_prep) { >> uint32_t ret; >> - ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - >> nb_tx); >> + ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, >> + nb_tx_prep - nb_tx); >> if (!ret) { >> break; >> } >> @@ -2437,11 +2537,14 @@ netdev_dpdk_filter_packet_len(struct >> netdev_dpdk *dev, struct rte_mbuf **pkts, >> int cnt = 0; >> struct rte_mbuf *pkt; >> + /* Filter oversized packets, unless they are marked for TSO. */ >> for (i = 0; i < pkt_cnt; i++) { >> pkt = pkts[i]; >> - if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) { >> - VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " >> max_packet_len %d", >> - dev->up.name, pkt->pkt_len, >> dev->max_packet_len); >> + if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len) >> + && !(pkt->ol_flags & PKT_TX_TCP_SEG))) { >> + VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " " >> + "max_packet_len %d", dev->up.name, >> pkt->pkt_len, >> + dev->max_packet_len); >> rte_pktmbuf_free(pkt); >> continue; >> } >> @@ -2463,7 +2566,8 @@ netdev_dpdk_vhost_update_tx_counters(struct >> netdev_dpdk *dev, >> { >> int dropped = sw_stats_add->tx_mtu_exceeded_drops + >> sw_stats_add->tx_qos_drops + >> - sw_stats_add->tx_failure_drops; >> + sw_stats_add->tx_failure_drops + >> + sw_stats_add->tx_invalid_hwol_drops; >> struct netdev_stats *stats = &dev->stats; >> int sent = attempted - dropped; >> int i; >> @@ -2482,6 +2586,7 @@ netdev_dpdk_vhost_update_tx_counters(struct >> netdev_dpdk *dev, >> sw_stats->tx_failure_drops += >> sw_stats_add->tx_failure_drops; >> sw_stats->tx_mtu_exceeded_drops += >> sw_stats_add->tx_mtu_exceeded_drops; >> sw_stats->tx_qos_drops += sw_stats_add->tx_qos_drops; >> + sw_stats->tx_invalid_hwol_drops += >> sw_stats_add->tx_invalid_hwol_drops; >> } >> } >> @@ -2513,8 +2618,15 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, >> int qid, >> rte_spinlock_lock(&dev->tx_q[qid].tx_lock); >> } >> + sw_stats_add.tx_invalid_hwol_drops = cnt; >> + if (userspace_tso_enabled()) { >> + cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt); >> + } >> + >> + sw_stats_add.tx_invalid_hwol_drops -= cnt; >> + sw_stats_add.tx_mtu_exceeded_drops = cnt; >> cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt); >> - sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt; >> + sw_stats_add.tx_mtu_exceeded_drops -= cnt; >> /* Check if QoS has been configured for the netdev */ >> sw_stats_add.tx_qos_drops = cnt; >> @@ -2562,6 +2674,120 @@ out: >> } >> } >> +static void >> +netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque) >> +{ >> + rte_free(opaque); >> +} >> + >> +static struct rte_mbuf * >> +dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len) >> +{ >> + uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len; >> + struct rte_mbuf_ext_shared_info *shinfo = NULL; >> + uint16_t buf_len; >> + void *buf; >> + >> + if (rte_pktmbuf_tailroom(pkt) >= sizeof *shinfo) { >> + shinfo = rte_pktmbuf_mtod(pkt, struct >> rte_mbuf_ext_shared_info *); >> + } else { >> + total_len += sizeof *shinfo + sizeof(uintptr_t); >> + total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t)); >> + } >> + >> + if (OVS_UNLIKELY(total_len > UINT16_MAX)) { >> + VLOG_ERR("Can't copy packet: too big %u", total_len); >> + return NULL; >> + } >> + >> + buf_len = total_len; >> + buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE); >> + if (OVS_UNLIKELY(buf == NULL)) { >> + VLOG_ERR("Failed to allocate memory using rte_malloc: %u", >> buf_len); >> + return NULL; >> + } >> + >> + /* Initialize shinfo. 
*/ >> + if (shinfo) { >> + shinfo->free_cb = netdev_dpdk_extbuf_free; >> + shinfo->fcb_opaque = buf; >> + rte_mbuf_ext_refcnt_set(shinfo, 1); >> + } else { >> + shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len, >> + >> netdev_dpdk_extbuf_free, >> + buf); >> + if (OVS_UNLIKELY(shinfo == NULL)) { >> + rte_free(buf); >> + VLOG_ERR("Failed to initialize shared info for mbuf while " >> + "attempting to attach an external buffer."); >> + return NULL; >> + } >> + } >> + >> + rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), >> buf_len, >> + shinfo); >> + rte_pktmbuf_reset_headroom(pkt); >> + >> + return pkt; >> +} >> + >> +static struct rte_mbuf * >> +dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len) >> +{ >> + struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp); >> + >> + if (OVS_UNLIKELY(!pkt)) { >> + return NULL; >> + } >> + >> + if (rte_pktmbuf_tailroom(pkt) >= data_len) { >> + return pkt; >> + } >> + >> + if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) { >> + return pkt; >> + } >> + >> + rte_pktmbuf_free(pkt); >> + >> + return NULL; >> +} >> + >> +static struct dp_packet * >> +dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet >> *pkt_orig) >> +{ >> + struct rte_mbuf *mbuf_dest; >> + struct dp_packet *pkt_dest; >> + uint32_t pkt_len; >> + >> + pkt_len = dp_packet_size(pkt_orig); >> + mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len); >> + if (OVS_UNLIKELY(mbuf_dest == NULL)) { >> + return NULL; >> + } >> + >> + pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf); >> + memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len); >> + dp_packet_set_size(pkt_dest, pkt_len); >> + >> + mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload; >> + mbuf_dest->packet_type = pkt_orig->mbuf.packet_type; >> + mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags & >> + ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF)); >> + >> + memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size, >> + sizeof(struct dp_packet) - offsetof(struct dp_packet, >> l2_pad_size)); >> + >> + if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) { >> + mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest) >> + - (char *)dp_packet_eth(pkt_dest); >> + mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest) >> + - (char *) dp_packet_l3(pkt_dest); >> + } >> + >> + return pkt_dest; >> +} >> + >> /* Tx function. 
Transmit packets indefinitely */ >> static void >> dpdk_do_tx_copy(struct netdev *netdev, int qid, struct >> dp_packet_batch *batch) >> @@ -2575,7 +2801,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, >> struct dp_packet_batch *batch) >> enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST }; >> #endif >> struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >> - struct rte_mbuf *pkts[PKT_ARRAY_SIZE]; >> + struct dp_packet *pkts[PKT_ARRAY_SIZE]; >> struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats; >> uint32_t cnt = batch_cnt; >> uint32_t dropped = 0; >> @@ -2596,34 +2822,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int >> qid, struct dp_packet_batch *batch) >> struct dp_packet *packet = batch->packets[i]; >> uint32_t size = dp_packet_size(packet); >> - if (OVS_UNLIKELY(size > dev->max_packet_len)) { >> - VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", >> - size, dev->max_packet_len); >> - >> + if (size > dev->max_packet_len >> + && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) { >> + VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size, >> + dev->max_packet_len); >> mtu_drops++; >> continue; >> } >> - pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp); >> + pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, >> packet); >> if (OVS_UNLIKELY(!pkts[txcnt])) { >> dropped = cnt - i; >> break; >> } >> - /* We have to do a copy for now */ >> - memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *), >> - dp_packet_data(packet), size); >> - dp_packet_set_size((struct dp_packet *)pkts[txcnt], size); >> - >> txcnt++; >> } >> if (OVS_LIKELY(txcnt)) { >> if (dev->type == DPDK_DEV_VHOST) { >> - __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet >> **) pkts, >> - txcnt); >> + __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt); >> } else { >> - tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, >> txcnt); >> + tx_failure += netdev_dpdk_eth_tx_burst(dev, qid, >> + (struct rte_mbuf >> **)pkts, >> + txcnt); >> } >> } >> @@ -2676,26 +2898,33 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, >> int qid, >> dp_packet_delete_batch(batch, true); >> } else { >> struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats; >> - int tx_cnt, dropped; >> - int tx_failure, mtu_drops, qos_drops; >> + int dropped; >> + int tx_failure, mtu_drops, qos_drops, hwol_drops; >> int batch_cnt = dp_packet_batch_size(batch); >> struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets; >> - tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt); >> - mtu_drops = batch_cnt - tx_cnt; >> - qos_drops = tx_cnt; >> - tx_cnt = netdev_dpdk_qos_run(dev, pkts, tx_cnt, true); >> - qos_drops -= tx_cnt; >> + hwol_drops = batch_cnt; >> + if (userspace_tso_enabled()) { >> + batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, >> batch_cnt); >> + } >> + hwol_drops -= batch_cnt; >> + mtu_drops = batch_cnt; >> + batch_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt); >> + mtu_drops -= batch_cnt; >> + qos_drops = batch_cnt; >> + batch_cnt = netdev_dpdk_qos_run(dev, pkts, batch_cnt, true); >> + qos_drops -= batch_cnt; >> - tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, tx_cnt); >> + tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, >> batch_cnt); >> - dropped = tx_failure + mtu_drops + qos_drops; >> + dropped = tx_failure + mtu_drops + qos_drops + hwol_drops; >> if (OVS_UNLIKELY(dropped)) { >> rte_spinlock_lock(&dev->stats_lock); >> dev->stats.tx_dropped += dropped; >> sw_stats->tx_failure_drops += tx_failure; >> sw_stats->tx_mtu_exceeded_drops += mtu_drops; >> sw_stats->tx_qos_drops += qos_drops; >> + 
sw_stats->tx_invalid_hwol_drops += hwol_drops; >> rte_spinlock_unlock(&dev->stats_lock); >> } >> } >> @@ -3011,7 +3240,8 @@ netdev_dpdk_get_sw_custom_stats(const struct >> netdev *netdev, >> SW_CSTAT(tx_failure_drops) \ >> SW_CSTAT(tx_mtu_exceeded_drops) \ >> SW_CSTAT(tx_qos_drops) \ >> - SW_CSTAT(rx_qos_drops) >> + SW_CSTAT(rx_qos_drops) \ >> + SW_CSTAT(tx_invalid_hwol_drops) >> #define SW_CSTAT(NAME) + 1 >> custom_stats->size = SW_CSTATS; >> @@ -4874,6 +5104,12 @@ netdev_dpdk_reconfigure(struct netdev *netdev) >> rte_free(dev->tx_q); >> err = dpdk_eth_dev_init(dev); >> + if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) { >> + netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO; >> + netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM; >> + netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM; >> + } >> + >> dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq); >> if (!dev->tx_q) { >> err = ENOMEM; >> @@ -4903,6 +5139,11 @@ dpdk_vhost_reconfigure_helper(struct >> netdev_dpdk *dev) >> dev->tx_q[0].map = 0; >> } >> + if (userspace_tso_enabled()) { >> + dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD; >> + VLOG_DBG("%s: TSO enabled on vhost port", >> netdev_get_name(&dev->up)); >> + } >> + >> netdev_dpdk_remap_txqs(dev); >> err = netdev_dpdk_mempool_configure(dev); >> @@ -4975,6 +5216,11 @@ netdev_dpdk_vhost_client_reconfigure(struct >> netdev *netdev) >> vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY; >> } >> + /* Enable External Buffers if TCP Segmentation Offload is >> enabled. */ >> + if (userspace_tso_enabled()) { >> + vhost_flags |= RTE_VHOST_USER_EXTBUF_SUPPORT; >> + } >> + >> err = rte_vhost_driver_register(dev->vhost_id, vhost_flags); >> if (err) { >> VLOG_ERR("vhost-user device setup failure for device %s\n", >> @@ -4999,14 +5245,20 @@ netdev_dpdk_vhost_client_reconfigure(struct >> netdev *netdev) >> goto unlock; >> } >> - err = rte_vhost_driver_disable_features(dev->vhost_id, >> - 1ULL << VIRTIO_NET_F_HOST_TSO4 >> - | 1ULL << VIRTIO_NET_F_HOST_TSO6 >> - | 1ULL << VIRTIO_NET_F_CSUM); >> - if (err) { >> - VLOG_ERR("rte_vhost_driver_disable_features failed for >> vhost user " >> - "client port: %s\n", dev->up.name); >> - goto unlock; >> + if (userspace_tso_enabled()) { >> + netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO; >> + netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM; >> + netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM; >> + } else { >> + err = rte_vhost_driver_disable_features(dev->vhost_id, >> + 1ULL << VIRTIO_NET_F_HOST_TSO4 >> + | 1ULL << VIRTIO_NET_F_HOST_TSO6 >> + | 1ULL << VIRTIO_NET_F_CSUM); >> + if (err) { >> + VLOG_ERR("rte_vhost_driver_disable_features failed for " >> + "vhost user client port: %s\n", dev->up.name); >> + goto unlock; >> + } >> } >> err = rte_vhost_driver_start(dev->vhost_id); >> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h >> index f08159aa7..9dbc67658 100644 >> --- a/lib/netdev-linux-private.h >> +++ b/lib/netdev-linux-private.h >> @@ -27,6 +27,7 @@ >> #include <stdint.h> >> #include <stdbool.h> >> +#include "dp-packet.h" >> #include "netdev-afxdp.h" >> #include "netdev-afxdp-pool.h" >> #include "netdev-provider.h" >> @@ -37,10 +38,13 @@ >> struct netdev; >> +#define LINUX_RXQ_TSO_MAX_LEN 65536 >> + >> struct netdev_rxq_linux { >> struct netdev_rxq up; >> bool is_tap; >> int fd; >> + char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO >> buffers. 
*/ >> }; >> int netdev_linux_construct(struct netdev *); >> @@ -92,6 +96,7 @@ struct netdev_linux { >> int tap_fd; >> bool present; /* If the device is present in the >> namespace */ >> uint64_t tx_dropped; /* tap device can drop if the iface >> is down */ >> + uint64_t rx_dropped; /* Packets dropped while recv from >> kernel. */ >> /* LAG information. */ >> bool is_lag_master; /* True if the netdev is a LAG >> master. */ >> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c >> index 41d1e9273..a4a666657 100644 >> --- a/lib/netdev-linux.c >> +++ b/lib/netdev-linux.c >> @@ -29,16 +29,18 @@ >> #include <linux/filter.h> >> #include <linux/gen_stats.h> >> #include <linux/if_ether.h> >> +#include <linux/if_packet.h> >> #include <linux/if_tun.h> >> #include <linux/types.h> >> #include <linux/ethtool.h> >> #include <linux/mii.h> >> #include <linux/rtnetlink.h> >> #include <linux/sockios.h> >> +#include <linux/virtio_net.h> >> #include <sys/ioctl.h> >> #include <sys/socket.h> >> +#include <sys/uio.h> >> #include <sys/utsname.h> >> -#include <netpacket/packet.h> >> #include <net/if.h> >> #include <net/if_arp.h> >> #include <net/route.h> >> @@ -75,6 +77,7 @@ >> #include "timer.h" >> #include "unaligned.h" >> #include "openvswitch/vlog.h" >> +#include "userspace-tso.h" >> #include "util.h" >> VLOG_DEFINE_THIS_MODULE(netdev_linux); >> @@ -237,6 +240,16 @@ enum { >> VALID_DRVINFO = 1 << 6, >> VALID_FEATURES = 1 << 7, >> }; >> + >> +/* Use one for the packet buffer and another for the aux buffer to >> receive >> + * TSO packets. */ >> +#define IOV_STD_SIZE 1 >> +#define IOV_TSO_SIZE 2 >> + >> +enum { >> + IOV_PACKET = 0, >> + IOV_AUXBUF = 1, >> +}; >> >> struct linux_lag_slave { >> uint32_t block_id; >> @@ -501,6 +514,8 @@ static struct vlog_rate_limit rl = >> VLOG_RATE_LIMIT_INIT(5, 20); >> * changes in the device miimon status, so we can use atomic_count. */ >> static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0); >> +static int netdev_linux_parse_vnet_hdr(struct dp_packet *b); >> +static void netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu); >> static int netdev_linux_do_ethtool(const char *name, struct >> ethtool_cmd *, >> int cmd, const char *cmd_name); >> static int get_flags(const struct netdev *, unsigned int *flags); >> @@ -902,6 +917,13 @@ netdev_linux_common_construct(struct netdev >> *netdev_) >> /* The device could be in the same network namespace or in >> another one. */ >> netnsid_unset(&netdev->netnsid); >> ovs_mutex_init(&netdev->mutex); >> + >> + if (userspace_tso_enabled()) { >> + netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO; >> + netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM; >> + netdev_->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM; >> + } >> + >> return 0; >> } >> @@ -961,6 +983,10 @@ netdev_linux_construct_tap(struct netdev *netdev_) >> /* Create tap device. */ >> get_flags(&netdev->up, &netdev->ifi_flags); >> ifr.ifr_flags = IFF_TAP | IFF_NO_PI; >> + if (userspace_tso_enabled()) { >> + ifr.ifr_flags |= IFF_VNET_HDR; >> + } >> + >> ovs_strzcpy(ifr.ifr_name, name, sizeof ifr.ifr_name); >> if (ioctl(netdev->tap_fd, TUNSETIFF, &ifr) == -1) { >> VLOG_WARN("%s: creating tap device failed: %s", name, >> @@ -1024,6 +1050,15 @@ static struct netdev_rxq * >> netdev_linux_rxq_alloc(void) >> { >> struct netdev_rxq_linux *rx = xzalloc(sizeof *rx); >> + if (userspace_tso_enabled()) { >> + int i; >> + >> + /* Allocate auxiliary buffers to receive TSO packets. 
*/ >> + for (i = 0; i < NETDEV_MAX_BURST; i++) { >> + rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN); >> + } >> + } >> + >> return &rx->up; >> } >> @@ -1069,6 +1104,15 @@ netdev_linux_rxq_construct(struct netdev_rxq >> *rxq_) >> goto error; >> } >> + if (userspace_tso_enabled() >> + && setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val, >> + sizeof val)) { >> + error = errno; >> + VLOG_ERR("%s: failed to enable vnet hdr in txq raw >> socket: %s", >> + netdev_get_name(netdev_), ovs_strerror(errno)); >> + goto error; >> + } >> + >> /* Set non-blocking mode. */ >> error = set_nonblocking(rx->fd); >> if (error) { >> @@ -1119,10 +1163,15 @@ static void >> netdev_linux_rxq_destruct(struct netdev_rxq *rxq_) >> { >> struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_); >> + int i; >> if (!rx->is_tap) { >> close(rx->fd); >> } >> + >> + for (i = 0; i < NETDEV_MAX_BURST; i++) { >> + free(rx->aux_bufs[i]); >> + } >> } >> static void >> @@ -1159,12 +1208,14 @@ auxdata_has_vlan_tci(const struct >> tpacket_auxdata *aux) >> * It also used recvmmsg to reduce multiple syscalls overhead; >> */ >> static int >> -netdev_linux_batch_rxq_recv_sock(int fd, int mtu, >> +netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu, >> struct dp_packet_batch *batch) >> { >> - size_t size; >> + int iovlen; >> + size_t std_len; >> ssize_t retval; >> - struct iovec iovs[NETDEV_MAX_BURST]; >> + int virtio_net_hdr_size; >> + struct iovec iovs[NETDEV_MAX_BURST][IOV_TSO_SIZE]; >> struct cmsghdr *cmsg; >> union { >> struct cmsghdr cmsg; >> @@ -1174,41 +1225,87 @@ netdev_linux_batch_rxq_recv_sock(int fd, int mtu, >> struct dp_packet *buffers[NETDEV_MAX_BURST]; >> int i; >> + if (userspace_tso_enabled()) { >> + /* Use the buffer from the allocated packet below to receive MTU >> + * sized packets and an aux_buf for extra TSO data. */ >> + iovlen = IOV_TSO_SIZE; >> + virtio_net_hdr_size = sizeof(struct virtio_net_hdr); >> + } else { >> + /* Use only the buffer from the allocated packet. 
*/ >> + iovlen = IOV_STD_SIZE; >> + virtio_net_hdr_size = 0; >> + } >> + >> + std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size; >> for (i = 0; i < NETDEV_MAX_BURST; i++) { >> - buffers[i] = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN >> + mtu, >> - DP_NETDEV_HEADROOM); >> - /* Reserve headroom for a single VLAN tag */ >> - dp_packet_reserve(buffers[i], VLAN_HEADER_LEN); >> - size = dp_packet_tailroom(buffers[i]); >> - iovs[i].iov_base = dp_packet_data(buffers[i]); >> - iovs[i].iov_len = size; >> + buffers[i] = dp_packet_new_with_headroom(std_len, >> DP_NETDEV_HEADROOM); >> + iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]); >> + iovs[i][IOV_PACKET].iov_len = std_len; >> + iovs[i][IOV_AUXBUF].iov_base = rx->aux_bufs[i]; >> + iovs[i][IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN; >> mmsgs[i].msg_hdr.msg_name = NULL; >> mmsgs[i].msg_hdr.msg_namelen = 0; >> - mmsgs[i].msg_hdr.msg_iov = &iovs[i]; >> - mmsgs[i].msg_hdr.msg_iovlen = 1; >> + mmsgs[i].msg_hdr.msg_iov = iovs[i]; >> + mmsgs[i].msg_hdr.msg_iovlen = iovlen; >> mmsgs[i].msg_hdr.msg_control = &cmsg_buffers[i]; >> mmsgs[i].msg_hdr.msg_controllen = sizeof cmsg_buffers[i]; >> mmsgs[i].msg_hdr.msg_flags = 0; >> } >> do { >> - retval = recvmmsg(fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, NULL); >> + retval = recvmmsg(rx->fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, >> NULL); >> } while (retval < 0 && errno == EINTR); >> if (retval < 0) { >> - /* Save -errno to retval temporarily */ >> - retval = -errno; >> - i = 0; >> - goto free_buffers; >> + retval = errno; >> + for (i = 0; i < NETDEV_MAX_BURST; i++) { >> + dp_packet_delete(buffers[i]); >> + } >> + >> + return retval; >> } >> for (i = 0; i < retval; i++) { >> if (mmsgs[i].msg_len < ETH_HEADER_LEN) { >> - break; >> + struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up); >> + struct netdev_linux *netdev = netdev_linux_cast(netdev_); >> + >> + dp_packet_delete(buffers[i]); >> + netdev->rx_dropped += 1; >> + VLOG_WARN_RL(&rl, "%s: Dropped packet: less than ether >> hdr size", >> + netdev_get_name(netdev_)); >> + continue; >> + } >> + >> + if (mmsgs[i].msg_len > std_len) { >> + /* Build a single linear TSO packet by expanding the >> current packet >> + * to append the data received in the aux_buf. */ >> + size_t extra_len = mmsgs[i].msg_len - std_len; >> + >> + dp_packet_set_size(buffers[i], dp_packet_size(buffers[i]) >> + + std_len); >> + dp_packet_prealloc_tailroom(buffers[i], extra_len); >> + memcpy(dp_packet_tail(buffers[i]), rx->aux_bufs[i], >> extra_len); >> + dp_packet_set_size(buffers[i], dp_packet_size(buffers[i]) >> + + extra_len); >> + } else { >> + dp_packet_set_size(buffers[i], dp_packet_size(buffers[i]) >> + + mmsgs[i].msg_len); >> } >> - dp_packet_set_size(buffers[i], >> - dp_packet_size(buffers[i]) + >> mmsgs[i].msg_len); >> + if (virtio_net_hdr_size && >> netdev_linux_parse_vnet_hdr(buffers[i])) { >> + struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up); >> + struct netdev_linux *netdev = netdev_linux_cast(netdev_); >> + >> + /* Unexpected error situation: the virtio header is not >> present >> + * or corrupted. Drop the packet but continue in case >> next ones >> + * are correct. 
*/ >> + dp_packet_delete(buffers[i]); >> + netdev->rx_dropped += 1; >> + VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net >> header", >> + netdev_get_name(netdev_)); >> + continue; >> + } >> for (cmsg = CMSG_FIRSTHDR(&mmsgs[i].msg_hdr); cmsg; >> cmsg = CMSG_NXTHDR(&mmsgs[i].msg_hdr, cmsg)) { >> @@ -1238,22 +1335,11 @@ netdev_linux_batch_rxq_recv_sock(int fd, int mtu, >> dp_packet_batch_add(batch, buffers[i]); >> } >> -free_buffers: >> - /* Free unused buffers, including buffers whose size is less than >> - * ETH_HEADER_LEN. >> - * >> - * Note: i has been set correctly by the above for loop, so don't >> - * try to re-initialize it. >> - */ >> + /* Delete unused buffers. */ >> for (; i < NETDEV_MAX_BURST; i++) { >> dp_packet_delete(buffers[i]); >> } >> - /* netdev_linux_rxq_recv needs it to return 0 or positive errno */ >> - if (retval < 0) { >> - return -retval; >> - } >> - >> return 0; >> } >> @@ -1263,20 +1349,40 @@ free_buffers: >> * packets are added into *batch. The return value is 0 or errno. >> */ >> static int >> -netdev_linux_batch_rxq_recv_tap(int fd, int mtu, struct >> dp_packet_batch *batch) >> +netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu, >> + struct dp_packet_batch *batch) >> { >> struct dp_packet *buffer; >> + int virtio_net_hdr_size; >> ssize_t retval; >> - size_t size; >> + size_t std_len; >> + int iovlen; >> int i; >> + if (userspace_tso_enabled()) { >> + /* Use the buffer from the allocated packet below to receive MTU >> + * sized packets and an aux_buf for extra TSO data. */ >> + iovlen = IOV_TSO_SIZE; >> + virtio_net_hdr_size = sizeof(struct virtio_net_hdr); >> + } else { >> + /* Use only the buffer from the allocated packet. */ >> + iovlen = IOV_STD_SIZE; >> + virtio_net_hdr_size = 0; >> + } >> + >> + std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size; >> for (i = 0; i < NETDEV_MAX_BURST; i++) { >> + struct iovec iov[IOV_TSO_SIZE]; >> + >> /* Assume Ethernet port. No need to set packet_type. */ >> - buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu, >> - DP_NETDEV_HEADROOM); >> - size = dp_packet_tailroom(buffer); >> + buffer = dp_packet_new_with_headroom(std_len, >> DP_NETDEV_HEADROOM); >> + iov[IOV_PACKET].iov_base = dp_packet_data(buffer); >> + iov[IOV_PACKET].iov_len = std_len; >> + iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i]; >> + iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN; >> + >> do { >> - retval = read(fd, dp_packet_data(buffer), size); >> + retval = readv(rx->fd, iov, iovlen); >> } while (retval < 0 && errno == EINTR); >> if (retval < 0) { >> @@ -1284,7 +1390,33 @@ netdev_linux_batch_rxq_recv_tap(int fd, int >> mtu, struct dp_packet_batch *batch) >> break; >> } >> - dp_packet_set_size(buffer, dp_packet_size(buffer) + retval); >> + if (retval > std_len) { >> + /* Build a single linear TSO packet by expanding the >> current packet >> + * to append the data received in the aux_buf. 
*/ >> + size_t extra_len = retval - std_len; >> + >> + dp_packet_set_size(buffer, dp_packet_size(buffer) + >> std_len); >> + dp_packet_prealloc_tailroom(buffer, extra_len); >> + memcpy(dp_packet_tail(buffer), rx->aux_bufs[i], extra_len); >> + dp_packet_set_size(buffer, dp_packet_size(buffer) + >> extra_len); >> + } else { >> + dp_packet_set_size(buffer, dp_packet_size(buffer) + retval); >> + } >> + >> + if (virtio_net_hdr_size && >> netdev_linux_parse_vnet_hdr(buffer)) { >> + struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up); >> + struct netdev_linux *netdev = netdev_linux_cast(netdev_); >> + >> + /* Unexpected error situation: the virtio header is not >> present >> + * or corrupted. Drop the packet but continue in case >> next ones >> + * are correct. */ >> + dp_packet_delete(buffer); >> + netdev->rx_dropped += 1; >> + VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net >> header", >> + netdev_get_name(netdev_)); >> + continue; >> + } >> + >> dp_packet_batch_add(batch, buffer); >> } >> @@ -1310,8 +1442,8 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, >> struct dp_packet_batch *batch, >> dp_packet_batch_init(batch); >> retval = (rx->is_tap >> - ? netdev_linux_batch_rxq_recv_tap(rx->fd, mtu, batch) >> - : netdev_linux_batch_rxq_recv_sock(rx->fd, mtu, batch)); >> + ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch) >> + : netdev_linux_batch_rxq_recv_sock(rx, mtu, batch)); >> if (retval) { >> if (retval != EAGAIN && retval != EMSGSIZE) { >> @@ -1353,7 +1485,7 @@ netdev_linux_rxq_drain(struct netdev_rxq *rxq_) >> } >> static int >> -netdev_linux_sock_batch_send(int sock, int ifindex, >> +netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu, >> struct dp_packet_batch *batch) >> { >> const size_t size = dp_packet_batch_size(batch); >> @@ -1367,6 +1499,10 @@ netdev_linux_sock_batch_send(int sock, int >> ifindex, >> struct dp_packet *packet; >> DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { >> + if (tso) { >> + netdev_linux_prepend_vnet_hdr(packet, mtu); >> + } >> + >> iov[i].iov_base = dp_packet_data(packet); >> iov[i].iov_len = dp_packet_size(packet); >> mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll, >> @@ -1399,7 +1535,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex, >> * on other interface types because we attach a socket filter to the rx >> * socket. */ >> static int >> -netdev_linux_tap_batch_send(struct netdev *netdev_, >> +netdev_linux_tap_batch_send(struct netdev *netdev_, bool tso, int mtu, >> struct dp_packet_batch *batch) >> { >> struct netdev_linux *netdev = netdev_linux_cast(netdev_); >> @@ -1416,10 +1552,15 @@ netdev_linux_tap_batch_send(struct netdev >> *netdev_, >> } >> DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { >> - size_t size = dp_packet_size(packet); >> + size_t size; >> ssize_t retval; >> int error; >> + if (tso) { >> + netdev_linux_prepend_vnet_hdr(packet, mtu); >> + } >> + >> + size = dp_packet_size(packet); >> do { >> retval = write(netdev->tap_fd, dp_packet_data(packet), >> size); >> error = retval < 0 ? 
errno : 0; >> @@ -1454,9 +1595,15 @@ netdev_linux_send(struct netdev *netdev_, int >> qid OVS_UNUSED, >> struct dp_packet_batch *batch, >> bool concurrent_txq OVS_UNUSED) >> { >> + bool tso = userspace_tso_enabled(); >> + int mtu = ETH_PAYLOAD_MAX; >> int error = 0; >> int sock = 0; >> + if (tso) { >> + netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu); >> + } >> + >> if (!is_tap_netdev(netdev_)) { >> if >> (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) { >> error = EOPNOTSUPP; >> @@ -1475,9 +1622,9 @@ netdev_linux_send(struct netdev *netdev_, int >> qid OVS_UNUSED, >> goto free_batch; >> } >> - error = netdev_linux_sock_batch_send(sock, ifindex, batch); >> + error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, >> batch); >> } else { >> - error = netdev_linux_tap_batch_send(netdev_, batch); >> + error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch); >> } >> if (error) { >> if (error == ENOBUFS) { >> @@ -2045,6 +2192,7 @@ netdev_tap_get_stats(const struct netdev >> *netdev_, struct netdev_stats *stats) >> stats->collisions += dev_stats.collisions; >> } >> stats->tx_dropped += netdev->tx_dropped; >> + stats->rx_dropped += netdev->rx_dropped; >> ovs_mutex_unlock(&netdev->mutex); >> return error; >> @@ -6223,6 +6371,17 @@ af_packet_sock(void) >> if (error) { >> close(sock); >> sock = -error; >> + } else if (userspace_tso_enabled()) { >> + int val = 1; >> + error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, >> &val, >> + sizeof val); >> + if (error) { >> + error = errno; >> + VLOG_ERR("failed to enable vnet hdr in raw >> socket: %s", >> + ovs_strerror(errno)); >> + close(sock); >> + sock = -error; >> + } >> } >> } else { >> sock = -errno; >> @@ -6234,3 +6393,136 @@ af_packet_sock(void) >> return sock; >> } >> + >> +static int >> +netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto) >> +{ >> + struct eth_header *eth_hdr; >> + ovs_be16 eth_type; >> + int l2_len; >> + >> + eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN); >> + if (!eth_hdr) { >> + return -EINVAL; >> + } >> + >> + l2_len = ETH_HEADER_LEN; >> + eth_type = eth_hdr->eth_type; >> + if (eth_type_vlan(eth_type)) { >> + struct vlan_header *vlan = dp_packet_at(b, l2_len, >> VLAN_HEADER_LEN); >> + >> + if (!vlan) { >> + return -EINVAL; >> + } >> + >> + eth_type = vlan->vlan_next_type; >> + l2_len += VLAN_HEADER_LEN; >> + } >> + >> + if (eth_type == htons(ETH_TYPE_IP)) { >> + struct ip_header *ip_hdr = dp_packet_at(b, l2_len, >> IP_HEADER_LEN); >> + >> + if (!ip_hdr) { >> + return -EINVAL; >> + } >> + >> + *l4proto = ip_hdr->ip_proto; >> + dp_packet_hwol_set_tx_ipv4(b); >> + } else if (eth_type == htons(ETH_TYPE_IPV6)) { >> + struct ovs_16aligned_ip6_hdr *nh6; >> + >> + nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN); >> + if (!nh6) { >> + return -EINVAL; >> + } >> + >> + *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt; >> + dp_packet_hwol_set_tx_ipv6(b); >> + } >> + >> + return 0; >> +} >> + >> +static int >> +netdev_linux_parse_vnet_hdr(struct dp_packet *b) >> +{ >> + struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet); >> + uint16_t l4proto = 0; >> + >> + if (OVS_UNLIKELY(!vnet)) { >> + return -EINVAL; >> + } >> + >> + if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) { >> + return 0; >> + } >> + >> + if (netdev_linux_parse_l2(b, &l4proto)) { >> + return -EINVAL; >> + } >> + >> + if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) { >> + if (l4proto == IPPROTO_TCP) { >> + dp_packet_hwol_set_csum_tcp(b); >> + } else if (l4proto == IPPROTO_UDP) { >> + 
>> @@ -6234,3 +6393,136 @@ af_packet_sock(void)
>>
>>      return sock;
>>  }
>> +
>> +static int
>> +netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto)
>> +{
>> +    struct eth_header *eth_hdr;
>> +    ovs_be16 eth_type;
>> +    int l2_len;
>> +
>> +    eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN);
>> +    if (!eth_hdr) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    l2_len = ETH_HEADER_LEN;
>> +    eth_type = eth_hdr->eth_type;
>> +    if (eth_type_vlan(eth_type)) {
>> +        struct vlan_header *vlan = dp_packet_at(b, l2_len, VLAN_HEADER_LEN);
>> +
>> +        if (!vlan) {
>> +            return -EINVAL;
>> +        }
>> +
>> +        eth_type = vlan->vlan_next_type;
>> +        l2_len += VLAN_HEADER_LEN;
>> +    }
>> +
>> +    if (eth_type == htons(ETH_TYPE_IP)) {
>> +        struct ip_header *ip_hdr = dp_packet_at(b, l2_len, IP_HEADER_LEN);
>> +
>> +        if (!ip_hdr) {
>> +            return -EINVAL;
>> +        }
>> +
>> +        *l4proto = ip_hdr->ip_proto;
>> +        dp_packet_hwol_set_tx_ipv4(b);
>> +    } else if (eth_type == htons(ETH_TYPE_IPV6)) {
>> +        struct ovs_16aligned_ip6_hdr *nh6;
>> +
>> +        nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN);
>> +        if (!nh6) {
>> +            return -EINVAL;
>> +        }
>> +
>> +        *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt;
>> +        dp_packet_hwol_set_tx_ipv6(b);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int
>> +netdev_linux_parse_vnet_hdr(struct dp_packet *b)
>> +{
>> +    struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet);
>> +    uint16_t l4proto = 0;
>> +
>> +    if (OVS_UNLIKELY(!vnet)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) {
>> +        return 0;
>> +    }
>> +
>> +    if (netdev_linux_parse_l2(b, &l4proto)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
>> +        if (l4proto == IPPROTO_TCP) {
>> +            dp_packet_hwol_set_csum_tcp(b);
>> +        } else if (l4proto == IPPROTO_UDP) {
>> +            dp_packet_hwol_set_csum_udp(b);
>> +        } else if (l4proto == IPPROTO_SCTP) {
>> +            dp_packet_hwol_set_csum_sctp(b);
>> +        }
>> +    }
>> +
>> +    if (l4proto && vnet->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
>> +        uint8_t allowed_mask = VIRTIO_NET_HDR_GSO_TCPV4
>> +                               | VIRTIO_NET_HDR_GSO_TCPV6
>> +                               | VIRTIO_NET_HDR_GSO_UDP;
>> +        uint8_t type = vnet->gso_type & allowed_mask;
>> +
>> +        if (type == VIRTIO_NET_HDR_GSO_TCPV4
>> +            || type == VIRTIO_NET_HDR_GSO_TCPV6) {
>> +            dp_packet_hwol_set_tcp_seg(b);
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void
>> +netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
>> +{
>> +    struct virtio_net_hdr *vnet = dp_packet_push_zeros(b, sizeof *vnet);
>> +
>> +    if (dp_packet_hwol_is_tso(b)) {
>> +        uint16_t hdr_len = ((char *)dp_packet_l4(b) - (char *)dp_packet_eth(b))
>> +                           + TCP_HEADER_LEN;
>> +
>> +        vnet->hdr_len = (OVS_FORCE __virtio16)hdr_len;
>> +        vnet->gso_size = (OVS_FORCE __virtio16)(mtu - hdr_len);
>> +        if (dp_packet_hwol_is_ipv4(b)) {
>> +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
>> +        } else {
>> +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
>> +        }
>> +
>> +    } else {
>> +        vnet->flags = VIRTIO_NET_HDR_GSO_NONE;
>> +    }
>> +
>> +    if (dp_packet_hwol_l4_mask(b)) {
>> +        vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
>> +        vnet->csum_start = (OVS_FORCE __virtio16)((char *)dp_packet_l4(b)
>> +                                                  - (char *)dp_packet_eth(b));
>> +
>> +        if (dp_packet_hwol_l4_is_tcp(b)) {
>> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
>> +                                    struct tcp_header, tcp_csum);
>> +        } else if (dp_packet_hwol_l4_is_udp(b)) {
>> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
>> +                                    struct udp_header, udp_csum);
>> +        } else if (dp_packet_hwol_l4_is_sctp(b)) {
>> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
>> +                                    struct sctp_header, sctp_csum);
>> +        } else {
>> +            VLOG_WARN_RL(&rl, "Unsupported L4 protocol");
>> +        }
>> +    }
>> +}
>> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
>> index f109c4e66..22f4cde33 100644
>> --- a/lib/netdev-provider.h
>> +++ b/lib/netdev-provider.h
>> @@ -37,6 +37,12 @@ extern "C" {
>>  struct netdev_tnl_build_header_params;
>>  #define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC
>>
>> +enum netdev_ol_flags {
>> +    NETDEV_TX_OFFLOAD_IPV4_CKSUM = 1 << 0,
>> +    NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
>> +    NETDEV_TX_OFFLOAD_TCP_TSO = 1 << 2,
>> +};
>> +
>>  /* A network device (e.g. an Ethernet device).
>>   *
>>   * Network device implementations may read these members but should not modify
>> @@ -51,6 +57,9 @@ struct netdev {
>>       * opening this device, and therefore got assigned to the "system" class */
>>      bool auto_classified;
>>
>> +    /* A bitmask of the offloading features enabled by the netdev. */
>> +    uint64_t ol_flags;
>> +
>>      /* If this is 'true', the user explicitly specified an MTU for this
>>       * netdev.  Otherwise, Open vSwitch is allowed to override it. */
>>      bool mtu_user_config;
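The enum above is meant to be filled in by each netdev implementation, and
netdev_send_prepare_packet() in the next hunk checks these bits per packet.
As a hedged sketch of how a provider might advertise them at construction
time (example_construct and device_supports_tso are invented names, not
functions from the patch):

    /* Hypothetical provider construct() excerpt: advertise only the TX
     * offloads the underlying device can actually perform; packets that
     * need more are then dropped in netdev_send_prepare_batch(). */
    static int
    example_construct(struct netdev *netdev)
    {
        if (userspace_tso_enabled() && device_supports_tso(netdev)) {
            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO
                                | NETDEV_TX_OFFLOAD_TCP_CKSUM
                                | NETDEV_TX_OFFLOAD_IPV4_CKSUM;
        }
        return 0;
    }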
>> diff --git a/lib/netdev.c b/lib/netdev.c
>> index 405c98c68..f95b19af4 100644
>> --- a/lib/netdev.c
>> +++ b/lib/netdev.c
>> @@ -66,6 +66,8 @@ COVERAGE_DEFINE(netdev_received);
>>  COVERAGE_DEFINE(netdev_sent);
>>  COVERAGE_DEFINE(netdev_add_router);
>>  COVERAGE_DEFINE(netdev_get_stats);
>> +COVERAGE_DEFINE(netdev_send_prepare_drops);
>> +COVERAGE_DEFINE(netdev_push_header_drops);
>>
>>  struct netdev_saved_flags {
>>      struct netdev *netdev;
>> @@ -782,6 +784,54 @@ netdev_get_pt_mode(const struct netdev *netdev)
>>              : NETDEV_PT_LEGACY_L2);
>>  }
>>
>> +/* Check if a 'packet' is compatible with 'netdev_flags'.
>> + * If a packet is incompatible, return 'false' with the 'errormsg'
>> + * pointing to a reason. */
>> +static bool
>> +netdev_send_prepare_packet(const uint64_t netdev_flags,
>> +                           struct dp_packet *packet, char **errormsg)
>> +{
>> +    if (dp_packet_hwol_is_tso(packet)
>> +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
>> +        /* Fall back to GSO in software. */
>> +        VLOG_ERR_BUF(errormsg, "No TSO support");
>> +        return false;
>> +    }
>> +
>> +    if (dp_packet_hwol_l4_mask(packet)
>> +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) {
>> +        /* Fall back to L4 csum in software. */
>> +        VLOG_ERR_BUF(errormsg, "No L4 checksum support");
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +/* Check if each packet in 'batch' is compatible with 'netdev' features,
>> + * otherwise either fall back to software implementation or drop it. */
>> +static void
>> +netdev_send_prepare_batch(const struct netdev *netdev,
>> +                          struct dp_packet_batch *batch)
>> +{
>> +    struct dp_packet *packet;
>> +    size_t i, size = dp_packet_batch_size(batch);
>> +
>> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
>> +        char *errormsg = NULL;
>> +
>> +        if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
>> +            dp_packet_batch_refill(batch, packet, i);
>> +        } else {
>> +            dp_packet_delete(packet);
>> +            COVERAGE_INC(netdev_send_prepare_drops);
>> +            VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
>> +                         netdev_get_name(netdev), errormsg);
>> +            free(errormsg);
>> +        }
>> +    }
>> +}
>> +
>>  /* Sends 'batch' on 'netdev'.  Returns 0 if successful (for every packet),
>>   * otherwise a positive errno value.  Returns EAGAIN without blocking if
>>   * at least one of the packets cannot be queued immediately.  Returns EMSGSIZE
>> @@ -811,8 +861,14 @@ int
>>  netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
>>              bool concurrent_txq)
>>  {
>> -    int error = netdev->netdev_class->send(netdev, qid, batch,
>> -                                           concurrent_txq);
>> +    int error;
>> +
>> +    netdev_send_prepare_batch(netdev, batch);
>> +    if (OVS_UNLIKELY(dp_packet_batch_is_empty(batch))) {
>> +        return 0;
>> +    }
>> +
>> +    error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
>>      if (!error) {
>>          COVERAGE_INC(netdev_sent);
>>      }
>> @@ -878,9 +934,21 @@ netdev_push_header(const struct netdev *netdev,
>>                     const struct ovs_action_push_tnl *data)
>>  {
>>      struct dp_packet *packet;
>> -    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>> -        netdev->netdev_class->push_header(netdev, packet, data);
>> -        pkt_metadata_init(&packet->md, data->out_port);
>> +    size_t i, size = dp_packet_batch_size(batch);
>> +
>> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
>> +        if (OVS_UNLIKELY(dp_packet_hwol_is_tso(packet)
>> +                         || dp_packet_hwol_l4_mask(packet))) {
>> +            COVERAGE_INC(netdev_push_header_drops);
>> +            dp_packet_delete(packet);
>> +            VLOG_WARN_RL(&rl, "%s: Tunneling packets with HW offload flags is "
>> +                         "not supported: packet dropped",
>> +                         netdev_get_name(netdev));
>> +        } else {
>> +            netdev->netdev_class->push_header(netdev, packet, data);
>> +            pkt_metadata_init(&packet->md, data->out_port);
>> +            dp_packet_batch_refill(batch, packet, i);
>> +        }
>>      }
>>
>>      return 0;
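For anyone unfamiliar with the DP_PACKET_BATCH_REFILL_FOR_EACH idiom used in
both functions above: it rewinds the batch's count and re-adds only the
surviving packets, i.e. an in-place filter with no second buffer. In plain C
terms it boils down to something like the sketch below, where keep() and
drop() are placeholder names for the per-packet check and the
delete-plus-counter path:

    /* Compact an array in place, keeping only elements that pass keep(). */
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (keep(pkts[i])) {
            pkts[kept++] = pkts[i];   /* Survivor slides forward. */
        } else {
            drop(pkts[i]);            /* e.g. dp_packet_delete() + counter. */
        }
    }
    n = kept;                         /* Batch now holds only survivors. */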
>> diff --git a/lib/userspace-tso.c b/lib/userspace-tso.c
>> new file mode 100644
>> index 000000000..6a4a0149b
>> --- /dev/null
>> +++ b/lib/userspace-tso.c
>> @@ -0,0 +1,53 @@
>> +/*
>> + * Copyright (c) 2020 Red Hat, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#include <config.h>
>> +
>> +#include "smap.h"
>> +#include "ovs-thread.h"
>> +#include "openvswitch/vlog.h"
>> +#include "dpdk.h"
>> +#include "userspace-tso.h"
>> +#include "vswitch-idl.h"
>> +
>> +VLOG_DEFINE_THIS_MODULE(userspace_tso);
>> +
>> +static bool userspace_tso = false;
>> +
>> +void
>> +userspace_tso_init(const struct smap *ovs_other_config)
>> +{
>> +    if (smap_get_bool(ovs_other_config, "userspace-tso-enable", false)) {
>> +        static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>> +
>> +        if (ovsthread_once_start(&once)) {
>> +#ifdef DPDK_NETDEV
>> +            VLOG_INFO("Userspace TCP Segmentation Offloading support enabled");
>> +            userspace_tso = true;
>> +#else
>> +            VLOG_WARN("Userspace TCP Segmentation Offloading can not be "
>> +                      "enabled since OVS is built without DPDK support.");
>> +#endif
>> +            ovsthread_once_done(&once);
>> +        }
>> +    }
>> +}
>> +
>> +bool
>> +userspace_tso_enabled(void)
>> +{
>> +    return userspace_tso;
>> +}
>> diff --git a/lib/userspace-tso.h b/lib/userspace-tso.h
>> new file mode 100644
>> index 000000000..0758274c0
>> --- /dev/null
>> +++ b/lib/userspace-tso.h
>> @@ -0,0 +1,23 @@
>> +/*
>> + * Copyright (c) 2020 Red Hat Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#ifndef USERSPACE_TSO_H
>> +#define USERSPACE_TSO_H 1
>> +
>> +void userspace_tso_init(const struct smap *ovs_other_config);
>> +bool userspace_tso_enabled(void);
>> +
>> +#endif /* userspace-tso.h */
>> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
>> index 86c7b10a9..e591c26a6 100644
>> --- a/vswitchd/bridge.c
>> +++ b/vswitchd/bridge.c
>> @@ -65,6 +65,7 @@
>>  #include "system-stats.h"
>>  #include "timeval.h"
>>  #include "tnl-ports.h"
>> +#include "userspace-tso.h"
>>  #include "util.h"
>>  #include "unixctl.h"
>>  #include "lib/vswitch-idl.h"
>> @@ -3285,6 +3286,7 @@ bridge_run(void)
>>      if (cfg) {
>>          netdev_set_flow_api_enabled(&cfg->other_config);
>>          dpdk_init(&cfg->other_config);
>> +        userspace_tso_init(&cfg->other_config);
>>      }
>>
>>      /* Initialize the ofproto library.  This only needs to run once, but
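For anyone wanting to try this once it lands, the knob is an ordinary
Open_vSwitch table setting, e.g.:

    $ ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true

followed by a restart of ovs-vswitchd, since the value is only read at
initialization time (the ovsthread_once guard above makes that explicit).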
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index c43cb1aa4..3ddaaefda 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -690,6 +690,26 @@
>>            once in few hours or a day or a week.
>>          </p>
>>        </column>
>> +      <column name="other_config" key="userspace-tso-enable"
>> +              type='{"type": "boolean"}'>
>> +        <p>
>> +          Set this value to <code>true</code> to enable userspace support for
>> +          TCP Segmentation Offloading (TSO). When it is enabled, the interfaces
>> +          can provide an oversized TCP segment to the datapath and the datapath
>> +          will offload the TCP segmentation and checksum calculation to the
>> +          interfaces when necessary.
>> +        </p>
>> +        <p>
>> +          The default value is <code>false</code>. Changing this value requires
>> +          restarting the daemon.
>> +        </p>
>> +        <p>
>> +          The feature only works if Open vSwitch is built with DPDK support.
>> +        </p>
>> +        <p>
>> +          The feature is considered experimental.
>> +        </p>
>> +      </column>
>>      </group>
>>
>>      <group title="Status">
>>        <column name="next_cfg">

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
