On Tue, Jan 14, 2020 at 06:48:14PM +0100, Ilya Maximets wrote: > On 09.01.2020 15:44, Flavio Leitner wrote: > > Abbreviated as TSO, TCP Segmentation Offload is a feature which enables > > the network stack to delegate the TCP segmentation to the NIC reducing > > the per packet CPU overhead. > > > > A guest using vhostuser interface with TSO enabled can send TCP packets > > much bigger than the MTU, which saves CPU cycles normally used to break > > the packets down to MTU size and to calculate checksums. > > > > It also saves CPU cycles used to parse multiple packets/headers during > > the packet processing inside virtual switch. > > > > If the destination of the packet is another guest in the same host, then > > the same big packet can be sent through a vhostuser interface skipping > > the segmentation completely. However, if the destination is not local, > > the NIC hardware is instructed to do the TCP segmentation and checksum > > calculation. > > > > It is recommended to check if NIC hardware supports TSO before enabling > > the feature, which is off by default. For additional information please > > check the tso.rst document. > > > > Signed-off-by: Flavio Leitner <[email protected]> > > --- > > It seems this patch needs a rebase due to recvmmsg related changes in > netdev-linux.
Yeah, Ian requested that too. I am working on it. > I didn't check the sizes and offsets inside the code and didn't look > close to the features enabling on devices. Some comments inline. OK, thanks for this review anyways. > > Documentation/automake.mk | 1 + > > Documentation/topics/dpdk/index.rst | 1 + > > Documentation/topics/dpdk/tso.rst | 96 +++++++++ > > NEWS | 1 + > > lib/automake.mk | 2 + > > lib/conntrack.c | 29 ++- > > lib/dp-packet.h | 152 +++++++++++++- > > lib/ipf.c | 32 +-- > > lib/netdev-dpdk.c | 312 ++++++++++++++++++++++++---- > > lib/netdev-linux-private.h | 4 + > > lib/netdev-linux.c | 296 +++++++++++++++++++++++--- > > lib/netdev-provider.h | 10 + > > lib/netdev.c | 66 +++++- > > lib/tso.c | 54 +++++ > > lib/tso.h | 23 ++ > > vswitchd/bridge.c | 2 + > > vswitchd/vswitch.xml | 12 ++ > > 17 files changed, 1002 insertions(+), 91 deletions(-) > > create mode 100644 Documentation/topics/dpdk/tso.rst > > create mode 100644 lib/tso.c > > create mode 100644 lib/tso.h > > > > Changelog: > > - v3 > > * Improved the documentation. > > * Updated copyright year to 2020. > > * TSO offloaded msg now includes the netdev's name. > > * Added period at the end of all code comments. > > * Warn and drop encapsulation of TSO packets. > > * Fixed travis issue with restricted virtio types. > > * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf() > > which caused packet corruption. > > * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set > > PKT_TX_IP_CKSUM only for IPv4 packets. > > > > diff --git a/Documentation/automake.mk b/Documentation/automake.mk > > index f2ca17bad..284327edd 100644 > > --- a/Documentation/automake.mk > > +++ b/Documentation/automake.mk > > @@ -35,6 +35,7 @@ DOC_SOURCE = \ > > Documentation/topics/dpdk/index.rst \ > > Documentation/topics/dpdk/bridge.rst \ > > Documentation/topics/dpdk/jumbo-frames.rst \ > > + Documentation/topics/dpdk/tso.rst \ > > Documentation/topics/dpdk/memory.rst \ > > Documentation/topics/dpdk/pdump.rst \ > > Documentation/topics/dpdk/phy.rst \ > > diff --git a/Documentation/topics/dpdk/index.rst > > b/Documentation/topics/dpdk/index.rst > > index f2862ea70..400d56051 100644 > > --- a/Documentation/topics/dpdk/index.rst > > +++ b/Documentation/topics/dpdk/index.rst > > @@ -40,4 +40,5 @@ DPDK Support > > /topics/dpdk/qos > > /topics/dpdk/pdump > > /topics/dpdk/jumbo-frames > > + /topics/dpdk/tso > > /topics/dpdk/memory > > diff --git a/Documentation/topics/dpdk/tso.rst > > b/Documentation/topics/dpdk/tso.rst > > new file mode 100644 > > index 000000000..189c86480 > > --- /dev/null > > +++ b/Documentation/topics/dpdk/tso.rst > > @@ -0,0 +1,96 @@ > > +.. > > + Copyright 2020, Red Hat, Inc. > > + > > + Licensed under the Apache License, Version 2.0 (the "License"); you > > may > > + not use this file except in compliance with the License. You may > > obtain > > + a copy of the License at > > + > > + http://www.apache.org/licenses/LICENSE-2.0 > > + > > + Unless required by applicable law or agreed to in writing, software > > + distributed under the License is distributed on an "AS IS" BASIS, > > WITHOUT > > + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See > > the > > + License for the specific language governing permissions and > > limitations > > + under the License. 
> > + > > + Convention for heading levels in Open vSwitch documentation: > > + > > + ======= Heading 0 (reserved for the title in a document) > > + ------- Heading 1 > > + ~~~~~~~ Heading 2 > > + +++++++ Heading 3 > > + ''''''' Heading 4 > > + > > + Avoid deeper levels because they do not render well. > > + > > +======================== > > +Userspace Datapath - TSO > > +======================== > > + > > +**Note:** This feature is considered experimental. > > + > > +TCP Segmentation Offload (TSO) enables a network stack to delegate > > segmentation > > +of an oversized TCP segment to the underlying physical NIC. Offload of > > frame > > +segmentation achieves computational savings in the core, freeing up CPU > > cycles > > +for more useful work. > > + > > +A common use case for TSO is when using virtualization, where traffic > > that's > > +coming in from a VM can offload the TCP segmentation, thus avoiding the > > +fragmentation in software. Additionally, if the traffic is headed to a VM > > +within the same host further optimization can be expected. As the traffic > > never > > +leaves the machine, no MTU needs to be accounted for, and thus no > > segmentation > > +and checksum calculations are required, which saves yet more cycles. Only > > when > > +the traffic actually leaves the host the segmentation needs to happen, in > > which > > +case it will be performed by the egress NIC. Consult your controller's > > +datasheet for compatibility. Secondly, the NIC must have an associated DPDK > > +Poll Mode Driver (PMD) which supports `TSO`. For a list of features per > > PMD, > > +refer to the `DPDK documentation`__. > > + > > +__ https://doc.dpdk.org/guides/nics/overview.html > > This should point to 19.11 version of a guide instead of latest one: > https://doc.dpdk.org/guides-19.11/nics/overview.html Sounds fair, ok. > > > + > > +Enabling TSO > > +~~~~~~~~~~~~ > > + > > +The TSO support may be enabled via a global config value ``tso-support``. > > +Setting this to ``true`` enables TSO support for all ports. > > + > > + $ ovs-vsctl set Open_vSwitch . other_config:tso-support=true > > > I'd suggest to rename this to 'userspace-tso-support' to avoid > misunderstanding. ok, I will rename. > > + > > +The default value is ``false``. > > + > > +Changing ``tso-support`` requires restarting the daemon. > > + > > +When using :doc:`vHost User ports <vhost-user>`, TSO may be enabled as > > follows. > > + > > +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest > > +connection is established, `TSO` is thus advertised to the guest as an > > +available feature: > > + > > +QEMU Command Line Parameter:: > > + > > + $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \ > > + ... > > + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\ > > + csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\ > > + ... > > + > > +2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can > > be > > +used to enable same:: > > + > > + $ ethtool -K eth0 sg on # scatter-gather is a prerequisite for TSO > > + $ ethtool -K eth0 tso on > > + $ ethtool -k eth0 > > + > > +~~~~~~~~~~~ > > +Limitations > > +~~~~~~~~~~~ > > + > > +The current OvS userspace `TSO` implementation supports flat and VLAN > > networks > > +only (i.e. no support for `TSO` over tunneled connection [VxLAN, GRE, > > IPinIP, > > +etc.]). 
> > + > > +There is no software implementation of TSO, so all ports attached to the > > +datapath must support TSO or packets using that feature will be dropped > > +on ports without TSO support. That also means guests using vhost-user > > +in client mode will receive TSO packet regardless of TSO being enabled > > +or disabled within the guest. > > diff --git a/NEWS b/NEWS > > index 965facaf8..306c0493d 100644 > > --- a/NEWS > > +++ b/NEWS > > @@ -26,6 +26,7 @@ Post-v2.12.0 > > * DPDK ring ports (dpdkr) are deprecated and will be removed in next > > releases. > > * Add support for DPDK 19.11. > > + * Add experimental support for TSO. > > - RSTP: > > * The rstp_statistics column in Port table will only be updated every > > stats-update-interval configured in Open_vSwtich table. > > diff --git a/lib/automake.mk b/lib/automake.mk > > index ebf714501..94a1b4459 100644 > > --- a/lib/automake.mk > > +++ b/lib/automake.mk > > @@ -304,6 +304,8 @@ lib_libopenvswitch_la_SOURCES = \ > > lib/tnl-neigh-cache.h \ > > lib/tnl-ports.c \ > > lib/tnl-ports.h \ > > + lib/tso.c \ > > + lib/tso.h \ > > s/tso/userspace-tso/ Yup > > > lib/netdev-native-tnl.c \ > > lib/netdev-native-tnl.h \ > > lib/token-bucket.c \ > > diff --git a/lib/conntrack.c b/lib/conntrack.c > > index b80080e72..679054b98 100644 > > --- a/lib/conntrack.c > > +++ b/lib/conntrack.c > > @@ -2022,7 +2022,8 @@ conn_key_extract(struct conntrack *ct, struct > > dp_packet *pkt, ovs_be16 dl_type, > > if (hwol_bad_l3_csum) { > > ok = false; > > } else { > > - bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt); > > + bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt) > > + || dp_packet_hwol_tx_ip_checksum(pkt); > > /* Validate the checksum only when hwol is not supported. */ > > ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), > > NULL, > > !hwol_good_l3_csum); > > @@ -2036,7 +2037,8 @@ conn_key_extract(struct conntrack *ct, struct > > dp_packet *pkt, ovs_be16 dl_type, > > if (ok) { > > bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt); > > if (!hwol_bad_l4_csum) { > > - bool hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt); > > + bool hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt) > > + || > > dp_packet_hwol_tx_l4_checksum(pkt); > > /* Validate the checksum only when hwol is not supported. 
*/ > > if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt), > > &ctx->icmp_related, l3, !hwol_good_l4_csum, > > @@ -3237,8 +3239,11 @@ handle_ftp_ctl(struct conntrack *ct, const struct > > conn_lookup_ctx *ctx, > > } > > if (seq_skew) { > > ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew; > > - l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum, > > - l3_hdr->ip_tot_len, > > htons(ip_len)); > > + if (!dp_packet_hwol_tx_ip_checksum(pkt)) { > > + l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum, > > + l3_hdr->ip_tot_len, > > + htons(ip_len)); > > + } > > l3_hdr->ip_tot_len = htons(ip_len); > > } > > } > > @@ -3256,13 +3261,15 @@ handle_ftp_ctl(struct conntrack *ct, const struct > > conn_lookup_ctx *ctx, > > } > > > > th->tcp_csum = 0; > > - if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) { > > - th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto, > > - dp_packet_l4_size(pkt)); > > - } else { > > - uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr); > > - th->tcp_csum = csum_finish( > > - csum_continue(tcp_csum, th, dp_packet_l4_size(pkt))); > > + if (!dp_packet_hwol_tx_l4_checksum(pkt)) { > > + if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) { > > + th->tcp_csum = packet_csum_upperlayer6(nh6, th, > > ctx->key.nw_proto, > > + dp_packet_l4_size(pkt)); > > + } else { > > + uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr); > > + th->tcp_csum = csum_finish( > > + csum_continue(tcp_csum, th, dp_packet_l4_size(pkt))); > > + } > > } > > > > if (seq_skew) { > > diff --git a/lib/dp-packet.h b/lib/dp-packet.h > > index 133942155..d10a0416e 100644 > > --- a/lib/dp-packet.h > > +++ b/lib/dp-packet.h > > @@ -114,6 +114,8 @@ static inline void dp_packet_set_size(struct dp_packet > > *, uint32_t); > > static inline uint16_t dp_packet_get_allocated(const struct dp_packet *); > > static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t); > > > > +void dp_packet_prepend_vnet_hdr(struct dp_packet *, int mtu); > > No such function. Good catch, it got moved to netdev_linux_prepend_vnet_hdr(). > > + > > void *dp_packet_resize_l2(struct dp_packet *, int increment); > > void *dp_packet_resize_l2_5(struct dp_packet *, int increment); > > static inline void *dp_packet_eth(const struct dp_packet *); > > @@ -456,7 +458,7 @@ dp_packet_init_specific(struct dp_packet *p) > > { > > /* This initialization is needed for packets that do not come from DPDK > > * interfaces, when vswitchd is built with --with-dpdk. */ > > - p->mbuf.tx_offload = p->mbuf.packet_type = 0; > > + p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0; > > p->mbuf.nb_segs = 1; > > p->mbuf.next = NULL; > > } > > @@ -519,6 +521,80 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t > > s) > > b->mbuf.buf_len = s; > > } > > > > +static inline bool > > +dp_packet_hwol_is_tso(const struct dp_packet *b) > > +{ > > + return (b->mbuf.ol_flags & (PKT_TX_TCP_SEG | PKT_TX_L4_MASK)) > > + ? true > > + : false; > > Usual way for converting to bool is to use '!!'. This will save some space. Ok will change that and the other similar cases. > > +static inline void > > +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b) { > > '{' should be on the next line. Same for all the functions below. > > And some comments to these functions would be nice. At least a single > comment for a group of functions. Ok. > Some comments for below functions too. Ok. > > > +static inline bool > > +dp_packet_hwol_tx_ip_checksum(const struct dp_packet *p) > > +{ > > + > > + return dp_packet_hwol_l4_mask(p) ? true : false; > > '!!'? 
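Ack, these get the same '!!' treatment, e.g. (keeping the l4-mask check
itself as-is here, since that's your next point):

    static inline bool
    dp_packet_hwol_tx_ip_checksum(const struct dp_packet *p)
    {
        /* Same check as above, just written as a plain bool conversion. */
        return !!dp_packet_hwol_l4_mask(p);
    }
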
> Also, it seems strange to check l4 offloading mask to check if > we have ip checksum. Shouldn't we check for PKT_TX_IPV4/6 > instead? This might not work for pure IP packets (without L4). I remember there were cases where the flag was not set. I don't have my notes handy now, but I will double check later. [...] > > @@ -2097,11 +2184,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, > > int qid, > > struct rte_mbuf **pkts, int cnt) > > { > > uint32_t nb_tx = 0; > > + uint16_t nb_tx_prep = cnt; > > + > > + if (tso_enabled()) { > > + nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt); > > Packets dropped and not counted. It returns cnt (total) - nb_tx (transmitted), which the caller will account as tx failures. > > > + if (nb_tx_prep != cnt) { > > + VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. " > > + "Only %u/%u are valid: %s", dev->up.name, > > nb_tx_prep, > > + cnt, rte_strerror(rte_errno)); > > + } > > + } > > > > - while (nb_tx != cnt) { > > + while (nb_tx != nb_tx_prep) { > > uint32_t ret; > > > > - ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - > > nb_tx); > > + ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, > > + nb_tx_prep - nb_tx); > > if (!ret) { > > break; > > } > > @@ -2386,11 +2484,14 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk > > *dev, struct rte_mbuf **pkts, > > int cnt = 0; > > struct rte_mbuf *pkt; > > > > + /* Filter oversized packets, unless are marked for TSO. */ > > for (i = 0; i < pkt_cnt; i++) { > > pkt = pkts[i]; > > - if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) { > > - VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " max_packet_len > > %d", > > - dev->up.name, pkt->pkt_len, dev->max_packet_len); > > + if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len) > > + && !(pkt->ol_flags & PKT_TX_TCP_SEG))) { > > + VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " " > > + "max_packet_len %d", dev->up.name, pkt->pkt_len, > > + dev->max_packet_len); > > rte_pktmbuf_free(pkt); > > continue; > > } > > @@ -2442,7 +2543,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int > > qid, > > struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts; > > struct netdev_dpdk_sw_stats sw_stats_add; > > unsigned int n_packets_to_free = cnt; > > - unsigned int total_packets = cnt; > > + unsigned int total_packets; > > int i, retries = 0; > > int max_retries = VHOST_ENQ_RETRY_MIN; > > int vid = netdev_dpdk_get_vid(dev); > > @@ -2462,7 +2563,8 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int > > qid, > > rte_spinlock_lock(&dev->tx_q[qid].tx_lock); > > } > > > > - cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt); > > + total_packets = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt); > > lib/daemon-unix.c Have you checked the performance > impact for non-TSO setup? I did a light test and noticed no difference. It only checks for two flags in the same field for the non-TSO case. 
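To make the "two flags" bit concrete: for non-TSO packets the extra
per-packet cost in the send path is essentially one ol_flags test, along
the lines of the sketch below (a hypothetical simplification of the early
exit in netdev_dpdk_prep_hwol_packet(), not the exact code):

    #include <stdbool.h>
    #include <rte_mbuf.h>

    /* Sketch: true if the mbuf needs any HW offload preparation.
     * PKT_TX_L4_MASK and PKT_TX_TCP_SEG both live in ol_flags, so the
     * non-TSO fast path is a single load and mask per packet. */
    static inline bool
    dpdk_pkt_needs_hwol_prep(const struct rte_mbuf *mbuf)
    {
        return !!(mbuf->ol_flags & (PKT_TX_L4_MASK | PKT_TX_TCP_SEG));
    }

which matches not seeing any difference in the light test.
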
> > > + cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, total_packets); > > sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt; > > > > /* Check has QoS has been configured for the netdev */ > > @@ -2511,6 +2613,121 @@ out: > > } > > } > > > > +static void > > +netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque) > > +{ > > + rte_free(opaque); > > +} > > + > > +static struct rte_mbuf * > > +dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len) > > +{ > > + uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len; > > + struct rte_mbuf_ext_shared_info *shinfo = NULL; > > + uint16_t buf_len; > > + void *buf; > > + > > + if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo)) { > > + shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *); > > + } else { > > + total_len += sizeof(*shinfo) + sizeof(uintptr_t); > > + total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t)); > > + } > > + > > + if (unlikely(total_len > UINT16_MAX)) { > > + VLOG_ERR("Can't copy packet: too big %u", total_len); > > + return NULL; > > + } > > + > > + buf_len = total_len; > > + buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE); > > + if (unlikely(buf == NULL)) { > > + VLOG_ERR("Failed to allocate memory using rte_malloc: %u", > > buf_len); > > + return NULL; > > + } > > + > > + /* Initialize shinfo. */ > > + if (shinfo) { > > + shinfo->free_cb = netdev_dpdk_extbuf_free; > > + shinfo->fcb_opaque = buf; > > + rte_mbuf_ext_refcnt_set(shinfo, 1); > > + } else { > > + shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len, > > + > > netdev_dpdk_extbuf_free, > > + buf); > > + if (unlikely(shinfo == NULL)) { > > + rte_free(buf); > > + VLOG_ERR("Failed to initialize shared info for mbuf while " > > + "attempting to attach an external buffer."); > > + return NULL; > > + } > > + } > > + > > + rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len, > > + shinfo); > > + rte_pktmbuf_reset_headroom(pkt); > > + > > + return pkt; > > +} > > + > > +static struct rte_mbuf * > > +dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len) > > +{ > > + struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp); > > + > > + if (OVS_UNLIKELY(!pkt)) { > > + return NULL; > > + } > > + > > + dp_packet_init_specific((struct dp_packet *)pkt); > > + if (rte_pktmbuf_tailroom(pkt) >= data_len) { > > + return pkt; > > + } > > + > > + if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) { > > + return pkt; > > + } > > + > > + rte_pktmbuf_free(pkt); > > + > > + return NULL; > > +} > > + > > +static struct dp_packet * > > +dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet > > *pkt_orig) > > +{ > > + struct rte_mbuf *mbuf_dest; > > + struct dp_packet *pkt_dest; > > + uint32_t pkt_len; > > + > > + pkt_len = dp_packet_size(pkt_orig); > > + mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len); > > + if (OVS_UNLIKELY(mbuf_dest == NULL)) { > > + return NULL; > > + } > > + > > + pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf); > > + memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len); > > + dp_packet_set_size(pkt_dest, pkt_len); > > + > > + mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload; > > + mbuf_dest->packet_type = pkt_orig->mbuf.packet_type; > > + mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags & > > + ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF)); > > + > > + memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size, > > + sizeof(struct dp_packet) - offsetof(struct dp_packet, > > l2_pad_size)); > > + > > + if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) { > > + mbuf_dest->l2_len = (char 
*)dp_packet_l3(pkt_dest) > > + - (char *)dp_packet_eth(pkt_dest); > > + mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest) > > + - (char *) dp_packet_l3(pkt_dest); > > + } > > + > > + return pkt_dest; > > +} > > + > > /* Tx function. Transmit packets indefinitely */ > > static void > > dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch > > *batch) > > @@ -2524,7 +2741,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, > > struct dp_packet_batch *batch) > > enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST }; > > #endif > > struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > > - struct rte_mbuf *pkts[PKT_ARRAY_SIZE]; > > + struct dp_packet *pkts[PKT_ARRAY_SIZE]; > > struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats; > > uint32_t cnt = batch_cnt; > > uint32_t dropped = 0; > > @@ -2545,34 +2762,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, > > struct dp_packet_batch *batch) > > struct dp_packet *packet = batch->packets[i]; > > uint32_t size = dp_packet_size(packet); > > > > - if (OVS_UNLIKELY(size > dev->max_packet_len)) { > > - VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", > > - size, dev->max_packet_len); > > - > > + if (size > dev->max_packet_len > > + && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) { > > + VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size, > > + dev->max_packet_len); > > mtu_drops++; > > continue; > > } > > > > - pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp); > > + pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, > > packet); > > if (OVS_UNLIKELY(!pkts[txcnt])) { > > dropped = cnt - i; > > break; > > } > > > > - /* We have to do a copy for now */ > > - memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *), > > - dp_packet_data(packet), size); > > - dp_packet_set_size((struct dp_packet *)pkts[txcnt], size); > > - > > txcnt++; > > } > > > > if (OVS_LIKELY(txcnt)) { > > if (dev->type == DPDK_DEV_VHOST) { > > - __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet **) > > pkts, > > - txcnt); > > + __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt); > > } else { > > - tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, txcnt); > > + tx_failure += netdev_dpdk_eth_tx_burst(dev, qid, > > + (struct rte_mbuf > > **)pkts, > > + txcnt); > > } > > } > > > > @@ -2630,6 +2843,7 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid, > > int batch_cnt = dp_packet_batch_size(batch); > > struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets; > > > > + batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt); > > Packets dropped and not counted. Right. > Also, this function called unconditionally. Perfomance impact on non-TSO case? It's the same as above. > > tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt); > > mtu_drops = batch_cnt - tx_cnt; > > qos_drops = tx_cnt; [...] > > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h > > index f08159aa7..102548db7 100644 > > --- a/lib/netdev-linux-private.h > > +++ b/lib/netdev-linux-private.h > > @@ -37,10 +37,14 @@ > > > > struct netdev; > > > > +#define LINUX_RXQ_TSO_MAX_LEN 65536 > > + > > struct netdev_rxq_linux { > > struct netdev_rxq up; > > bool is_tap; > > int fd; > > + char *bufaux; /* Extra buffer to recv TSO pkt. */ > > + int bufaux_len; /* Extra buffer length. */ > > Length never changes. Why we need 'bufaux_len' ? I have no strong opinion, so either way is fine by me. [...] 
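FWIW, dropping it would just mean keeping the constant on the read side; a
minimal sketch of how the hunk above would look, assuming the buffer is
always LINUX_RXQ_TSO_MAX_LEN bytes when allocated:

    #define LINUX_RXQ_TSO_MAX_LEN 65536

    struct netdev_rxq_linux {
        struct netdev_rxq up;
        bool is_tap;
        int fd;
        char *bufaux;        /* LINUX_RXQ_TSO_MAX_LEN bytes to recv TSO
                              * pkt, NULL when TSO is disabled. */
    };

with the receive path then using LINUX_RXQ_TSO_MAX_LEN directly instead of
rx->bufaux_len.
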
> > @@ -1024,6 +1040,13 @@ static struct netdev_rxq * > > netdev_linux_rxq_alloc(void) > > { > > struct netdev_rxq_linux *rx = xzalloc(sizeof *rx); > > + if (tso_enabled()) { > > + rx->bufaux = xmalloc(LINUX_RXQ_TSO_MAX_LEN); > > + if (rx->bufaux) { > > xmalloc can not fail. Old habits, will fix that. > > + rx->bufaux_len = LINUX_RXQ_TSO_MAX_LEN; > > + } > > + } > > + > > return &rx->up; > > } > > > > @@ -1069,6 +1092,17 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_) > > goto error; > > } > > > > + if (tso_enabled()) { > > + error = setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val, > > + sizeof val); > > + if (error) { > > You're not using the 'error'. Make it just "if (setsockopt()) {". I agree. > > > + error = errno; > > + VLOG_ERR("%s: failed to enable vnet hdr in txq raw socket: > > %s", > > + netdev_get_name(netdev_), ovs_strerror(errno)); > > + goto error; > > + } > > + } > > + > > /* Set non-blocking mode. */ > > error = set_nonblocking(rx->fd); > > if (error) { > > @@ -6173,6 +6275,19 @@ af_packet_sock(void) > > close(sock); > > sock = -error; > > } > > + > > + if (tso_enabled()) { > > + int val = 1; > > + error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val, > > + sizeof val); > > socket might be already closed here and there will be double close. Good catch, thanks. [...] > > @@ -51,6 +57,10 @@ struct netdev { > > * opening this device, and therefore got assigned to the "system" > > class */ > > bool auto_classified; > > > > + /* This bitmask of the offloading features enabled/supported by the > > + * supported by the netdev. */ > > So, enabled or supported? Please, choose one. I see, will fix that. > > > + uint64_t ol_flags; > > + > > /* If this is 'true', the user explicitly specified an MTU for this > > * netdev. Otherwise, Open vSwitch is allowed to override it. */ > > bool mtu_user_config; > > diff --git a/lib/netdev.c b/lib/netdev.c > > index 405c98c68..998525875 100644 > > --- a/lib/netdev.c > > +++ b/lib/netdev.c > > @@ -782,6 +782,52 @@ netdev_get_pt_mode(const struct netdev *netdev) > > : NETDEV_PT_LEGACY_L2); > > } > > > > +/* Check if a 'packet' is compatible with 'netdev_flags'. > > + * If a packet is incompatible, return 'false' with the 'errormsg' > > + * pointing to a reason. */ > > +static bool > > +netdev_send_prepare_packet(const uint64_t netdev_flags, > > + struct dp_packet *packet, char **errormsg) > > It's better to use VLOG_ERR_BUF instead of passing char **. > Not an OVS style. Sounds good to me. > > +netdev_send_prepare_batch(const struct netdev *netdev, > > + struct dp_packet_batch *batch) > > +{ > > + struct dp_packet *packet; > > + size_t i, size = dp_packet_batch_size(batch); > > + > > + DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) { > > + char *errormsg = NULL; > > + > > + if (netdev_send_prepare_packet(netdev->ol_flags, packet, > > &errormsg)) { > > + dp_packet_batch_refill(batch, packet, i); > > + } else { > > + VLOG_WARN_RL(&rl, "%s: Packet dropped: %s", > > + errormsg ? errormsg : "Unsupported feature", > > + netdev_get_name(netdev)); > > Hmm, it seems that packet will be leaked here. > > Also, you're dropping packet without accounting it anyhow. We merged few > patches about counting dropped packets recently, so, now we should add new > counters for each dropping case in order to keep things consistent. Right, will fix that too. > > + } > > + } > > +} > > + > > /* Sends 'batch' on 'netdev'. Returns 0 if successful (for every packet), > > * otherwise a positive errno value. 
Returns EAGAIN without blocking if > > * at least one the packets cannot be queued immediately. Returns EMSGSIZE > > @@ -811,8 +857,10 @@ int > > netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch, > > bool concurrent_txq) > > { > > - int error = netdev->netdev_class->send(netdev, qid, batch, > > - concurrent_txq); > > + int error; > > + > > + netdev_send_prepare_batch(netdev, batch); > > send() doesn't expect empty batches. You need to check before calling it. OK. > > + error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq); > > if (!error) { > > COVERAGE_INC(netdev_sent); > > } > > @@ -878,9 +926,17 @@ netdev_push_header(const struct netdev *netdev, > > const struct ovs_action_push_tnl *data) > > { > > struct dp_packet *packet; > > - DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { > > - netdev->netdev_class->push_header(netdev, packet, data); > > - pkt_metadata_init(&packet->md, data->out_port); > > + size_t i, size = dp_packet_batch_size(batch); > > + > > + DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) { > > + if (!dp_packet_hwol_is_tso(packet)) { > > + netdev->netdev_class->push_header(netdev, packet, data); > > + pkt_metadata_init(&packet->md, data->out_port); > > + dp_packet_batch_refill(batch, packet, i); > > + } else { > > + VLOG_WARN_RL(&rl, "%s: Tunneling of TSO packet is not > > supported: " > > + "packet dropped", netdev_get_name(netdev)); > > Packet leaked. Drop not counted. Yeah, same as above. > > + } > > } > > > > return 0; > > diff --git a/lib/tso.c b/lib/tso.c > > new file mode 100644 > > index 000000000..9dc15e146 > > --- /dev/null > > +++ b/lib/tso.c > > @@ -0,0 +1,54 @@ > > +/* > > + * Copyright (c) 2020 Red Hat, Inc. > > + * > > + * Licensed under the Apache License, Version 2.0 (the "License"); > > + * you may not use this file except in compliance with the License. > > + * You may obtain a copy of the License at: > > + * > > + * http://www.apache.org/licenses/LICENSE-2.0 > > + * > > + * Unless required by applicable law or agreed to in writing, software > > + * distributed under the License is distributed on an "AS IS" BASIS, > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > > + * See the License for the specific language governing permissions and > > + * limitations under the License. > > + */ > > + > > +#include <config.h> > > + > > +#include "smap.h" > > +#include "ovs-thread.h" > > +#include "openvswitch/vlog.h" > > +#include "dpdk.h" > > +#include "tso.h" > > +#include "vswitch-idl.h" > > + > > +VLOG_DEFINE_THIS_MODULE(tso); > > userspace_tso > > > + > > +static bool tso_support_enabled = false; > > + > > +void > > +tso_init(const struct smap *ovs_other_config) > > userspace_tso_init > > > +{ > > + if (smap_get_bool(ovs_other_config, "tso-support", false)) { > > + static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER; > > + > > + if (ovsthread_once_start(&once)) { > > + if (dpdk_available()) { > > + VLOG_INFO("TCP Segmentation Offloading (TSO) support > > enabled"); > > + tso_support_enabled = true; > > + } else { > > + VLOG_ERR("TCP Segmentation Offloading (TSO) is unsupported > > " > > + "without enabling DPDK"); > > This is an artificial restriction. But we may remove it later. 
> > > + tso_support_enabled = false; > > + } > > + ovsthread_once_done(&once); > > + } > > + } > > +} > > + > > +bool > > +tso_enabled(void) > > userspace_tso_enabled() > > > +{ > > + return tso_support_enabled; > > +} > > diff --git a/lib/tso.h b/lib/tso.h > > new file mode 100644 > > index 000000000..6594496ac > > --- /dev/null > > +++ b/lib/tso.h > > @@ -0,0 +1,23 @@ > > +/* > > + * Copyright (c) 2020 Red Hat Inc. > > + * > > + * Licensed under the Apache License, Version 2.0 (the "License"); > > + * you may not use this file except in compliance with the License. > > + * You may obtain a copy of the License at: > > + * > > + * http://www.apache.org/licenses/LICENSE-2.0 > > + * > > + * Unless required by applicable law or agreed to in writing, software > > + * distributed under the License is distributed on an "AS IS" BASIS, > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > > + * See the License for the specific language governing permissions and > > + * limitations under the License. > > + */ > > + > > +#ifndef TSO_H > > +#define TSO_H 1 > > + > > +void tso_init(const struct smap *ovs_other_config); > > +bool tso_enabled(void); > > + > > +#endif /* tso.h */ > > diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c > > index 86c7b10a9..6d73922f6 100644 > > --- a/vswitchd/bridge.c > > +++ b/vswitchd/bridge.c > > @@ -65,6 +65,7 @@ > > #include "system-stats.h" > > #include "timeval.h" > > #include "tnl-ports.h" > > +#include "tso.h" > > #include "util.h" > > #include "unixctl.h" > > #include "lib/vswitch-idl.h" > > @@ -3285,6 +3286,7 @@ bridge_run(void) > > if (cfg) { > > netdev_set_flow_api_enabled(&cfg->other_config); > > dpdk_init(&cfg->other_config); > > + tso_init(&cfg->other_config); > > } > > > > /* Initialize the ofproto library. This only needs to run once, but > > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml > > index 0ec726c39..354dcabfa 100644 > > --- a/vswitchd/vswitch.xml > > +++ b/vswitchd/vswitch.xml > > @@ -690,6 +690,18 @@ > > once in few hours or a day or a week. > > </p> > > </column> > > + <column name="other_config" key="tso-support" > > + type='{"type": "boolean"}'> > > + <p> > > + Set this value to <code>true</code> to enable support for TSO > > (TCP > > + Segmentation Offloading). When TSO is enabled, vhost-user client > > Why we're talking about vhost-user interfaces here? Physical DPDK ports and > netdev-linux will be able too. Sounds strange. Yeah, that comment can be improved now. Thanks Ilya! -- fbl _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
