Flavio, thank you so much for clarification, I'll push "Enable VXLAN TSO for 
DPDK datapath" first, replies for your comments inline, please check them in 
the later part.

-----邮件原件-----
发件人: dev [mailto:ovs-dev-boun...@openvswitch.org] 代表 Flavio Leitner
发送时间: 2020年11月24日 3:10
收件人: yang_y_yi <yang_y...@163.com>
抄送: ovs-dev@openvswitch.org
主题: Re: [ovs-dev] [PATCH V3 1/4] Enable VXLAN TSO for DPDK datapath


Hi Yi,

On Mon, Nov 02, 2020 at 11:16:49AM +0800, yang_y_yi wrote:
> 
> 
> Thanks a lot, Flavio, please check inline comments for more discussion.
> 
> 
> 
> At 2020-10-31 01:55:57, "Flavio Leitner" <f...@sysclose.org> wrote:
> >
> >Hi Yi,
> >
> >Thanks for the patch and sorry the delay to review it.
> >See my comments in line.
> >
> >Thanks,
> >fbl
> >
> >
> >On Fri, Aug 07, 2020 at 06:56:45PM +0800, yang_y...@163.com wrote:
> >> From: Yi Yang <yangy...@inspur.com>
> >> 
> >> Many NICs can support VXLAN TSO which can help improve 
> >> across-compute-node VM-to-VM performance in case that MTU is set to 
> >> 1500.
> >> 
> >> This patch allows dpdkvhostuserclient interface and veth/tap 
> >> interface to leverage NICs' offload capability to maximize 
> >> across-compute-node TCP performance, with it applied, OVS DPDK can 
> >> reach linespeed for across-compute-node VM-to-VM TCP performance.
> >> 
> >> Signed-off-by: Yi Yang <yangy...@inspur.com>
> >> ---
> >>  lib/dp-packet.h       |  76 ++++++++++++++++++++
> >>  lib/netdev-dpdk.c     | 188 
> >> ++++++++++++++++++++++++++++++++++++++++++++++----
> >>  lib/netdev-linux.c    |  20 ++++++
> >>  lib/netdev-provider.h |   1 +
> >>  lib/netdev.c          |  69 ++++++++++++++++--
> >>  5 files changed, 338 insertions(+), 16 deletions(-)
> >> 
> >> diff --git a/lib/dp-packet.h b/lib/dp-packet.h index 
> >> 0430cca..79895f2 100644
> >> --- a/lib/dp-packet.h
> >> +++ b/lib/dp-packet.h
> >> @@ -81,6 +81,8 @@ enum dp_packet_offload_mask {
> >>      DEF_OL_FLAG(DP_PACKET_OL_TX_UDP_CKSUM, PKT_TX_UDP_CKSUM, 0x400),
> >>      /* Offload SCTP checksum. */
> >>      DEF_OL_FLAG(DP_PACKET_OL_TX_SCTP_CKSUM, PKT_TX_SCTP_CKSUM, 
> >> 0x800),
> >> +    /* VXLAN TCP Segmentation Offload. */
> >> +    DEF_OL_FLAG(DP_PACKET_OL_TX_TUNNEL_VXLAN, PKT_TX_TUNNEL_VXLAN, 
> >> + 0x1000),
> >>      /* Adding new field requires adding to 
> >> DP_PACKET_OL_SUPPORTED_MASK. */  };
> >>  
> >> @@ -1032,6 +1034,80 @@ dp_packet_hwol_set_tcp_seg(struct dp_packet *b)
> >>      *dp_packet_ol_flags_ptr(b) |= DP_PACKET_OL_TX_TCP_SEG;  }
> >>  
> >> +#ifdef DPDK_NETDEV
> >> +/* Mark packet 'b' for VXLAN TCP segmentation offloading. */ 
> >> +static inline void dp_packet_hwol_set_vxlan_tcp_seg(struct 
> >> +dp_packet *b) {
> >> +    b->mbuf.ol_flags |= DP_PACKET_OL_TX_TUNNEL_VXLAN;
> >> +    b->mbuf.l2_len += sizeof(struct udp_header) +
> >> +                      sizeof(struct vxlanhdr);
> >
> >
> >What about L3 length?
> 
> For tunnel offload, l2_len must be original l2_len plus vxlan and udp header 
> size, l3_len is still be inner l3_len.

Ok, I see now that the patch requires DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM. 
That feature is not very common, but might be fine to start with it,
and if needed add extra support for segmenting inner TCP only.

[Yi Yang] all the NICS which support TUNNEL offload can support 
DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM, DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM will not be 
used if no TUNNEL offload feature is available. For TUNNEL TCP segmenting, it 
is also necessary, so it is a must-have feature if it can support TUNNEL 
offload.

> >
> >> +    b->mbuf.outer_l2_len = ETH_HEADER_LEN;
> >> +    b->mbuf.outer_l3_len = IP_HEADER_LEN;
> >
> >What about IPv6?
> 
> Good catch, we need to care outer ipv6 case. I'll split it to a single 
> function dp_packet_hwol_set_outer_l2_len & dp_packet_hwol_set_l3_len to 
> handle this.

Ok.

> >
> >> +}
> >> +
> >> +/* Check if it is a VXLAN packet */
> >> +static inline bool
> >> +dp_packet_hwol_is_vxlan_tcp_seg(struct dp_packet *b)
> >> +{
> >> +    return (b->mbuf.ol_flags & DP_PACKET_OL_TX_TUNNEL_VXLAN);
> >
> >
> >Please use dp_packet_ol_flags_ptr()
> 
> Ok, will use use dp_packet_ol_flags_ptr() in new version.
> 
> >
> >> +}
> >> +
> >> +/* Set l2_len for the packet 'b' */
> >> +static inline void
> >> +dp_packet_hwol_set_l2_len(struct dp_packet *b, int l2_len)
> >> +{
> >> +    b->mbuf.l2_len = l2_len;
> >> +}
> >
> >This function is only called by Linux in the ingress
> >path before the data processing, so it shouldn't set
> >any value other than the ones related to the iface
> >offloading at this point. Also that the data path can
> >change the packet and there is nothing to update
> >those lengths.
> 
> Does "Linux" mean "system interfaces"?, we need to use l2_len, but I saw 
> l2_len isn't set in some cases, so added this function.

Yes, system interfaces. See more details below.


> >In the egress path it calls netdev_dpdk_prep_hwol_packet()
> >to update those fields.
> 
> If output port is system interfaces (veth or tap), 
> netdev_dpdk_prep_hwol_packet() won't be called.

If the mbuf needs to be copied then you're right.
That's a bug which needs a separate patch. Let me know
if you want to do it, or if I could do it. I have a
refactoring in progress, but either way is fine by me.

[Yi Yang] I'll send " Enable VXLAN TSO for DPDK datapath" separately, that 
should be acceptable to you reviewer, I know there are some disputes about 
multi-segmented mbuf, I'll send GRO and GSO patches after " Enable VXLAN TSO 
for DPDK datapath", this seems a better way to go forward.

Please continue your refactoring work, I'll fix merge conflict by then.

> >> +
> >> +/* Set l3_len for the packet 'b' */
> >> +static inline void
> >> +dp_packet_hwol_set_l3_len(struct dp_packet *b, int l3_len)
> >> +{
> >> +    b->mbuf.l3_len = l3_len;
> >> +}
> >
> >The same comment above is valid here.
> 
> In some cases, we do need to set l3_len if it isn't set correctly.
> 
> >
> >
> >> +
> >> +/* Set l4_len for the packet 'b' */
> >> +static inline void
> >> +dp_packet_hwol_set_l4_len(struct dp_packet *b, int l4_len)
> >> +{
> >> +    b->mbuf.l4_len = l4_len;
> >> +}
> >
> >
> >And here.
> 
> DPDK offload API will use l4_len to handle something but it isn't set in 
> vhost/virtio, so bring it in to set l4_len correctly.

Of course, but that's not the issue. OVS uses the struct dp_packet
fields to store the offsets. Therefore OVS uses APIs like for
example dp_packet_l3() to get the length:

        mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
                                - (char *)dp_packet_eth(pkt_dest);
        mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
                                - (char *) dp_packet_l3(pkt_dest);

The OVS works like this (simplified):
    [ recv func ] -> [ datapath processing ] -> [ send func ]

This patch is setting those mbufs fields at [ recv func ], but
the datapath processing can change the packet which is not aware
of those mbuf fields at all. So, the fields the patch is setting are
not valid anymore. Later, the [ send func ] will use those values,
as you correctly explained the reason, but they could be outdated values.

Since the datapath processing tries to be agnostic and doesn't change
those mbuf fields, simple packet forwarding will work without issues.
It works because the original values are still correct. However, if
the datapath pushes/pops a vlan header, for example, or do any operation
in the packet that requires updating any of those values, then [ send
func ] will find the wrong values. See eth_push_vlan() as an example.

Today the TSO works by setting the offloading flags at [ recv func ],
so we know for example if, in the case of Linux system devices, the GSO
with TCPv4 was used. The datapath needs to know otherwise during
packet processing the csum will not match, etc, otherwise it is
transparent.  Later, when the packet is going to [ send func ], it uses
struct dp_packet fields plus the offloading flags to set up the packet as
required: filling struct virtio_net_hdr [netdev_linux_prepend_vnet_hdr()]
or mbuf length fields [netdev_dpdk_prep_hwol_batch()].

[...]

[Yi Yang] Got it, we can add parse_header function in 
netdev_dpdk_prep_hwol_packet to reparse header
, for GRO, it is still necessary to parse header in recv function.

> >> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >> index 44ebf96..30493ed 100644
> >> --- a/lib/netdev-dpdk.c
> >> +++ b/lib/netdev-dpdk.c
[...]
> >> @@ -986,8 +989,17 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int 
> >> n_rxq, int n_txq)
> >>          conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC;
> >>      }
> >>  
> >> +    if (info.rx_offload_capa & DEV_TX_OFFLOAD_MULTI_SEGS) {
> >> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_MULTI_SEGS;
> >> +    }
> >
> >Why enabling multi seg is necessary?
> 
> it is mandatory requirement of GRO and GSO becuase they use
> multi-segment mbuf to avoid copy. GRO is very helpful to TCP
> perfoemance, GSO is necessary for UDP or TSO offoad isn't
> available on hardware NIC.

Perhaps I am missing something. If the NIC supports the flags
below, like for example VXLAN_TNL_TSO_OFFLOAD and OUTER_IPV4_CKSUM,
do we still need to enable DEV_TX_OFFLOAD_MULTI_SEGS?

[Yi Yang] Yes, DEV_TX_OFFLOAD_MULTI_SEGS means a mbuf is a chain of mbufs, this 
feature must be enabled explicitly, otherwise, NIC can't handle such chained 
mbuf.

If this is required by the GSO or GRO patch, then it should be
part of one of those patches and not here.

[Yi Yang] Yes, good suggestion, in order to avoid confusion, I'll send GRO and 
GSO patches in separate series.


> >> +
> >>      if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> >>          conf.txmode.offloads |= DPDK_TX_TSO_OFFLOAD_FLAGS;
> >> +        /* Enable VXLAN TSO support if available */
> >> +        if (dev->hw_ol_features & NETDEV_TX_VXLAN_TNL_TSO_OFFLOAD) {
> >> +            conf.txmode.offloads |= DEV_TX_OFFLOAD_VXLAN_TNL_TSO;
> >> +            conf.txmode.offloads |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
> >> +        }
> >>          if (dev->hw_ol_features & NETDEV_TX_SCTP_CHECKSUM_OFFLOAD) {
> >>              conf.txmode.offloads |= DEV_TX_OFFLOAD_SCTP_CKSUM;
> >>          }
> >> @@ -1126,6 +1138,10 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
> >>          if ((info.tx_offload_capa & tx_tso_offload_capa)
> >>              == tx_tso_offload_capa) {
> >>              dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
> >> +            /* Enable VXLAN TSO support if available */
> >> +            if (info.tx_offload_capa & DEV_TX_OFFLOAD_VXLAN_TNL_TSO) {
> >> +                dev->hw_ol_features |= NETDEV_TX_VXLAN_TNL_TSO_OFFLOAD;
> >> +            }
> >>              if (info.tx_offload_capa & DEV_TX_OFFLOAD_SCTP_CKSUM) {
> >>                  dev->hw_ol_features |= NETDEV_TX_SCTP_CHECKSUM_OFFLOAD;
> >>              } else {
> >> @@ -2131,35 +2147,166 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
> >>      rte_free(rx);
> >>  }
> >>  
> >> +static inline bool
> >> +is_local_to_local(uint16_t src_port_id, struct netdev_dpdk *dev)
> >> +{
> >> +    bool ret = false;
> >> +    struct netdev_dpdk *src_dev;
> >> +
> >> +    if (src_port_id == UINT16_MAX) {
> >> +        ret = true;
> >> +    } else {
> >> +        src_dev = netdev_dpdk_lookup_by_port_id(src_port_id);
> >> +        if (src_dev && (netdev_dpdk_get_vid(src_dev) >= 0)) {
> >> +            ret = true;
> >> +        }
> >> +    }
> >> +
> >> +    if (ret) {
> >> +        if (netdev_dpdk_get_vid(dev) < 0) {
> >> +            ret = false;
> >> +        }
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >>  /* Prepare the packet for HWOL.
> >>   * Return True if the packet is OK to continue. */
> >>  static bool
> >>  netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf 
> >> *mbuf)
> >>  {
> >>      struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf);
> >> +    uint16_t l4_proto = 0;
> >> +    uint8_t *l3_hdr_ptr = NULL;
> >> +    struct rte_ether_hdr *eth_hdr =
> >> +        rte_pktmbuf_mtod(mbuf, struct rte_ether_hdr *);
> >> +    struct rte_ipv4_hdr *ip_hdr;
> >> +    struct rte_ipv6_hdr *ip6_hdr;
> >> +
> >> +    /* Return directly if source and destitation of mbuf are local ports
> >> +     * because mbuf has already set ol_flags and l*_len correctly.
> >> +     */
> >> +    if (is_local_to_local(mbuf->port, dev)) {
> >> +        if (mbuf->ol_flags & (PKT_TX_TCP_SEG | PKT_TX_UDP_SEG)) {
> >> +            mbuf->tso_segsz = 1450 - mbuf->l3_len - mbuf->l4_len;
> >> +        }
> >> +        return true;
> >> +    }
> >
> >This assumption here doesn't seem correct. The packet could
> >have come from a NIC and then had been modified by data path,
> >so it doesn't mean the packet lengths and other fields are
> >correct in the mbuf. That's why we prepare here.
> 
> For local to local case, it is special, maybe l3_len and l4_len
> are incorrect, here we set tso_segsz for vhostuser interfaces and
> veth, I didn't see the case you mentioned, can you give a specific
> case? I can verify if it is ok. In the worst case, we can re-parse
> the packet before it to get correct l3_len and l4_len.

I think this comment and most of the comments below are related to
the design discussion above. So let's get that clarified and take
from there.

Thanks!
Fbl

[Yi Yang] Ok, Got it. Let us focus on " Enable VXLAN TSO for DPDK datapath" 
first and make sure it can be merged ASAP. I'll send new version of " Enable 
VXLAN TSO for DPDK datapath", that will fix your comments, thank you so much 
for reviewing.

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to