On 09.01.2019 21:05, Tiago Lam wrote:
> From: Mark Kavanagh <[email protected]>
> 
> Currently, jumbo frame support for OvS-DPDK is implemented by
> increasing the size of mbufs within a mempool, such that each mbuf
> within the pool is large enough to contain an entire jumbo frame of
> a user-defined size. Typically, for each user-defined MTU,
> 'requested_mtu', a new mempool is created, containing mbufs of size
> ~requested_mtu.
> 
> With the multi-segment approach, a port uses a single mempool,
> (containing standard/default-sized mbufs of ~2k bytes), irrespective
> of the user-requested MTU value. To accommodate jumbo frames, mbufs
> are chained together, where each mbuf in the chain stores a portion of
> the jumbo frame. Each mbuf in the chain is termed a segment, hence the
> name.
> 
> == Enabling multi-segment mbufs ==
> Multi-segment and single-segment mbufs are mutually exclusive, and the
> user must decide on which approach to adopt on init. The introduction
> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
> is a global boolean value, which determines how jumbo frames are
> represented across all DPDK ports. In the absence of a user-supplied
> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
> mbufs must be explicitly enabled / single-segment mbufs remain the
> default.
> 
> Setting the field is identical to setting existing DPDK-specific OVSDB
> fields:
> 
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
> 
> Co-authored-by: Tiago Lam <[email protected]>
> 
> Signed-off-by: Mark Kavanagh <[email protected]>
> Signed-off-by: Tiago Lam <[email protected]>
> Acked-by: Eelco Chaudron <[email protected]>
> ---
>  Documentation/topics/dpdk/jumbo-frames.rst | 73 ++++++++++++++++++++++++++++++
>  Documentation/topics/dpdk/memory.rst       | 36 +++++++++++++++
>  NEWS                                       |  1 +
>  lib/dpdk.c                                 |  8 ++++
>  lib/netdev-dpdk.c                          | 66 +++++++++++++++++++++++----
>  lib/netdev-dpdk.h                          |  1 +
>  vswitchd/vswitch.xml                       | 22 +++++++++
>  7 files changed, 199 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/jumbo-frames.rst b/Documentation/topics/dpdk/jumbo-frames.rst
> index 00360b4..9804bbb 100644
> --- a/Documentation/topics/dpdk/jumbo-frames.rst
> +++ b/Documentation/topics/dpdk/jumbo-frames.rst
> @@ -71,3 +71,76 @@ Jumbo frame support has been validated against 9728B frames, which is the
>  largest frame size supported by Fortville NIC using the DPDK i40e driver, but
>  larger frames and other DPDK NIC drivers may be supported. These cases are
>  common for use cases involving East-West traffic only.
> +
> +-------------------
> +Multi-segment mbufs
> +-------------------
> +
> +Instead of increasing the size of mbufs within a mempool, such that each mbuf
> +within the pool is large enough to contain an entire jumbo frame of a
> +user-defined size, mbufs can be chained together. In this approach each mbuf
> +in the chain stores a portion of the jumbo frame, by default ~2K bytes,
> +irrespective of the user-requested MTU value. Since each mbuf in the chain is
> +termed a segment, this approach is named "multi-segment mbufs".
> +
> +This approach may bring more flexibility in use cases where the maximum packet
> +length may be hard to guess. For example, in cases where packets originate from
> +sources marked for offload (such as TSO), each packet may be larger than the
> +MTU, and as such, when forwarding it to a DPDK port a single mbuf may not be
> +enough to hold all of the packet's data.
> +
> +Multi-segment and single-segment mbufs are mutually exclusive, and the user
> +must decide which approach to adopt on initialisation. If multi-segment
> +mbufs are to be enabled, this can be done with the following command::
> +
> +    $ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
> +
> +Single-segment mbufs remain the default when using OvS-DPDK, and the above
> +option ``dpdk-multi-seg-mbufs`` must be explicitly set to ``true`` if
> +multi-segment mbufs are to be used.
> +
> +~~~~~~~~~~~~~~~~~
> +Performance notes
> +~~~~~~~~~~~~~~~~~
> +
> +When using multi-segment mbufs some PMDs may not support vectorized Tx
> +functions, due to their non-contiguous nature. As a result, this can hurt
> +performance for smaller packet sizes. For example, on a setup sending 64B
> +packets at line rate, a decrease of ~20% has been observed. The performance
> +impact stops being noticeable for larger packet sizes, although the exact size
> +will depend on each PMD, and vary between architectures.
> +
> +Tests performed with the i40e PMD driver only showed this limitation for 64B
> +packets, and the same rate was observed when comparing multi-segment mbufs and
> +single-segment mbufs for 128B packets. In other words, the 20% drop in
> +performance was not observed for packets >= 128B during this test case.
> +
> +Because of this, using multi-segment mbufs with smaller packet sizes, such as
> +64B, is not advised.
> +
> +Also, note that using multi-segment mbufs won't improve memory usage. For a
> +packet of 9000B, for example, which would be stored in a single mbuf when
> +using the single-segment approach, 5 mbufs (9000/2176, rounded up) of 2176B
> +would be needed to store the same data using the multi-segment mbufs approach
> +(refer to :doc:`/topics/dpdk/memory` for examples).
> +
> +~~~~~~~~~~~
> +Limitations
> +~~~~~~~~~~~
> +
> +Because multi-segment mbufs store data non-contiguously in memory, a
> +performance drop is expected when they are used across DPDK and non-DPDK
> +ports, as the mbufs' content needs to be copied into a contiguous region in
> +memory to be used by operations such as write(). Exchanging traffic between
> +DPDK ports (such as vhost and physical ports) doesn't have this limitation,
> +however.
> +
> +Other operations may take a performance hit as well, under the current
> +implementation. For example, operations that require a checksum to be
> +computed over the data, such as pushing / popping a VXLAN header, or passing
> +packets through the userspace connection tracker, will also require a copy of
> +the data (if it hasn't been copied before).
> +
> +Finally, it is assumed that, when multi-segment mbufs are enabled, a packet's
> +headers fall within the first mbuf, which is 2K in size. This is required
> +because, at the moment, the miniflow extraction and the setting of the layer
> +headers (l2_5, l3, l4) assume contiguous access to memory.
> diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
> index 9ebfd11..7f414ef 100644
> --- a/Documentation/topics/dpdk/memory.rst
> +++ b/Documentation/topics/dpdk/memory.rst
> @@ -82,6 +82,14 @@ Users should be aware of the following:
>  Below are a number of examples of memory requirement calculations for both
>  shared and per port memory models.
>  
> +.. note::
> +
> +   If multi-segment mbufs are enabled (:doc:`/topics/dpdk/jumbo-frames`), both
> +   the **number of mbufs** and the **size of each mbuf** might be adjusted,
> +   which might slightly change the amount of memory required for a given
> +   mempool. Examples of how these calculations are performed are also provided
> +   below, for the higher MTU case of each memory model.
> +
>  Shared Memory Calculations
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~
>  
> @@ -142,6 +150,20 @@ Example 4
>   Mbuf size = 10176 Bytes
>   Memory required = 262144 * 10176 = 2667 MB
>  
> +Example 5 (multi-segment mbufs enabled)
> ++++++++++++++++++++++++++++++++++++++++
> +::
> +
> + MTU = 9000 Bytes
> + Number of mbufs = 262144
> + Mbuf size = 2048 Bytes
> + Memory required = 262144 * (2048 * 5) = 2684 MB
> +
> +.. note::
> +
> +   In order to hold 9000B of data, 5 mbufs of 2048B each will be needed, hence
> +   the "5" above in 2048 * 5.
> +
>  Per Port Memory Calculations
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>  
> @@ -214,3 +236,17 @@ Example 3: (2 rxq, 2 PMD, 9000 MTU)
>   Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
>   Mbuf size = 10176 Bytes
>   Memory required = 26656 * 10176 = 271 MB
> +
> +Example 4: (2 rxq, 2 PMD, 9000 MTU, multi-segment mbufs enabled)
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +::
> +
> + MTU = 9000
> + Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
> + Mbuf size = 2048 Bytes
> + Memory required = 26656 * (2048 * 5) = 273 MB
> +
> +.. note::
> +
> +   In order to hold 9000B of data, 5 mbufs of 2048B each will be needed, hence
> +   the "5" above in 2048 * 5.
> diff --git a/NEWS b/NEWS
> index 2de844f..98f5a9b 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -76,6 +76,7 @@ v2.10.0 - 18 Aug 2018
>       * Allow init to fail and record DPDK status/version in OVS database.
>       * Add experimental flow hardware offload support
>       * Support both shared and per port mempools for DPDK devices.
> +     * Add support for multi-segment mbufs.
>     - Userspace datapath:
>       * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>       * Detailed PMD performance metrics available with new command
> diff --git a/lib/dpdk.c b/lib/dpdk.c
> index 0ee3e19..ac89fd8 100644
> --- a/lib/dpdk.c
> +++ b/lib/dpdk.c
> @@ -497,6 +497,14 @@ dpdk_init__(const struct smap *ovs_other_config)
>  
>      /* Finally, register the dpdk classes */
>      netdev_dpdk_register();
> +
> +    bool multi_seg_mbufs_enable = smap_get_bool(ovs_other_config,
> +            "dpdk-multi-seg-mbufs", false);
> +    if (multi_seg_mbufs_enable) {
> +        VLOG_INFO("DPDK multi-segment mbufs enabled\n");
> +        netdev_dpdk_multi_segment_mbufs_enable();
> +    }
> +
>      return true;
>  }
>  
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 7a9add7..d6114ee 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -70,6 +70,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
>  
>  VLOG_DEFINE_THIS_MODULE(netdev_dpdk);
>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> +static bool dpdk_multi_segment_mbufs = false;
>  
>  #define DPDK_PORT_WATCHDOG_INTERVAL 5
>  
> @@ -519,6 +520,12 @@ is_dpdk_class(const struct netdev_class *class)
>             || class->destruct == netdev_dpdk_vhost_destruct;
>  }
>  
> +void
> +netdev_dpdk_multi_segment_mbufs_enable(void)
> +{
> +    dpdk_multi_segment_mbufs = true;
> +}
> +
>  /* DPDK NIC drivers allocate RX buffers at a particular granularity, typically
>   * aligned at 1k or less. If a declared mbuf size is not a multiple of this
>   * value, insufficient buffers are allocated to accomodate the packet in its
> @@ -632,14 +639,17 @@ dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>      }
>  }
>  
> -/* Calculating the required number of mbufs differs depending on the
> - * mempool model being used. Check if per port memory is in use before
> - * calculating.
> - */
> +/* Calculating the required number of mbufs differs depending on the mempool
> + * model (per port vs shared mempools) being used.
> + * In case multi-segment mbufs are being used, the number of mbufs is also
> + * increased, to account for the multiple mbufs needed to hold each packet's
> + * data. */
>  static uint32_t
> -dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, uint32_t mbuf_size,
> +                     bool per_port_mp)
>  {
>      uint32_t n_mbufs;
> +    uint16_t max_frame_len = 0;
>  
>      if (!per_port_mp) {
>          /* Shared memory are being used.
> @@ -668,6 +678,22 @@ dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>                    + MIN_NB_MBUF;
>      }
>  
> +    /* If multi-segment mbufs are used, we also increase the number of
> +     * mbufs used. This is done by calculating how many mbufs are needed to
> +     * hold the data of a single packet of MTU size. For example, for a
> +     * received packet of 9000B, 5 mbufs (9000 / 2048, rounded up) are needed
> +     * to hold the data - 4 more than with single-segment mbufs, whose size
> +     * is instead extended to hold all of the data. */
> +    max_frame_len = MTU_TO_MAX_FRAME_LEN(dev->requested_mtu);
> +    if (dpdk_multi_segment_mbufs && mbuf_size < max_frame_len) {
> +        uint16_t nb_segs = max_frame_len / mbuf_size;
> +        if (max_frame_len % mbuf_size) {
> +            nb_segs += 1;
> +        }
> +
> +        n_mbufs *= nb_segs;
> +    }
> +
>      return n_mbufs;
>  }
>  
> @@ -696,8 +722,12 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>  
>      /* Get the size of each mbuf, based on the MTU */
>      mbuf_size = MTU_TO_FRAME_LEN(mtu);
> +    /* multi-segment mbufs - use standard mbuf size */
> +    if (dpdk_multi_segment_mbufs) {
> +        mbuf_size = dpdk_buf_size(ETHER_MTU);
> +    }
>  
> -    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, mbuf_size, per_port_mp);
>  
>      do {
>          /* Full DPDK memory pool name must be unique and cannot be
> @@ -956,6 +986,7 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>      int diag = 0;
>      int i;
>      struct rte_eth_conf conf = port_conf;
> +    struct rte_eth_txconf txconf;
>      struct rte_eth_dev_info info;
>      uint16_t conf_mtu;
>  
> @@ -971,6 +1002,24 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>          }
>      }
>  
> +    /* Multi-segment-mbuf-specific setup. */
> +    if (dpdk_multi_segment_mbufs) {
> +        if (info.tx_offload_capa & DEV_TX_OFFLOAD_MULTI_SEGS) {
> +            /* Enable multi-seg mbufs. DPDK PMDs typically attempt to use
> +             * simple or vectorized transmit functions, neither of which are
> +             * compatible with multi-segment mbufs. */
> +            conf.txmode.offloads |= DEV_TX_OFFLOAD_MULTI_SEGS;
> +        } else {
> +            VLOG_WARN("Interface %s doesn't support multi-segment mbufs",
> +                      dev->up.name);
> +            conf.txmode.offloads &= ~DEV_TX_OFFLOAD_MULTI_SEGS;

A simple warning is not enough.
There are a few PMDs that do not support segmented packets and do not
expect them. Sending segmented packets to such ports could cause a crash,
memory leaks or other unexpected behaviour, and at the very least they
will simply not work.
We need a fallback solution for this case.
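
For example (a very rough, untested sketch; the helper name and the
'multi_seg_supported' flag - which would come from the DEV_TX_OFFLOAD_MULTI_SEGS
capability check above - are made up for illustration), chained mbufs could be
linearized on the Tx path for such devices, and dropped when that is not
possible:

#include <stdbool.h>
#include <rte_mbuf.h>

/* Sketch only: filter a Tx burst destined for a device that does not support
 * multi-segment mbufs.  Returns the number of packets left in 'pkts'. */
static int
drop_or_linearize_multiseg(struct rte_mbuf **pkts, int cnt,
                           bool multi_seg_supported)
{
    int i, n = 0;

    for (i = 0; i < cnt; i++) {
        struct rte_mbuf *pkt = pkts[i];

        if (multi_seg_supported || pkt->nb_segs == 1) {
            pkts[n++] = pkt;          /* Safe to send as-is. */
        } else if (rte_pktmbuf_linearize(pkt) == 0) {
            pkts[n++] = pkt;          /* Now a single contiguous segment. */
        } else {
            rte_pktmbuf_free(pkt);    /* Not enough tailroom to linearize:
                                       * drop and count it as a Tx drop. */
        }
    }

    return n;
}

Note that with ~2K of data room per segment a full jumbo frame will not fit in
the first mbuf's tailroom, so rte_pktmbuf_linearize() will fail for such
packets and they would still have to be dropped; but at least the PMD never
sees a chain it cannot handle.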

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
