On 09.01.2019 21:05, Tiago Lam wrote:
> From: Mark Kavanagh <[email protected]>
>
> Currently, jumbo frame support for OvS-DPDK is implemented by
> increasing the size of mbufs within a mempool, such that each mbuf
> within the pool is large enough to contain an entire jumbo frame of
> a user-defined size. Typically, for each user-defined MTU,
> 'requested_mtu', a new mempool is created, containing mbufs of size
> ~requested_mtu.
>
> With the multi-segment approach, a port uses a single mempool,
> (containing standard/default-sized mbufs of ~2k bytes), irrespective
> of the user-requested MTU value. To accommodate jumbo frames, mbufs
> are chained together, where each mbuf in the chain stores a portion of
> the jumbo frame. Each mbuf in the chain is termed a segment, hence the
> name.
>
> == Enabling multi-segment mbufs ==
> Multi-segment and single-segment mbufs are mutually exclusive, and the
> user must decide on which approach to adopt on init. The introduction
> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
> is a global boolean value, which determines how jumbo frames are
> represented across all DPDK ports. In the absence of a user-supplied
> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
> mbufs must be explicitly enabled / single-segment mbufs remain the
> default.
>
> Setting the field is identical to setting existing DPDK-specific OVSDB
> fields:
>
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>
> Co-authored-by: Tiago Lam <[email protected]>
>
> Signed-off-by: Mark Kavanagh <[email protected]>
> Signed-off-by: Tiago Lam <[email protected]>
> Acked-by: Eelco Chaudron <[email protected]>
> ---
>  Documentation/topics/dpdk/jumbo-frames.rst | 73 ++++++++++++++++++++++++++++++
>  Documentation/topics/dpdk/memory.rst       | 36 +++++++++++++++
>  NEWS                                       |  1 +
>  lib/dpdk.c                                 |  8 ++++
>  lib/netdev-dpdk.c                          | 66 +++++++++++++++++++++++----
>  lib/netdev-dpdk.h                          |  1 +
>  vswitchd/vswitch.xml                       | 22 +++++++++
>  7 files changed, 199 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/jumbo-frames.rst b/Documentation/topics/dpdk/jumbo-frames.rst
> index 00360b4..9804bbb 100644
> --- a/Documentation/topics/dpdk/jumbo-frames.rst
> +++ b/Documentation/topics/dpdk/jumbo-frames.rst
> @@ -71,3 +71,76 @@ Jumbo frame support has been validated against 9728B frames, which is the
>  largest frame size supported by Fortville NIC using the DPDK i40e driver, but
>  larger frames and other DPDK NIC drivers may be supported. These cases are
>  common for use cases involving East-West traffic only.
> +
> +-------------------
> +Multi-segment mbufs
> +-------------------
> +
> +Instead of increasing the size of mbufs within a mempool, such that each mbuf
> +within the pool is large enough to contain an entire jumbo frame of a
> +user-defined size, mbufs can be chained together instead. In this approach each
> +mbuf in the chain stores a portion of the jumbo frame, by default ~2K bytes,
> +irrespective of the user-requested MTU value. Since each mbuf in the chain is
> +termed a segment, this approach is named "multi-segment mbufs".
> +
> +This approach may bring more flexibility in use cases where the maximum packet
> +length may be hard to guess. For example, in cases where packets originate from
> +sources marked for offload (such as TSO), each packet may be larger than the
> +MTU, and as such, when forwarding it to a DPDK port a single mbuf may not be
> +enough to hold all of the packet's data.
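For readers less familiar with the mbuf chaining described above: the segments are linked through the standard rte_mbuf 'next' pointer, so a ~9000B frame ends up spread over roughly five default-sized (~2KB) mbufs. A minimal sketch of walking such a chain, purely for illustration and not code from this patch:

    #include <rte_mbuf.h>
    #include <stdio.h>

    /* Walk a (possibly multi-segment) mbuf chain and show how the packet's
     * data is spread across its segments. */
    static void
    dump_mbuf_chain(const struct rte_mbuf *pkt)
    {
        const struct rte_mbuf *seg;
        uint32_t total = 0;
        int i = 0;

        printf("pkt_len=%u nb_segs=%u\n",
               (unsigned) pkt->pkt_len, (unsigned) pkt->nb_segs);
        for (seg = pkt; seg != NULL; seg = seg->next) {
            /* Each segment holds only 'data_len' bytes of the frame; the
             * payload itself starts at rte_pktmbuf_mtod(seg, char *). */
            printf("  seg %d: %u bytes\n", i++, (unsigned) seg->data_len);
            total += seg->data_len;
        }
        /* For a well-formed chain, 'total' equals pkt->pkt_len. */
        printf("  total=%u\n", (unsigned) total);
    }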
> +
> +Multi-segment and single-segment mbufs are mutually exclusive, and the user
> +must decide on which approach to adopt on initialisation. If multi-segment
> +mbufs is to be enabled, it can be done so with the following command::
> +
> +    $ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
> +
> +Single-segment mbufs still remain the default when using OvS-DPDK, and the
> +above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
> +multi-segment mbufs are to be used.
> +
> +~~~~~~~~~~~~~~~~~
> +Performance notes
> +~~~~~~~~~~~~~~~~~
> +
> +When using multi-segment mbufs some PMDs may not support vectorized Tx
> +functions, due to its non-contiguous nature. As a result this can hit
> +performance for smaller packet sizes. For example, on a setup sending 64B
> +packets at line rate, a decrease of ~20% has been observed. The performance
> +impact stops being noticeable for larger packet sizes, although the exact size
> +will depend on each PMD, and vary between architectures.
> +
> +Tests performed with the i40e PMD driver only showed this limitation for 64B
> +packets, and the same rate was observed when comparing multi-segment mbufs and
> +single-segment mbuf for 128B packets. In other words, the 20% drop in
> +performance was not observed for packets >= 128B during this test case.
> +
> +Because of this, multi-segment mbufs is not advised to be used with smaller
> +packet sizes, such as 64B.
> +
> +Also, note that using multi-segment mbufs won't improve memory usage. For a
> +packet of 9000B, for example, which would be stored on a single mbuf when using
> +the single-segment approach, 5 mbufs (9000/2176) of 2176B would be needed to
> +store the same data using the multi-segment mbufs approach (refer to
> +:doc:`/topics/dpdk/memory` for examples).
> +
> +~~~~~~~~~~~
> +Limitations
> +~~~~~~~~~~~
> +
> +Because multi-segment mbufs store the data uncontiguously in memory, when used
> +across DPDK and non-DPDK ports, a performance drop is expected, as the mbufs'
> +content needs to be copied into a contiguous region in memory to be used by
> +operations such as write(). Exchanging traffic between DPDK ports (such as
> +vhost and physical ports) doesn't have this limitation, however.
> +
> +Other operations may have a hit in performance as well, under the current
> +implementation. For example, operations that require a checksum to be performed
> +on the data, such as pushing / popping a VXLAN header, will also require a copy
> +of the data (if it hasn't been copied before), or when using the Userspace
> +connection tracker.
> +
> +Finally, it is assumed that, when enabling the multi-segment mbufs, a packet
> +header falls within the first mbuf, which is 2K in size. This is required
> +because at the moment the miniflow extraction and setting of the layer headers
> +(l2_5, l3, l4) assumes contiguous access to memory.
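To make the limitation above more concrete: any consumer that needs the frame as one contiguous buffer (write(), checksumming, etc.) first has to gather the segments. A rough sketch of such a gather copy, using only standard rte_mbuf fields; this is not the copy routine used by OvS, and DPDK's rte_pktmbuf_read() offers similar functionality:

    #include <rte_mbuf.h>
    #include <stdbool.h>
    #include <string.h>

    /* Copy an entire (possibly multi-segment) packet into a caller-provided
     * contiguous buffer.  Returns false if the buffer is too small. */
    static bool
    mbuf_chain_to_flat_buf(const struct rte_mbuf *pkt, void *buf, size_t len)
    {
        const struct rte_mbuf *seg;
        char *dst = buf;

        if (len < pkt->pkt_len) {
            return false;
        }
        for (seg = pkt; seg != NULL; seg = seg->next) {
            memcpy(dst, rte_pktmbuf_mtod(seg, const void *), seg->data_len);
            dst += seg->data_len;
        }
        return true;
    }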
> diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
> index 9ebfd11..7f414ef 100644
> --- a/Documentation/topics/dpdk/memory.rst
> +++ b/Documentation/topics/dpdk/memory.rst
> @@ -82,6 +82,14 @@ Users should be aware of the following:
>  Below are a number of examples of memory requirement calculations for both
>  shared and per port memory models.
>
> +.. note::
> +
> +   If multi-segment mbufs is enabled (:doc:`/topics/dpdk/jumbo-frames`), both
> +   the **number of mbufs** and the **size of each mbuf** might be adjusted,
> +   which might change slightly the amount of memory required for a given
> +   mempool. Examples of how these calculations are performed are also provided
> +   below, for the higher MTU case of each memory model.
> +
>  Shared Memory Calculations
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> @@ -142,6 +150,20 @@ Example 4
>      Mbuf size = 10176 Bytes
>      Memory required = 262144 * 10176 = 2667 MB
>
> +Example 5 (multi-segment mbufs enabled)
> ++++++++++++++++++++++++++++++++++++++++
> +::
> +
> +    MTU = 9000 Bytes
> +    Number of mbufs = 262144
> +    Mbuf size = 2048 Bytes
> +    Memory required = 262144 * (2048 * 5) = 2684 MB
> +
> +.. note::
> +
> +   In order to hold 9000B of data, 5 mbufs of 2048B each will be needed, hence
> +   the "5" above in 2048 * 5.
> +
>  Per Port Memory Calculations
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> @@ -214,3 +236,17 @@ Example 3: (2 rxq, 2 PMD, 9000 MTU)
>      Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
>      Mbuf size = 10176 Bytes
>      Memory required = 26656 * 10176 = 271 MB
> +
> +Example 4: (2 rxq, 2 PMD, 9000 MTU, multi-segment mbufs enabled)
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +::
> +
> +    MTU = 9000
> +    Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
> +    Mbuf size = 2048 Bytes
> +    Memory required = 26656 * (2048 * 5) = 273 MB
> +
> +.. note::
> +
> +   In order to hold 9000B of data, 5 mbufs of 2048B each will be needed, hence
> +   the "5" above in 2048 * 5.
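The "2048 * 5" factor in the examples above is just the per-packet segment count, rounded up. A small self-contained sketch of that arithmetic, mirroring the documentation's figures rather than the actual OvS code (which also accounts for frame overhead via MTU_TO_MAX_FRAME_LEN):

    #include <stdint.h>
    #include <stdio.h>

    /* How many default-sized mbufs are needed to hold one packet of the
     * given size (rounded up: a partial last segment still needs a whole
     * mbuf). */
    static uint32_t
    segs_per_packet(uint32_t pkt_size, uint32_t mbuf_size)
    {
        return (pkt_size + mbuf_size - 1) / mbuf_size;
    }

    int
    main(void)
    {
        uint32_t n_mbufs = 262144;   /* Shared-mempool example above. */
        uint32_t mbuf_size = 2048;   /* Default-sized mbuf. */
        uint32_t segs = segs_per_packet(9000, mbuf_size);        /* == 5 */
        uint64_t bytes = (uint64_t) n_mbufs * mbuf_size * segs;

        /* Matches "262144 * (2048 * 5) = 2684 MB" (MB taken as 10^6 bytes). */
        printf("%u segments, %llu MB\n", (unsigned) segs,
               (unsigned long long) (bytes / 1000000));
        return 0;
    }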
> diff --git a/NEWS b/NEWS
> index 2de844f..98f5a9b 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -76,6 +76,7 @@ v2.10.0 - 18 Aug 2018
>       * Allow init to fail and record DPDK status/version in OVS database.
>       * Add experimental flow hardware offload support
>       * Support both shared and per port mempools for DPDK devices.
> +     * Add support for multi-segment mbufs.
>     - Userspace datapath:
>       * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>       * Detailed PMD performance metrics available with new command
> diff --git a/lib/dpdk.c b/lib/dpdk.c
> index 0ee3e19..ac89fd8 100644
> --- a/lib/dpdk.c
> +++ b/lib/dpdk.c
> @@ -497,6 +497,14 @@ dpdk_init__(const struct smap *ovs_other_config)
>
>      /* Finally, register the dpdk classes */
>      netdev_dpdk_register();
> +
> +    bool multi_seg_mbufs_enable = smap_get_bool(ovs_other_config,
> +                                                "dpdk-multi-seg-mbufs", false);
> +    if (multi_seg_mbufs_enable) {
> +        VLOG_INFO("DPDK multi-segment mbufs enabled\n");
> +        netdev_dpdk_multi_segment_mbufs_enable();
> +    }
> +
>      return true;
>  }
>
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 7a9add7..d6114ee 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -70,6 +70,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
>
>  VLOG_DEFINE_THIS_MODULE(netdev_dpdk);
>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> +static bool dpdk_multi_segment_mbufs = false;
>
>  #define DPDK_PORT_WATCHDOG_INTERVAL 5
>
> @@ -519,6 +520,12 @@ is_dpdk_class(const struct netdev_class *class)
>             || class->destruct == netdev_dpdk_vhost_destruct;
>  }
>
> +void
> +netdev_dpdk_multi_segment_mbufs_enable(void)
> +{
> +    dpdk_multi_segment_mbufs = true;
> +}
> +
>  /* DPDK NIC drivers allocate RX buffers at a particular granularity, typically
>   * aligned at 1k or less. If a declared mbuf size is not a multiple of this
>   * value, insufficient buffers are allocated to accomodate the packet in its
> @@ -632,14 +639,17 @@ dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>      }
>  }
>
> -/* Calculating the required number of mbufs differs depending on the
> - * mempool model being used. Check if per port memory is in use before
> - * calculating.
> - */
> +/* Calculating the required number of mbufs differs depending on the mempool
> + * model (per port vs shared mempools) being used.
> + * In case multi-segment mbufs are being used, the number of mbufs is also
> + * increased, to account for the multiple mbufs needed to hold each packet's
> + * data. */
>  static uint32_t
> -dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, uint32_t mbuf_size,
> +                     bool per_port_mp)
>  {
>      uint32_t n_mbufs;
> +    uint16_t max_frame_len = 0;
>
>      if (!per_port_mp) {
>          /* Shared memory are being used.
> @@ -668,6 +678,22 @@ dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>                    + MIN_NB_MBUF;
>      }
>
> +    /* If multi-segment mbufs are used, we also increase the number of
> +     * mbufs used. This is done by calculating how many mbufs are needed to
> +     * hold the data on a single packet of MTU size. For example, for a
> +     * received packet of 9000B, 5 mbufs (9000 / 2048) are needed to hold
> +     * the data - 4 more than with single-mbufs (as mbufs' size is extended
> +     * to hold all data) */
> +    max_frame_len = MTU_TO_MAX_FRAME_LEN(dev->requested_mtu);
> +    if (dpdk_multi_segment_mbufs && mbuf_size < max_frame_len) {
> +        uint16_t nb_segs = max_frame_len / mbuf_size;
> +        if (max_frame_len % mbuf_size) {
> +            nb_segs += 1;
> +        }
> +
> +        n_mbufs *= nb_segs;
> +    }
> +
>      return n_mbufs;
>  }
>
> @@ -696,8 +722,12 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>
>      /* Get the size of each mbuf, based on the MTU */
>      mbuf_size = MTU_TO_FRAME_LEN(mtu);
> +    /* multi-segment mbufs - use standard mbuf size */
> +    if (dpdk_multi_segment_mbufs) {
> +        mbuf_size = dpdk_buf_size(ETHER_MTU);
> +    }
>
> -    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, mbuf_size, per_port_mp);
>
>      do {
>          /* Full DPDK memory pool name must be unique and cannot be
> @@ -956,6 +986,7 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>      int diag = 0;
>      int i;
>      struct rte_eth_conf conf = port_conf;
> +    struct rte_eth_txconf txconf;
>      struct rte_eth_dev_info info;
>      uint16_t conf_mtu;
>
> @@ -971,6 +1002,24 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>          }
>      }
>
> +    /* Multi-segment-mbuf-specific setup. */
> +    if (dpdk_multi_segment_mbufs) {
> +        if (info.tx_offload_capa & DEV_TX_OFFLOAD_MULTI_SEGS) {
> +            /* Enable multi-seg mbufs. DPDK PMDs typically attempt to use
> +             * simple or vectorized transmit functions, neither of which are
> +             * compatible with multi-segment mbufs. */
> +            conf.txmode.offloads |= DEV_TX_OFFLOAD_MULTI_SEGS;
> +        } else {
> +            VLOG_WARN("Interface %s doesn't support multi-segment mbufs",
> +                      dev->up.name);
> +            conf.txmode.offloads &= ~DEV_TX_OFFLOAD_MULTI_SEGS;
A simple warning is not enough. There are a few PMDs that do not support segmented packets and do not expect them. Sending segmented packets to such ports could cause a crash, memory leaks or other unexpected behaviour; at the very least, they will simply not work. We need a fallback solution for this case.

Best regards,
Ilya Maximets.
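One possible fallback along the lines requested above would be to linearize chained packets before they reach a PMD that did not advertise DEV_TX_OFFLOAD_MULTI_SEGS, and drop them if that fails. A minimal sketch using DPDK's rte_pktmbuf_linearize(); the helper and the caller-side policy are illustrative assumptions, not code from this patch:

    #include <rte_mbuf.h>
    #include <stdbool.h>

    /* Collapse a multi-segment packet into its first mbuf before it is
     * handed to a PMD that cannot handle chained mbufs.  Returns false if
     * the packet should be dropped instead (e.g. not enough tailroom in the
     * first segment to hold all of the data). */
    static bool
    prepare_pkt_for_single_seg_tx(struct rte_mbuf *pkt, bool multi_seg_capable)
    {
        if (multi_seg_capable || pkt->nb_segs == 1) {
            return true;    /* Nothing to do. */
        }
        /* rte_pktmbuf_linearize() copies all segments' data into the first
         * mbuf and frees the remaining segments; it returns 0 on success. */
        return rte_pktmbuf_linearize(pkt) == 0;
    }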
