On 03/10/2018 19:27, Flavio Leitner wrote:
> On Fri, Sep 28, 2018 at 05:15:12PM +0100, Tiago Lam wrote:
>> From: Mark Kavanagh <[email protected]>
>>
>> Currently, jumbo frame support for OvS-DPDK is implemented by
>> increasing the size of mbufs within a mempool, such that each mbuf
>> within the pool is large enough to contain an entire jumbo frame of
>> a user-defined size. Typically, for each user-defined MTU,
>> 'requested_mtu', a new mempool is created, containing mbufs of size
>> ~requested_mtu.
>>
>> With the multi-segment approach, a port uses a single mempool,
>> (containing standard/default-sized mbufs of ~2k bytes), irrespective
>> of the user-requested MTU value. To accommodate jumbo frames, mbufs
>> are chained together, where each mbuf in the chain stores a portion of
>> the jumbo frame. Each mbuf in the chain is termed a segment, hence the
>> name.
>>
>> == Enabling multi-segment mbufs ==
>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>> user must decide on which approach to adopt on init. The introduction
>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
>> is a global boolean value, which determines how jumbo frames are
>> represented across all DPDK ports. In the absence of a user-supplied
>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>> mbufs must be explicitly enabled / single-segment mbufs remain the
>> default.
>>
>> Setting the field is identical to setting existing DPDK-specific OVSDB
>> fields:
>>
>> ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>> ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>> ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>
>> Co-authored-by: Tiago Lam <[email protected]>
>>
>> Signed-off-by: Mark Kavanagh <[email protected]>
>> Signed-off-by: Tiago Lam <[email protected]>
>> Acked-by: Eelco Chaudron <[email protected]>
>> ---
>> Documentation/topics/dpdk/jumbo-frames.rst | 73 ++++++++++++++++++++++++++++++
>> Documentation/topics/dpdk/memory.rst | 36 +++++++++++++++
>> NEWS | 1 +
>> lib/dpdk.c | 8 ++++
>> lib/netdev-dpdk.c | 60 ++++++++++++++++++++----
>> lib/netdev-dpdk.h | 1 +
>> vswitchd/vswitch.xml | 22 +++++++++
>> 7 files changed, 193 insertions(+), 8 deletions(-)
>>
>> diff --git a/Documentation/topics/dpdk/jumbo-frames.rst b/Documentation/topics/dpdk/jumbo-frames.rst
>> index 00360b4..5888f1e 100644
>> --- a/Documentation/topics/dpdk/jumbo-frames.rst
>> +++ b/Documentation/topics/dpdk/jumbo-frames.rst
>> @@ -71,3 +71,76 @@ Jumbo frame support has been validated against 9728B frames, which is the
>> largest frame size supported by Fortville NIC using the DPDK i40e driver, but
>> larger frames and other DPDK NIC drivers may be supported. These cases are
>> common for use cases involving East-West traffic only.
>> +
>> +-------------------
>> +Multi-segment mbufs
>> +-------------------
>> +
>> +Instead of increasing the size of mbufs within a mempool, such that each mbuf
>> +within the pool is large enough to contain an entire jumbo frame of a
>> +user-defined size, mbufs can be chained together instead. In this approach each
>> +mbuf in the chain stores a portion of the jumbo frame, by default ~2K bytes,
>> +irrespective of the user-requested MTU value. Since each mbuf in the chain is
>> +termed a segment, this approach is named "multi-segment mbufs".
>> +
>> +This approach may bring more flexibility in use cases where the maximum packet
>> +length may be hard to guess. For example, in cases where packets originate from
>> +sources marked for oflload (such as TSO), each packet may be larger than the
> typo ^^^^^^
Fixed, thanks.
>
>> +MTU, and as such, when forwarding it to a DPDK port a single mbuf may not be
>> +enough to hold all of the packet's data.
>> +
>> +Multi-segment and single-segment mbufs are mutually exclusive, and the user
>> +must decide on which approach to adopt on initialisation. If multi-segment
>> +mbufs is to be enabled, it can be done so with the following command::
>> +
>> + $ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>> +
>> +Single-segment mbufs still remain the default when using OvS-DPDK, and the
>> +above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
>> +multi-segment mbufs are to be used.
>> +
>> +~~~~~~~~~~~~~~~~~
>> +Performance notes
>> +~~~~~~~~~~~~~~~~~
>> +
>> +When using multi-segment mbufs some PMDs may not support vectorized Tx
>> +functions, due to its non-contiguous nature. As a result this can hit
>> +performance for smaller packet sizes. For example, on a setup sending 64B
>> +packets at line rate, a decrease of ~20% has been observed. The performance
>> +impact stops being noticeable for larger packet sizes, although the exact size
>> +will between PMDs, and depending on the architecture one's using.
>
> perhaps 'will depend on each PMD'?
Thanks, I've used "will depend on each PMD, and vary between
architectures.".
>
>> +
>> +Tests performed with the i40e PMD driver only showed this limitation for 64B
>> +packets, and the same rate was observed when comparing multi-segment mbufs and
>> +single-segment mbuf for 128B packets. In other words, the 20% drop in
>> +performance was not observed for packets >= 128B during this test case.
>> +
>> +Because of this, multi-segment mbufs is not advised to be used with smaller
>> +packet sizes, such as 64B.
>> +
>> +Also, note that using multi-segment mbufs won't improve memory usage. For a
>> +packet of 9000B, for example, which would be stored on a single mbuf when using
>> +the single-segment approach, 5 mbufs (9000/2176) of 2176B would be needed to
>> +store the same data using the multi-segment mbufs approach (refer to
>> +:doc:`/topics/dpdk/memory` for examples).
>> +
>> +~~~~~~~~~~~
>> +Limitations
>> +~~~~~~~~~~~
>> +
>> +Because multi-segment mbufs store the data uncontiguously in memory, when used
>> +across DPDK and non-DPDK ports, a performance drop is expected, as the mbufs'
>> +content needs to be copied into a contiguous region in memory to be used by
>> +operations such as write(). Exchanging traffic between DPDK ports (such as
>> +vhost and physical ports) doesn't have this limitation, however.
>> +
>> +Other operations may have a hit in performance as well, under the current
>> +implementation. For example, operations that require a checksum to be performed
>> +on the data, such as pushing / popping a VXLAN header, will also require a copy
>> +of the data (if it hasn't been copied before), or when using the Userspace
>> +connection tracker.
>> +
>> +Finally, it is assumed that, when enabling the multi-segment mbufs, a packet
>> +header falls within the first mbuf, which is 2K in size. This is required
>> +because at the moment the miniflow extraction and setting of the layer headers
>> +(l2_5, l3, l4) assumes contiguous access to memory.
>
> OVS-DPDK was focused on very small packets in the beginning, but what
> I am seeing in production environments is that there is a mix of
> packet sizes. Some VNFs still use small packets, but the same ovs-dpdk
> needs to work with big chunks of data as well.
>
> Having said that, I think this feature should be ON by default with
> the option to be disabled if someone really wants to squeeze extra
> performance. We need to make it easy for most users. For specific
> use cases, experts can tune for more.
>
> Another thing is that users increase MTU because they need to, so
> there is an expectation of having big packets. Because of that I think
> the default should be to create mempools respecting the given MTU size,
> and we chain them if we have more data. So, if we can handle 64k
> (TSO), with MTU of 9000, it would need 7~8 mbufs. I suspect that if
> we can enable this by default, the MTU in some use cases most probably
> will remain at the default 1500. In any case, it will help to avoid the
> non-linearity cost for packets above 1500 when MTU is bumped.
I would have to agree here. I hadn't really considered changing the mbuf
size, but that seems like a good approach to me. It could improve
usability to some degree. Maybe we could start with it disabled, but
move towards enabling it by default once it is better fleshed out?
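For reference, a quick sketch of the segment math under that scheme (the
names here are illustrative, not from the patch):

    #include <stdio.h>

    /* How many mbufs a frame would need if each mbuf's data room is
     * sized to the configured MTU, as suggested above. */
    static unsigned
    segs_needed(unsigned frame_len, unsigned mbuf_data_room)
    {
        return (frame_len + mbuf_data_room - 1) / mbuf_data_room;
    }

    int
    main(void)
    {
        /* A 64K TSO frame with mbufs sized for a 9000B MTU -> 8 segments,
         * in line with the 7~8 estimate above. */
        printf("%u\n", segs_needed(65536, 9000));
        /* With the default 1500B MTU and no TSO, a packet still fits in a
         * single mbuf, so the common case stays linear. */
        printf("%u\n", segs_needed(1500, 1500));
        return 0;
    }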
>
> Regarding the 64B 20% performance hit, does that come only from
> changing the driver's TX function?
It does, yeah. I wonder if there's room for improvement here as well.
Right now DPDK seems to not choose the vectorized Tx functions if the
ETH_TXQ_FLAGS_NOMULTSEGS flag isn't set (among others). But it could
maybe be a bit more dynamic and check which function to use depending on
whether the mbuf is chained or not.
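To make that a bit more concrete, the kind of per-packet check I have in
mind would look roughly like the below - just a sketch of the idea, not
how the PMDs actually select their Tx path internally:

    #include <stdbool.h>
    #include <rte_mbuf.h>

    /* Illustrative only: decide per packet whether a contiguous-only
     * (simple/vectorized) Tx path could be used, instead of fixing the
     * choice for the whole queue at setup time. */
    static inline bool
    tx_can_use_simple_path(const struct rte_mbuf *m)
    {
        /* rte_pktmbuf_is_contiguous() is just m->nb_segs == 1. */
        return rte_pktmbuf_is_contiguous(m);
    }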
>
> fbl
Thanks for the review, Flavio - much appreciated. I'll send a new
iteration next week with fixes/improvements, including the data_read
function we discussed for pulling data.
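For that data_read function, the rough shape I'm considering is a thin
wrapper around DPDK's rte_pktmbuf_read(), which returns a pointer into the
first segment when the requested range is already contiguous and otherwise
copies it into a caller-supplied buffer. The name and signature below are
placeholders and may change in the next revision:

    #include <rte_mbuf.h>

    /* Placeholder sketch: give callers a contiguous view of (possibly
     * chained) mbuf data, copying only when the range spans segments.
     * 'scratch_buf' must be at least 'len' bytes; NULL is returned if the
     * requested range is out of bounds. */
    static const void *
    read_mbuf_data(const struct rte_mbuf *m, uint32_t offset, uint32_t len,
                   void *scratch_buf)
    {
        return rte_pktmbuf_read(m, offset, len, scratch_buf);
    }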
Tiago.
>
>
>> diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
>> index e5fb166..d8a952a 100644
>> --- a/Documentation/topics/dpdk/memory.rst
>> +++ b/Documentation/topics/dpdk/memory.rst
>> @@ -82,6 +82,14 @@ Users should be aware of the following:
>> Below are a number of examples of memory requirement calculations for both
>> shared and per port memory models.
>>
>> +.. note::
>> +
>> + If multi-segment mbufs is enabled (:doc:`/topics/dpdk/jumbo-frames`), both
>> + the **number of mbufs** and the **size of each mbuf** might be adjusted,
>> + which might change slightly the amount of memory required for a given
>> + mempool. Examples of how these calculations are performed are also provided
>> + below, for the higher MTU case of each memory model.
>> +
>> Shared Memory Calculations
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> @@ -142,6 +150,20 @@ Example 4
>> Mbuf size = 10176 Bytes
>> Memory required = 262144 * 10176 = 2667 MB
>>
>> +Example 5 (multi-segment mbufs enabled)
>> ++++++++++++++++++++++++++++++++++++++++
>> +::
>> +
>> + MTU = 9000 Bytes
>> + Number of mbufs = 262144
>> + Mbuf size = 2176 Bytes
>> + Memory required = 262144 * (2176 * 5) = 2852 MB
>> +
>> +.. note::
>> +
>> + In order to hold 9000B of data, 5 mbufs of 2176B each will be needed, hence
>> + the "5" above in 2176 * 5.
>> +
>> Per Port Memory Calculations
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> @@ -214,3 +236,17 @@ Example 3: (2 rxq, 2 PMD, 9000 MTU)
>> Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
>> Mbuf size = 10176 Bytes
>> Memory required = 26656 * 10176 = 271 MB
>> +
>> +Example 4: (2 rxq, 2 PMD, 9000 MTU, multi-segment mbufs enabled)
>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> +::
>> +
>> + MTU = 9000
>> + Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
>> + Mbuf size = 2176 Bytes
>> + Memory required = 26656 * (2176 * 5) = 290 MB
>> +
>> +.. note::
>> +
>> + In order to hold 9000B of data, 5 mbufs of 2176B each will be needed, hence
>> + the "5" above in 2176 * 5.
>> diff --git a/NEWS b/NEWS
>> index 8b6f3ce..929088b 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -57,6 +57,7 @@ v2.10.0 - 18 Aug 2018
>> * Allow init to fail and record DPDK status/version in OVS database.
>> * Add experimental flow hardware offload support
>> * Support both shared and per port mempools for DPDK devices.
>> + * Add support for multi-segment mbufs.
>> - Userspace datapath:
>> * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>> * Detailed PMD performance metrics available with new command
>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>> index 0ee3e19..ac89fd8 100644
>> --- a/lib/dpdk.c
>> +++ b/lib/dpdk.c
>> @@ -497,6 +497,14 @@ dpdk_init__(const struct smap *ovs_other_config)
>>
>> /* Finally, register the dpdk classes */
>> netdev_dpdk_register();
>> +
>> + bool multi_seg_mbufs_enable = smap_get_bool(ovs_other_config,
>> + "dpdk-multi-seg-mbufs", false);
>> + if (multi_seg_mbufs_enable) {
>> + VLOG_INFO("DPDK multi-segment mbufs enabled\n");
>> + netdev_dpdk_multi_segment_mbufs_enable();
>> + }
>> +
>> return true;
>> }
>>
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index e58e7ac..96c2020 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -70,6 +70,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
>>
>> VLOG_DEFINE_THIS_MODULE(netdev_dpdk);
>> static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>> +static bool dpdk_multi_segment_mbufs = false;
>>
>> #define DPDK_PORT_WATCHDOG_INTERVAL 5
>>
>> @@ -521,6 +522,12 @@ is_dpdk_class(const struct netdev_class *class)
>> || class->destruct == netdev_dpdk_vhost_destruct;
>> }
>>
>> +void
>> +netdev_dpdk_multi_segment_mbufs_enable(void)
>> +{
>> + dpdk_multi_segment_mbufs = true;
>> +}
>> +
>> /* DPDK NIC drivers allocate RX buffers at a particular granularity, typically
>> * aligned at 1k or less. If a declared mbuf size is not a multiple of this
>> * value, insufficient buffers are allocated to accomodate the packet in its
>> @@ -636,14 +643,17 @@ dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>> }
>> }
>>
>> -/* Calculating the required number of mbufs differs depending on the
>> - * mempool model being used. Check if per port memory is in use before
>> - * calculating.
>> - */
>> +/* Calculating the required number of mbufs differs depending on the mempool
>> + * model (per port vs shared mempools) being used.
>> + * In case multi-segment mbufs are being used, the number of mbufs is also
>> + * increased, to account for the multiple mbufs needed to hold each packet's
>> + * data. */
>> static uint32_t
>> -dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, uint32_t mbuf_size,
>> + bool per_port_mp)
>> {
>> uint32_t n_mbufs;
>> + uint16_t max_frame_len = 0;
>>
>> if (!per_port_mp) {
>> /* Shared memory are being used.
>> @@ -672,6 +682,22 @@ dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>> + MIN_NB_MBUF;
>> }
>>
>> + /* If multi-segment mbufs are used, we also increase the number of
>> + * mbufs used. This is done by calculating how many mbufs are needed to
>> + * hold the data on a single packet of MTU size. For example, for a
>> + * received packet of 9000B, 5 mbufs (9000 / 2048) are needed to hold
>> + * the data - 4 more than with single-mbufs (as mbufs' size is extended
>> + * to hold all data) */
>> + max_frame_len = MTU_TO_MAX_FRAME_LEN(dev->requested_mtu);
>> + if (dpdk_multi_segment_mbufs && mbuf_size < max_frame_len) {
>> + uint16_t nb_segs = max_frame_len / mbuf_size;
>> + if (max_frame_len % mbuf_size) {
>> + nb_segs += 1;
>> + }
>> +
>> + n_mbufs *= nb_segs;
>> + }
>> +
>> return n_mbufs;
>> }
>>
>> @@ -700,8 +726,12 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>
>> /* Get the size of each mbuf, based on the MTU */
>> mbuf_size = dpdk_buf_size(dev->requested_mtu);
>> + /* multi-segment mbufs - use standard mbuf size */
>> + if (dpdk_multi_segment_mbufs) {
>> + mbuf_size = dpdk_buf_size(ETHER_MTU);
>> + }
>>
>> - n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>> + n_mbufs = dpdk_calculate_mbufs(dev, mtu, mbuf_size, per_port_mp);
>>
>> do {
>> /* Full DPDK memory pool name must be unique and cannot be
>> @@ -959,6 +989,7 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>> int diag = 0;
>> int i;
>> struct rte_eth_conf conf = port_conf;
>> + struct rte_eth_txconf txconf;
>> struct rte_eth_dev_info info;
>> uint16_t conf_mtu;
>>
>> @@ -975,6 +1006,18 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>> }
>> }
>>
>> + /* Multi-segment-mbuf-specific setup. */
>> + if (dpdk_multi_segment_mbufs) {
>> + /* DPDK PMDs typically attempt to use simple or vectorized
>> + * transmit functions, neither of which are compatible with
>> + * multi-segment mbufs. Ensure that these are disabled when
>> + * multi-segment mbufs are enabled.
>> + */
>> + rte_eth_dev_info_get(dev->port_id, &info);
>> + txconf = info.default_txconf;
>> + txconf.txq_flags &= ~ETH_TXQ_FLAGS_NOMULTSEGS;
>> + }
>> +
>> conf.intr_conf.lsc = dev->lsc_interrupt_mode;
>> conf.rxmode.hw_ip_checksum = (dev->hw_ol_features &
>> NETDEV_RX_CHECKSUM_OFFLOAD) != 0;
>> @@ -1019,7 +1062,9 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>>
>> for (i = 0; i < n_txq; i++) {
>> diag = rte_eth_tx_queue_setup(dev->port_id, i, dev->txq_size,
>> - dev->socket_id, NULL);
>> + dev->socket_id,
>> + dpdk_multi_segment_mbufs ? &txconf
>> + : NULL);
>> if (diag) {
>> VLOG_INFO("Interface %s unable to setup txq(%d): %s",
>> dev->up.name, i, rte_strerror(-diag));
>> @@ -4118,7 +4163,6 @@ unlock:
>> return err;
>> }
>>
>> -
>> /* Find rte_flow with @ufid */
>> static struct rte_flow *
>> ufid_to_rte_flow_find(const ovs_u128 *ufid) {
>> diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
>> index cc0501d..2bb9090 100644
>> --- a/lib/netdev-dpdk.h
>> +++ b/lib/netdev-dpdk.h
>> @@ -29,6 +29,7 @@ struct dp_packet;
>> .flow_put = netdev_dpdk_flow_put, \
>> .flow_del = netdev_dpdk_flow_del
>>
>> +void netdev_dpdk_multi_segment_mbufs_enable(void);
>> void netdev_dpdk_register(void);
>> void free_dpdk_buf(struct dp_packet *);
>>
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index 66a8a3b..3a691f2 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -346,6 +346,28 @@
>> </p>
>> </column>
>>
>> + <column name="other_config" key="dpdk-multi-seg-mbufs"
>> + type='{"type": "boolean"}'>
>> + <p>
>> + Specifies if DPDK uses multi-segment mbufs for handling jumbo frames.
>> + </p>
>> + <p>
>> + If true, DPDK allocates a single mempool per port, irrespective of
>> + the ports' requested MTU sizes. The elements of this mempool are
>> + 'standard'-sized mbufs (typically 2k MB), which may be chained
>> + together to accommodate jumbo frames. In this approach, each mbuf
>> + typically stores a fragment of the overall jumbo frame.
>> + </p>
>> + <p>
>> + If not specified, defaults to <code>false</code>, in which case, the
>> + size of each mbuf within a DPDK port's mempool will be grown to
>> + accommodate jumbo frames within a single mbuf.
>> + </p>
>> + <p>
>> + Changing this value requires restarting the daemon.
>> + </p>
>> + </column>
>> +
>> <column name="other_config" key="vhost-sock-dir"
>> type='{"type": "string"}'>
>> <p>
>> --
>> 2.7.4
>>
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev