On 04/05/2022 14:59, David Marchand wrote:
On Wed, Apr 13, 2022 at 3:24 PM Kevin Traynor <[email protected]>
wrote:

Shared memory mempools may be currently be shared between DPDK ports based on port MTU and NUMA. With some hint from the user we
can increase the sharing on MTU and hence reduce memory consumption
in many cases.

For example, a port with MTU 9000, uses a mempool with an mbuf size
based on 9000 MTU. A port with MTU 1500, uses a different mempool
with an mbuf size based on 1500 MTU.

In this case, assuming same NUMA, both these ports could share the
9000 MTU mempool.

The user must give a hint as order of creation of ports and setting
of MTUs may vary and we need to ensure that upgrades from older OVS
versions do not require more memory.

This scheme can also prevent multiple mempools being created for
cases where a port is added picking up a default MTU and an
appropriate mempool, but later has it's MTU changed to a different
value requiring a different mempool.

Example usage:

$ ovs-vsctl --no-wait set Open_vSwitch . \ other_config:dpdk-shared-memory-config=9000,1500:1,6000:1

Port added on NUMA 0: * MTU 1500, use mempool based on 9000 MTU *
MTU 6000, use mempool based on 9000 MTU * MTU 9000, use mempool
based on 9000 MTU * MTU 9300, use mempool based on 9300 MTU
(existing behaviour)

Port added on NUMA 1: * MTU 1500, use mempool based on 1500 MTU *
MTU 6000, use mempool based on 6000 MTU * MTU 9000, use mempool
based on 9000 MTU * MTU 9300, use mempool based on 9300 MTU
(existing behaviour)

Default behaviour is unchanged and mempools are still only created when needed.

That's an interesting feature.

I have a first found of comments.


Hi David. Thanks for the review and comments.


My first concern is on moving the code to lib/dpdk*. The first patch
of the series moves netdev notions to the "generic" dpdk code while
there was an effort (from the introduction of the lib/dpdk* files) to
keep things separated. I tried squashing both patches and moving
everything in lib/netdev-dpdk.c, it builds fine though I did not test
it yet.


I was basing it from the "per-port-memory" but on balance I agree with you and it will reduce code churn which is never a bad thing.


On the parameter name itself, dpdk-shared-memory-config is
confusing.
There is a shared memory in DPDK (for multiprocess support) but this parameter has nothing to do with it. We need at least a mention of
mempool in the name, but simply going with
"dpdk-shared-mempool-config" is still strange, because there can be
mempools internally of dpdk. Maybe
netdev-dpdk-shared-mempool-config?


I agree s/memory/mempool/ sounds better. I had used 'memory' because the option to select shared/per-port mempools is 'per-port-memory=true/false'.

Anyway, I think it can be ok to use 'mempool' in the new command because it is more intuitive IMHO.

I'm not sure about the 'netdev-dpdk' prefix, it seems a bit wordy and getting into OVS layers when it's not needed. Maybe can just do like the per-port one and drop the prefix altogether. "shared-mempool-config" ?

(P.S. Naming is hard)


More comments inline:


Signed-off-by: Kevin Traynor <[email protected]> --- Documentation/topics/dpdk/memory.rst | 43 +++++++++++ lib/dpdk-stub.c | 8 +++ lib/dpdk.c
| 104 +++++++++++++++++++++++++++ lib/dpdk.h
| 2 + lib/netdev-dpdk.c | 4 ++ vswitchd/vswitch.xml | 37 ++++++++++ 6 files
changed, 198 insertions(+)

diff --git a/Documentation/topics/dpdk/memory.rst
b/Documentation/topics/dpdk/memory.rst index 7a9d02c18..8bf216fc6
100644 --- a/Documentation/topics/dpdk/memory.rst +++
b/Documentation/topics/dpdk/memory.rst @@ -214,2 +214,45 @@ Example
3: (2 rxq, 2 PMD, 9000 MTU) Mbuf size = 10176 Bytes Memory required
= 26656 * 10176 = 271 MB + +Shared Memory Configuration +---------------------------

Not sure it is needed, but in the rest of this doc, sections start with an empty line after their title.


ok, will keep it consistent.

+In order to increase sharing of mempools, a user can configure the
MTUs which +mempools are based on by using
``dpdk-shared-memory-config``. + +A MTU configured bt the user is
adjusted to an mbuf size used for mempool

by* a*


ack


+creation and stored. If a port is subsequently added that has an
MTU which can +be accommodated by this mbuf size, it will be used
for mempool creation/reuse. + +This can increase sharing by
consolidating mempools for ports with different +MTUs which would
otherwise use separate mempools. It can also help to remove +the
need for mempools being created after a port is added but before
it's MTU +is changed to a different value. + +For example, on a 2
NUMA system:: + +  $ ovs-vsctl ovs-vsctl --no-wait set Open_vSwitch
. \ + other_config:dpdk-shared-memory-config=9000,1500:1,6000:1 + + +In this case, OVS stores the mbuf sizes based on the following
MTUs. + +* NUMA 0: 9000 +* NUMA 1: 1500, 6000, 9000 + +Ports added
will use mempools with the mbuf sizes based on the above MTUs
where +possible. If there is more than one suitable, the one
closest to the MTU will +be selected. + +Port added on NUMA 0: + +*
MTU 1500, use mempool based on 9000 MTU +* MTU 6000, use mempool
based on 9000 MTU +* MTU 9000, use mempool based on 9000 MTU +* MTU
9300, use mempool based on 9300 MTU (existing behaviour) + +Port
added on NUMA 1: + +* MTU 1500, use mempool based on 1500 MTU +*
MTU 6000, use mempool based on 6000 MTU +* MTU 9000, use mempool
based on 9000 MTU +* MTU 9300, use mempool based on 9300 MTU
(existing behaviour) diff --git a/lib/dpdk-stub.c
b/lib/dpdk-stub.c index 947af0abd..272357ac3 100644 ---
a/lib/dpdk-stub.c +++ b/lib/dpdk-stub.c @@ -101,2 +101,10 @@
dpdk_buf_size(int mtu OVS_UNUSED) return 0; } + +int +dpdk_get_user_adjusted_mtu(int port_adj_mtu OVS_UNUSED, +
int port_mtu OVS_UNUSED, +                           int
port_socket_id OVS_UNUSED) +{ +    return 0; +} diff --git
a/lib/dpdk.c b/lib/dpdk.c index e1836a168..90897437d 100644 ---
a/lib/dpdk.c +++ b/lib/dpdk.c @@ -36,4 +36,5 @@ #include
"netdev-offload-provider.h" #include
"openvswitch/dynamic-string.h" +#include "openvswitch/ofp-parse.h" #include "openvswitch/vlog.h" #include "ovs-atomic.h" @@ -58,4
+59,12 @@ static bool per_port_memory = false; /* Status of per
port memory support */ static atomic_bool dpdk_initialized =
ATOMIC_VAR_INIT(false);

+struct user_mempool_config { +    int adj_mtu; +    int
socket_id; +}; + +static struct user_mempool_config *user_mempools
= NULL; +static int n_user_mempools; + static int process_vhost_flags(char *flag, const char *default_val, int size, @@ -344,4 +353,61 @@ malloc_dump_stats_wrapper(FILE *stream) }

+static int

No strong opinion, but nothing is done with return value in caller. I
would make this function void and add logs where parsing fails.


Good suggestion, I agree.


+parse_user_mempools_list(const char *mtus) +{ +    char *list,
*key, *value; +    int error = 0; + +    if (!mtus) { +
return 0; +    } + +    n_user_mempools = 0; +    list =
xstrdup(mtus);

list is leaked at the end of the function.


oops, thanks.


+ +    while (ofputil_parse_key_value(&list, &key, &value)) { +
int socket_id, mtu, i, adj_mtu; + +        if (!str_to_int(key, 0,
&mtu) || mtu < 0) { +            error = EINVAL; +
break;

Breaking means we keep in user_mempools[] what had been parsed of
the passed string so far. With a half setup configuration, the
behavior might be hard to understand for users. Inconsistency could
be detected a long time after configuring too.

I'd rather log an error here, and reset user_mempools[] and/or
n_user_mempools.


Right, so in that way it is a very clear status for the user. That makes sense.


+        } + +        if (!str_to_int(value, 0, &socket_id)) { +
/* No socket specified. It will apply for all numas. */ +
socket_id = INT_MAX; +        } else if (socket_id < 0) { +
error = EINVAL;

Same comment here about break;


ack


+            break; +        } + +        adj_mtu =
FRAME_LEN_TO_MTU(dpdk_buf_size(mtu)); + +        for (i = 0; i <
n_user_mempools; i++) { +            if (user_mempools[i].adj_mtu
== adj_mtu && +                    (user_mempools[i].socket_id ==
INT_MAX || +                    user_mempools[i].socket_id ==
socket_id)) {

On this "duplicate" detection, do we need it? It won't handle the
following case:

If passing other_config:dpdk-shared-memory-config=9000,9000:0, we
get: 2022-05-04T12:23:16.654Z|00028|netdev_dpdk|INFO|User configured
shared memory mempool set for: MTU 9000, NUMA ALL. 2022-05-04T12:23:16.654Z|00029|netdev_dpdk|INFO|User configured
shared memory mempool with mbuf size of 9326 already set that caters
for: MTU 9000, NUMA 0.

Otoh, passing other_config:dpdk-shared-memory-config=9000:0,9000: 2022-05-04T12:18:17.525Z|00028|netdev_dpdk|INFO|User configured shared memory mempool set for: MTU 9000, NUMA 0. 2022-05-04T12:18:17.525Z|00029|netdev_dpdk|INFO|User configured
shared memory mempool set for: MTU 9000, NUMA ALL.


That's a good point. It is probably simpler to just remove the duplicate checking, as there is an ALL option. Otherwise it would need to be tracking that a (system specific) number of numas in an ALL is catered for through multiple existing entries etc. In the end it will have the same result of finding an mtu/numa or not, but would add a lot of complexity for little value.


If we keep it and, iirc the coding style guidelines, should it be? +
if (user_mempools[i].adj_mtu == adj_mtu && +                &&
(user_mempools[i].socket_id == INT_MAX +                    ||
user_mempools[i].socket_id == socket_id)) {


ack (removed this check)


+                /* Already have this adjusted mtu. */ +
VLOG_INFO("User configured shared memory mempool with " +
"mbuf size of %d already set that caters for: " +
"MTU %d, NUMA %s." ,adj_mtu, mtu,

s/ ,/, /


ack (removed this log)

That's an info level log which means users will see it. Displaying
the adjusted mtu could be confusing to a user who only feeds "normal"
(from his pov) mtu.


ack. I removed this duplicate check and this log.

On the way the logs were, it had been done as a way of informing the user about why say '900,1000' did not create different mempools. i.e. because they rounded to the same adjusted mtu size.

If the duplicate check and log had of remained - based on your comment, I would have changed it to debug level. I had thought the user could reduce their args from this log, but it would be unnecessary and it's more resilient to any future internal adjustment changes that the user just lists the mtus they want.

(Writing so I'll remember this if the log ever comes back)


+                          socket_id == INT_MAX ? "ALL" : value); +
break; +            } +        } +        if (i == n_user_mempools)
{ +            /* Existing entry does not cover this mtu, create a
new one. */ +            user_mempools = xrealloc(user_mempools,
(n_user_mempools + 1) * +
sizeof(struct user_mempool_config)); +
user_mempools[n_user_mempools].adj_mtu = adj_mtu; +
user_mempools[n_user_mempools++].socket_id = socket_id;

No strong opinion, but I prefer to put the increment aside from filling the current entry.

If a field is added later by an inattentive developer as:

OVS doesn't have inattentive developers, you must be thinking of a different open source project ;-)

Seriously though, ack, will separate them.

user_mempools[n_user_mempools++].socket_id = socket_id; +
user_mempools[n_user_mempools++].new_field = new_field;

It would be such a pity if reviewers missed that :-).


+            VLOG_INFO("User configured shared memory mempool set
for: " +                      "MTU %d, NUMA %s.",mtu, +
socket_id == INT_MAX ? "ALL" : value); +        } +    } +
return error; +} + static bool dpdk_init__(const struct smap
*ovs_other_config) @@ -354,4 +420,6 @@ dpdk_init__(const struct
smap *ovs_other_config) struct ovs_numa_dump *affinity = NULL; struct svec args = SVEC_EMPTY_INITIALIZER; + const char
*mempoolcfg = smap_get(ovs_other_config, +
"dpdk-shared-memory-config");

log_stream = fopencookie(NULL, "w+", dpdk_log_func); @@ -406,4
+474,6 @@ dpdk_init__(const struct smap *ovs_other_config) per_port_memory ? "enabled" : "disabled");

+    parse_user_mempools_list(mempoolcfg); + svec_add(&args,
ovs_get_program_name()); construct_dpdk_args(ovs_other_config,
&args); @@ -653,2 +723,36 @@ dpdk_buf_size(int mtu) +
RTE_PKTMBUF_HEADROOM; } + +int +dpdk_get_user_adjusted_mtu(int
port_adj_mtu, int port_mtu, int port_socket_id) +{ +    int
best_adj_user_mtu = INT_MAX; + +    for (unsigned i = 0; i <
n_user_mempools; i++) { + int user_adj_mtu, user_socket_id; + + user_adj_mtu = user_mempools[i].adj_mtu; +
user_socket_id = user_mempools[i].socket_id; +        if
(port_adj_mtu > user_adj_mtu || +                (user_socket_id !=
INT_MAX && +                 user_socket_id != port_socket_id)) {

Same comment than above about coding style.


ack


+            continue; +        } +        if (user_adj_mtu -
port_adj_mtu < best_adj_user_mtu - port_adj_mtu) {

I must be missing something here. What is the point of substracting
port_adj_mtu to both sides of the comparison?


Ah, I was just thinking about finding the gap between each one and port_adj_mtu and then comparing them. You are right, we can skip the subtractions and just compare.

Just sent a v3, thanks.


+            /* This is the is the lowest valid user MTU. */ +
best_adj_user_mtu  = user_adj_mtu; +        } +    } + +    if
(best_adj_user_mtu == INT_MAX) { +        VLOG_DBG("No user
configured shared memory mempool mbuf sizes found " +
"suitable for port with MTU %d, NUMA %d.", port_mtu, +
port_socket_id); +        best_adj_user_mtu = port_adj_mtu; +    }
else { +        VLOG_DBG("Found user configured shared memory
mempool with mbufs " +                 "of size %d, suitable for
port with MTU %d, NUMA %d.", +                 best_adj_user_mtu,
port_mtu, port_socket_id); + } + return best_adj_user_mtu; +} diff --git a/lib/dpdk.h b/lib/dpdk.h index 8ec43f57c..f43b80ac0
100644 --- a/lib/dpdk.h +++ b/lib/dpdk.h @@ -64,4 +64,6 @@ void
print_dpdk_version(void); void dpdk_status(const struct
ovsrec_open_vswitch *); uint32_t dpdk_buf_size(int mtu); +int
dpdk_get_user_adjusted_mtu(int port_adj_mtu, int port_mtu, +
int port_socket_id);

#endif /* dpdk.h */ diff --git a/lib/netdev-dpdk.c
b/lib/netdev-dpdk.c index c9877addb..fbf8fea24 100644 ---
a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -766,4 +766,8 @@
dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp) *
to see if reuse is possible. */ if (!per_port_mp) { +        /* If
user has provided defined mempools, check if one is suitable +
* and get new buffer size.*/ +        mtu =
dpdk_get_user_adjusted_mtu(mtu, dev->requested_mtu, +
dev->requested_socket_id); LIST_FOR_EACH (dmp, list_node,
&dpdk_mp_list) { if (dmp->socket_id == dev->requested_socket_id diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index
0c6632617..925bc97bc 100644 --- a/vswitchd/vswitch.xml +++
b/vswitchd/vswitch.xml @@ -491,4 +491,41 @@ </column>

+      <column name="other_config"
key="dpdk-shared-memory-config"> +              <p>Specifies dpdk
shared memory mempool config.</p> +              <p>Value should be
set in the following form:</p> +              <p> +
<code>other_config:dpdk-shared-mempory-config=&lt; +
user-shared-mempool-mtu-list&gt;</code> +              </p> +
<p>where</p> +              <p> +                <ul> +
<li> + &lt;user-shared-mempool-mtu-list&gt; ::= + NULL | &lt;non-empty-list&gt; +
</li> +                  <li> +
&lt;non-empty-list&gt; ::= &lt;user-mtus&gt; | +
&lt;user-mtus&gt; , +
&lt;non-empty-list&gt; +                  </li> +
<li> +                    &lt;user-mtus&gt; ::=
&lt;mtu-all-socket&gt; | +
&lt;mtu-socket-pair&gt; +                  </li> +
<li> +                    &lt;mtu-all-socket&gt; ::= &lt;mtu&gt; +
</li> +                  <li> +
&lt;mtu-socket-pair&gt; ::= &lt;mtu&gt; : &lt;socket-id&gt; +
</li> +                </ul> +              </p> +        <p> +
Changing this value requires restarting the daemon if dpdk-init
has +          already been set to true. +        </p> +
</column> + <column name="other_config" key="tx-flush-interval" type='{"type": "integer", -- 2.34.1




_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to