On Wed, Apr 13, 2022 at 3:24 PM Kevin Traynor <[email protected]> wrote: > > Shared memory mempools may be currently be shared between DPDK > ports based on port MTU and NUMA. With some hint from the user > we can increase the sharing on MTU and hence reduce memory > consumption in many cases. > > For example, a port with MTU 9000, uses a mempool with an > mbuf size based on 9000 MTU. A port with MTU 1500, uses a > different mempool with an mbuf size based on 1500 MTU. > > In this case, assuming same NUMA, both these ports could > share the 9000 MTU mempool. > > The user must give a hint as order of creation of ports and > setting of MTUs may vary and we need to ensure that upgrades > from older OVS versions do not require more memory. > > This scheme can also prevent multiple mempools being created > for cases where a port is added picking up a default MTU and > an appropriate mempool, but later has it's MTU changed to a > different value requiring a different mempool. > > Example usage: > > $ ovs-vsctl --no-wait set Open_vSwitch . \ > other_config:dpdk-shared-memory-config=9000,1500:1,6000:1 > > Port added on NUMA 0: > * MTU 1500, use mempool based on 9000 MTU > * MTU 6000, use mempool based on 9000 MTU > * MTU 9000, use mempool based on 9000 MTU > * MTU 9300, use mempool based on 9300 MTU (existing behaviour) > > Port added on NUMA 1: > * MTU 1500, use mempool based on 1500 MTU > * MTU 6000, use mempool based on 6000 MTU > * MTU 9000, use mempool based on 9000 MTU > * MTU 9300, use mempool based on 9300 MTU (existing behaviour) > > Default behaviour is unchanged and mempools are still only created > when needed.
That's an interesting feature. I have a first found of comments. My first concern is on moving the code to lib/dpdk*. The first patch of the series moves netdev notions to the "generic" dpdk code while there was an effort (from the introduction of the lib/dpdk* files) to keep things separated. I tried squashing both patches and moving everything in lib/netdev-dpdk.c, it builds fine though I did not test it yet. On the parameter name itself, dpdk-shared-memory-config is confusing. There is a shared memory in DPDK (for multiprocess support) but this parameter has nothing to do with it. We need at least a mention of mempool in the name, but simply going with "dpdk-shared-mempool-config" is still strange, because there can be mempools internally of dpdk. Maybe netdev-dpdk-shared-mempool-config? More comments inline: > > Signed-off-by: Kevin Traynor <[email protected]> > --- > Documentation/topics/dpdk/memory.rst | 43 +++++++++++ > lib/dpdk-stub.c | 8 +++ > lib/dpdk.c | 104 +++++++++++++++++++++++++++ > lib/dpdk.h | 2 + > lib/netdev-dpdk.c | 4 ++ > vswitchd/vswitch.xml | 37 ++++++++++ > 6 files changed, 198 insertions(+) > > diff --git a/Documentation/topics/dpdk/memory.rst > b/Documentation/topics/dpdk/memory.rst > index 7a9d02c18..8bf216fc6 100644 > --- a/Documentation/topics/dpdk/memory.rst > +++ b/Documentation/topics/dpdk/memory.rst > @@ -214,2 +214,45 @@ Example 3: (2 rxq, 2 PMD, 9000 MTU) > Mbuf size = 10176 Bytes > Memory required = 26656 * 10176 = 271 MB > + > +Shared Memory Configuration > +--------------------------- Not sure it is needed, but in the rest of this doc, sections start with an empty line after their title. > +In order to increase sharing of mempools, a user can configure the MTUs which > +mempools are based on by using ``dpdk-shared-memory-config``. > + > +A MTU configured bt the user is adjusted to an mbuf size used for mempool by* a* > +creation and stored. If a port is subsequently added that has an MTU which > can > +be accommodated by this mbuf size, it will be used for mempool > creation/reuse. > + > +This can increase sharing by consolidating mempools for ports with different > +MTUs which would otherwise use separate mempools. It can also help to remove > +the need for mempools being created after a port is added but before it's MTU > +is changed to a different value. > + > +For example, on a 2 NUMA system:: > + > + $ ovs-vsctl ovs-vsctl --no-wait set Open_vSwitch . \ > + other_config:dpdk-shared-memory-config=9000,1500:1,6000:1 > + > + > +In this case, OVS stores the mbuf sizes based on the following MTUs. > + > +* NUMA 0: 9000 > +* NUMA 1: 1500, 6000, 9000 > + > +Ports added will use mempools with the mbuf sizes based on the above MTUs > where > +possible. If there is more than one suitable, the one closest to the MTU will > +be selected. > + > +Port added on NUMA 0: > + > +* MTU 1500, use mempool based on 9000 MTU > +* MTU 6000, use mempool based on 9000 MTU > +* MTU 9000, use mempool based on 9000 MTU > +* MTU 9300, use mempool based on 9300 MTU (existing behaviour) > + > +Port added on NUMA 1: > + > +* MTU 1500, use mempool based on 1500 MTU > +* MTU 6000, use mempool based on 6000 MTU > +* MTU 9000, use mempool based on 9000 MTU > +* MTU 9300, use mempool based on 9300 MTU (existing behaviour) > diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c > index 947af0abd..272357ac3 100644 > --- a/lib/dpdk-stub.c > +++ b/lib/dpdk-stub.c > @@ -101,2 +101,10 @@ dpdk_buf_size(int mtu OVS_UNUSED) > return 0; > } > + > +int > +dpdk_get_user_adjusted_mtu(int port_adj_mtu OVS_UNUSED, > + int port_mtu OVS_UNUSED, > + int port_socket_id OVS_UNUSED) > +{ > + return 0; > +} > diff --git a/lib/dpdk.c b/lib/dpdk.c > index e1836a168..90897437d 100644 > --- a/lib/dpdk.c > +++ b/lib/dpdk.c > @@ -36,4 +36,5 @@ > #include "netdev-offload-provider.h" > #include "openvswitch/dynamic-string.h" > +#include "openvswitch/ofp-parse.h" > #include "openvswitch/vlog.h" > #include "ovs-atomic.h" > @@ -58,4 +59,12 @@ static bool per_port_memory = false; /* Status of per port > memory support */ > static atomic_bool dpdk_initialized = ATOMIC_VAR_INIT(false); > > +struct user_mempool_config { > + int adj_mtu; > + int socket_id; > +}; > + > +static struct user_mempool_config *user_mempools = NULL; > +static int n_user_mempools; > + > static int > process_vhost_flags(char *flag, const char *default_val, int size, > @@ -344,4 +353,61 @@ malloc_dump_stats_wrapper(FILE *stream) > } > > +static int No strong opinion, but nothing is done with return value in caller. I would make this function void and add logs where parsing fails. > +parse_user_mempools_list(const char *mtus) > +{ > + char *list, *key, *value; > + int error = 0; > + > + if (!mtus) { > + return 0; > + } > + > + n_user_mempools = 0; > + list = xstrdup(mtus); list is leaked at the end of the function. > + > + while (ofputil_parse_key_value(&list, &key, &value)) { > + int socket_id, mtu, i, adj_mtu; > + > + if (!str_to_int(key, 0, &mtu) || mtu < 0) { > + error = EINVAL; > + break; Breaking means we keep in user_mempools[] what had been parsed of the passed string so far. With a half setup configuration, the behavior might be hard to understand for users. Inconsistency could be detected a long time after configuring too. I'd rather log an error here, and reset user_mempools[] and/or n_user_mempools. > + } > + > + if (!str_to_int(value, 0, &socket_id)) { > + /* No socket specified. It will apply for all numas. */ > + socket_id = INT_MAX; > + } else if (socket_id < 0) { > + error = EINVAL; Same comment here about break; > + break; > + } > + > + adj_mtu = FRAME_LEN_TO_MTU(dpdk_buf_size(mtu)); > + > + for (i = 0; i < n_user_mempools; i++) { > + if (user_mempools[i].adj_mtu == adj_mtu && > + (user_mempools[i].socket_id == INT_MAX || > + user_mempools[i].socket_id == socket_id)) { On this "duplicate" detection, do we need it? It won't handle the following case: If passing other_config:dpdk-shared-memory-config=9000,9000:0, we get: 2022-05-04T12:23:16.654Z|00028|netdev_dpdk|INFO|User configured shared memory mempool set for: MTU 9000, NUMA ALL. 2022-05-04T12:23:16.654Z|00029|netdev_dpdk|INFO|User configured shared memory mempool with mbuf size of 9326 already set that caters for: MTU 9000, NUMA 0. Otoh, passing other_config:dpdk-shared-memory-config=9000:0,9000: 2022-05-04T12:18:17.525Z|00028|netdev_dpdk|INFO|User configured shared memory mempool set for: MTU 9000, NUMA 0. 2022-05-04T12:18:17.525Z|00029|netdev_dpdk|INFO|User configured shared memory mempool set for: MTU 9000, NUMA ALL. If we keep it and, iirc the coding style guidelines, should it be? + if (user_mempools[i].adj_mtu == adj_mtu && + && (user_mempools[i].socket_id == INT_MAX + || user_mempools[i].socket_id == socket_id)) { > + /* Already have this adjusted mtu. */ > + VLOG_INFO("User configured shared memory mempool with " > + "mbuf size of %d already set that caters for: " > + "MTU %d, NUMA %s." ,adj_mtu, mtu, s/ ,/, / That's an info level log which means users will see it. Displaying the adjusted mtu could be confusing to a user who only feeds "normal" (from his pov) mtu. > + socket_id == INT_MAX ? "ALL" : value); > + break; > + } > + } > + if (i == n_user_mempools) { > + /* Existing entry does not cover this mtu, create a new one. */ > + user_mempools = xrealloc(user_mempools, (n_user_mempools + 1) * > + sizeof(struct user_mempool_config)); > + user_mempools[n_user_mempools].adj_mtu = adj_mtu; > + user_mempools[n_user_mempools++].socket_id = socket_id; No strong opinion, but I prefer to put the increment aside from filling the current entry. If a field is added later by an inattentive developer as: user_mempools[n_user_mempools++].socket_id = socket_id; + user_mempools[n_user_mempools++].new_field = new_field; It would be such a pity if reviewers missed that :-). > + VLOG_INFO("User configured shared memory mempool set for: " > + "MTU %d, NUMA %s.",mtu, > + socket_id == INT_MAX ? "ALL" : value); > + } > + } > + return error; > +} > + > static bool > dpdk_init__(const struct smap *ovs_other_config) > @@ -354,4 +420,6 @@ dpdk_init__(const struct smap *ovs_other_config) > struct ovs_numa_dump *affinity = NULL; > struct svec args = SVEC_EMPTY_INITIALIZER; > + const char *mempoolcfg = smap_get(ovs_other_config, > + "dpdk-shared-memory-config"); > > log_stream = fopencookie(NULL, "w+", dpdk_log_func); > @@ -406,4 +474,6 @@ dpdk_init__(const struct smap *ovs_other_config) > per_port_memory ? "enabled" : "disabled"); > > + parse_user_mempools_list(mempoolcfg); > + > svec_add(&args, ovs_get_program_name()); > construct_dpdk_args(ovs_other_config, &args); > @@ -653,2 +723,36 @@ dpdk_buf_size(int mtu) > + RTE_PKTMBUF_HEADROOM; > } > + > +int > +dpdk_get_user_adjusted_mtu(int port_adj_mtu, int port_mtu, int > port_socket_id) > +{ > + int best_adj_user_mtu = INT_MAX; > + > + for (unsigned i = 0; i < n_user_mempools; i++) { > + int user_adj_mtu, user_socket_id; > + > + user_adj_mtu = user_mempools[i].adj_mtu; > + user_socket_id = user_mempools[i].socket_id; > + if (port_adj_mtu > user_adj_mtu || > + (user_socket_id != INT_MAX && > + user_socket_id != port_socket_id)) { Same comment than above about coding style. > + continue; > + } > + if (user_adj_mtu - port_adj_mtu < best_adj_user_mtu - port_adj_mtu) { I must be missing something here. What is the point of substracting port_adj_mtu to both sides of the comparison? > + /* This is the is the lowest valid user MTU. */ > + best_adj_user_mtu = user_adj_mtu; > + } > + } > + > + if (best_adj_user_mtu == INT_MAX) { > + VLOG_DBG("No user configured shared memory mempool mbuf sizes found " > + "suitable for port with MTU %d, NUMA %d.", port_mtu, > + port_socket_id); > + best_adj_user_mtu = port_adj_mtu; > + } else { > + VLOG_DBG("Found user configured shared memory mempool with mbufs " > + "of size %d, suitable for port with MTU %d, NUMA %d.", > + best_adj_user_mtu, port_mtu, port_socket_id); > + } > + return best_adj_user_mtu; > +} > diff --git a/lib/dpdk.h b/lib/dpdk.h > index 8ec43f57c..f43b80ac0 100644 > --- a/lib/dpdk.h > +++ b/lib/dpdk.h > @@ -64,4 +64,6 @@ void print_dpdk_version(void); > void dpdk_status(const struct ovsrec_open_vswitch *); > uint32_t dpdk_buf_size(int mtu); > +int dpdk_get_user_adjusted_mtu(int port_adj_mtu, int port_mtu, > + int port_socket_id); > > #endif /* dpdk.h */ > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c > index c9877addb..fbf8fea24 100644 > --- a/lib/netdev-dpdk.c > +++ b/lib/netdev-dpdk.c > @@ -766,4 +766,8 @@ dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool > per_port_mp) > * to see if reuse is possible. */ > if (!per_port_mp) { > + /* If user has provided defined mempools, check if one is suitable > + * and get new buffer size.*/ > + mtu = dpdk_get_user_adjusted_mtu(mtu, dev->requested_mtu, > + dev->requested_socket_id); > LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) { > if (dmp->socket_id == dev->requested_socket_id > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml > index 0c6632617..925bc97bc 100644 > --- a/vswitchd/vswitch.xml > +++ b/vswitchd/vswitch.xml > @@ -491,4 +491,41 @@ > </column> > > + <column name="other_config" key="dpdk-shared-memory-config"> > + <p>Specifies dpdk shared memory mempool config.</p> > + <p>Value should be set in the following form:</p> > + <p> > + <code>other_config:dpdk-shared-mempory-config=< > + user-shared-mempool-mtu-list></code> > + </p> > + <p>where</p> > + <p> > + <ul> > + <li> > + <user-shared-mempool-mtu-list> ::= > + NULL | <non-empty-list> > + </li> > + <li> > + <non-empty-list> ::= <user-mtus> | > + <user-mtus> , > + <non-empty-list> > + </li> > + <li> > + <user-mtus> ::= <mtu-all-socket> | > + <mtu-socket-pair> > + </li> > + <li> > + <mtu-all-socket> ::= <mtu> > + </li> > + <li> > + <mtu-socket-pair> ::= <mtu> : > <socket-id> > + </li> > + </ul> > + </p> > + <p> > + Changing this value requires restarting the daemon if dpdk-init has > + already been set to true. > + </p> > + </column> > + > <column name="other_config" key="tx-flush-interval" > type='{"type": "integer", > -- > 2.34.1 > -- David Marchand _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
