Thanks for your feedback. Please find my comments inline. Regards, Anurag
> -----Original Message----- > From: [email protected] <[email protected]> > Sent: Friday, June 17, 2022 2:29 PM > To: Anurag Agarwal <[email protected]> > Cc: [email protected] > Subject: Re: [ovs-dev] [PATCH v4] dpif-netdev: Allow cross-NUMA polling on > selected ports > > On Wed, Jun 08, 2022 at 03:58:31PM +0530, Anurag Agarwal wrote: > > From: Jan Scheurich <[email protected]> > > > > Today dpif-netdev considers PMD threads on a non-local NUMA node for > > automatic assignment of the rxqs of a port only if there are no local, non-isolated PMDs. > > > > On typical servers with both physical ports on one NUMA node, this > > often leaves the PMDs on the other NUMA node under-utilized, wasting CPU resources. > > The alternative, to manually pin the rxqs to PMDs on remote NUMA > > nodes, also has drawbacks as it limits OVS' ability to auto load-balance the rxqs. > > > > This patch introduces a new interface configuration option to allow > > ports to be automatically polled by PMDs on any NUMA node: > > > > ovs-vsctl set interface <Name> other_config:cross-numa-polling=true > > > > The group assignment algorithm now has the ability to select the lowest > > loaded PMD on any NUMA node, and not just the local NUMA node on which the rxq > > of the port resides. > > > > If this option is not present or set to false, legacy behaviour applies.
> > > > Co-authored-by: Anurag Agarwal <[email protected]> > > Signed-off-by: Jan Scheurich <[email protected]> > > Signed-off-by: Anurag Agarwal <[email protected]> > > --- > > > > Changes in this patch: > > - Addressed comments from Kevin Traynor > > > > Please refer to this thread for an earlier discussion on this topic: > > https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html > > --- > > Documentation/topics/dpdk/pmd.rst | 23 ++++++ > > lib/dpif-netdev.c | 130 ++++++++++++++++++++++-------- > > tests/pmd.at | 38 +++++++++ > > vswitchd/vswitch.xml | 20 +++++ > > 4 files changed, 177 insertions(+), 34 deletions(-) > > > > diff --git a/Documentation/topics/dpdk/pmd.rst > > b/Documentation/topics/dpdk/pmd.rst > > index b259cc8b3..387f962d1 100644 > > --- a/Documentation/topics/dpdk/pmd.rst > > +++ b/Documentation/topics/dpdk/pmd.rst > > @@ -99,6 +99,25 @@ core cycles for each Rx queue:: > > > > $ ovs-appctl dpif-netdev/pmd-rxq-show > > > > +Normally, Rx queues are assigned to PMD threads automatically. By > > +default OVS only assigns Rx queues to PMD threads executing on the > > +same NUMA node in order to avoid unnecessary latency for accessing > > +packet buffers across the NUMA boundary. Typically this overhead is > > +higher for vhostuser ports than for physical ports due to the packet > > +copy that is done for all rx packets. > > + > > +On NUMA servers with physical ports only on one NUMA node, the > > +NUMA-local polling policy can lead to an under-utilization of the PMD > > +threads on the remote NUMA node. For the overall OVS performance it > > +may in such cases be beneficial to utilize the spare capacity and > > +allow polling of a physical port's rxqs across NUMA nodes despite the overhead involved.
> > +The policy can be set per port with the following configuration option:: > > + > > + $ ovs-vsctl set Interface <iface> \ > > + other_config:cross-numa-polling=true|false > > + > > +The default value is false. > > + > > .. note:: > > > > A history of one minute is recorded and shown for each Rx queue to > > allow for @@ -115,6 +134,10 @@ core cycles for each Rx queue:: > > A ``overhead`` statistics is shown per PMD: it represents the number of > > cycles inherently consumed by the OVS PMD processing loop. > > > > +.. versionchanged:: 2.18.0 > > + > > + Added the interface parameter ``other_config:cross-numa-polling`` > > + > > Rx queue to PMD assignment takes place whenever there are > > configuration changes or can be triggered by using:: > > > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index > > ff57b3961..86f88964b 100644 > > --- a/lib/dpif-netdev.c > > +++ b/lib/dpif-netdev.c > > @@ -467,6 +467,7 @@ struct dp_netdev_port { > > char *type; /* Port type as requested by user. */ > > char *rxq_affinity_list; /* Requested affinity of rx queues. */ > > enum txq_req_mode txq_requested_mode; > > + bool cross_numa_polling; /* If true cross polling will be enabled. 
> > */ > > }; > > > > static bool dp_netdev_flow_ref(struct dp_netdev_flow *); @@ -2101,6 > > +2102,7 @@ port_create(const char *devname, const char *type, > > port->sf = NULL; > > port->emc_enabled = true; > > port->need_reconfigure = true; > > + port->cross_numa_polling = false; > > ovs_mutex_init(&port->txq_used_mutex); > > > > *portp = port; > > @@ -5013,6 +5015,7 @@ dpif_netdev_port_set_config(struct dpif *dpif, > odp_port_t port_no, > > bool emc_enabled = smap_get_bool(cfg, "emc-enable", true); > > const char *tx_steering_mode = smap_get(cfg, "tx-steering"); > > enum txq_req_mode txq_mode; > > + bool cross_numa_polling = smap_get_bool(cfg, > > + "cross-numa-polling", false); > > > > ovs_rwlock_wrlock(&dp->port_rwlock); > > error = get_port_by_number(dp, port_no, &port); @@ -5020,6 > > +5023,14 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t > port_no, > > goto unlock; > > } > > > > + if (cross_numa_polling != port->cross_numa_polling) { > > + port->cross_numa_polling = cross_numa_polling; > > + VLOG_INFO("%s:cross-numa-polling has been %s.", > > + netdev_get_name(port->netdev), > > + cross_numa_polling? "enabled" : "disabled"); > > + dp_netdev_request_reconfigure(dp); > > + } > > + > > if (emc_enabled != port->emc_enabled) { > > struct dp_netdev_pmd_thread *pmd; > > struct ds ds = DS_EMPTY_INITIALIZER; @@ -5751,7 +5762,7 @@ > > compare_rxq_cycles(const void *a, const void *b) > > > > static bool > > sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct > sched_pmd *pmd, > > - bool has_proc) > > + int port_numa_id, bool has_proc) > > { > > uint64_t current_num, pmd_num; > > > > @@ -5767,16 +5778,22 @@ sched_pmd_new_lowest(struct sched_pmd > *current_lowest, struct sched_pmd *pmd, > > pmd_num = pmd->n_rxq; > > } > > > > - if (pmd_num < current_num) { > > - return true; > > + if (pmd_num != current_num) { > > + return (pmd_num < current_num) ? 
true : false; > > } > > - return false; > > + > > + /* In case of a tie, select the new PMD only if the existing PMD > > + * assigned is from non-local NUMA > > + */ > > + return (current_lowest->numa->numa_id != port_numa_id && > > + pmd->numa->numa_id == port_numa_id) ? true : false; > > + > > } > > > > static struct sched_pmd * > > sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc) { > > - struct sched_pmd *lowest_sched_pmd = NULL; > > + struct sched_pmd *lowest_pmd = NULL; > > > > for (unsigned i = 0; i < numa->n_pmds; i++) { > > struct sched_pmd *sched_pmd; > > @@ -5785,11 +5802,11 @@ sched_pmd_get_lowest(struct sched_numa > *numa, bool has_cyc) > > if (sched_pmd->isolated) { > > continue; > > } > > - if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) { > > - lowest_sched_pmd = sched_pmd; > > + if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX, > has_cyc)) { > > + lowest_pmd = sched_pmd; > > } > > } > > - return lowest_sched_pmd; > > + return lowest_pmd; > > } > > > > /* > > @@ -5891,6 +5908,32 @@ get_rxq_cyc_log(char *a, enum > sched_assignment_type algo, uint64_t cycles) > > return a; > > } > > > > +static struct sched_pmd * > > +sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list, > > + int port_numa_id, bool has_proc) { > > + int n_numa; > > + struct sched_numa *numa = NULL; > > + struct sched_numa *last_numa = NULL; > > + struct sched_pmd *lowest_pmd = NULL; > > + struct sched_pmd *pmd; > > + > > + n_numa = sched_numa_list_count(numa_list); > > + /* For all numas. */ > > + for (int i = 0; i < n_numa; i++) { > > + last_numa = numa; > > + numa = sched_numa_list_next(numa_list, last_numa); > > + > > + /* Get the lowest pmd per numa. */ > > + pmd = sched_pmd_get_lowest(numa, has_proc); > > + > > + /* Check if it's the lowest pmd for all numas. 
*/ > > + if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, has_proc)) > { > > + lowest_pmd = pmd; > > + } > > + } > > + return lowest_pmd; > > +} > > + > > static void > > sched_numa_list_schedule(struct sched_numa_list *numa_list, > > struct dp_netdev *dp, @@ -5991,6 +6034,7 @@ > > sched_numa_list_schedule(struct sched_numa_list *numa_list, > > struct sched_pmd *sched_pmd = NULL; > > struct sched_numa *numa; > > int port_numa_id; > > + bool cross_numa = rxq->port->cross_numa_polling; > > It may be better to schedule non-cross-numa rxqs first, and then cross-numa > rxqs. > Otherwise, one numa may hold much more load because of later scheduled non-cross-numa rxqs. For each RxQ assignment, the least loaded PMD is chosen; it could be from NUMA0 or NUMA1, depending on where the least loaded PMD is present. The RxQs themselves are sorted by load and assigned one after the other; they are not assigned in any specific NUMA order. > > > uint64_t proc_cycles; > > char rxq_cyc_log[MAX_RXQ_CYC_STRLEN]; > > > > @@ -6000,33 +6044,40 @@ sched_numa_list_schedule(struct > sched_numa_list *numa_list, > > start_logged = true; > > } > > > > - /* Store the cycles for this rxq as we will log these later. */ > > + /* Store the cycles for this rxq as we will need these later. > > + */ > > proc_cycles = dp_netdev_rxq_get_cycles(rxq, > > RXQ_CYCLES_PROC_HIST); > > > > port_numa_id = netdev_get_numa_id(rxq->port->netdev); > > > > - /* Select numa. */ > > - numa = sched_numa_list_lookup(numa_list, port_numa_id); > > - > > - /* Check if numa has no PMDs or no non-isolated PMDs. */ > > - if (!numa || !sched_numa_noniso_pmd_count(numa)) { > > - /* Unable to use this numa to find a PMD. */ > > - numa = NULL; > > - /* Find any numa with available PMDs. 
*/ > > - for (int j = 0; j < n_numa; j++) { > > - numa = sched_numa_list_next(numa_list, last_cross_numa); > > - last_cross_numa = numa; > > - if (sched_numa_noniso_pmd_count(numa)) { > > - break; > > - } > > + if (cross_numa && algo == SCHED_GROUP) { > > + /* cross_numa polling enabled so find lowest loaded pmd across > > + * all numas. */ > > + sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, > port_numa_id, > > + proc_cycles); > > + } else { > > + /* Select numa. */ > > + numa = sched_numa_list_lookup(numa_list, port_numa_id); > > + > > + /* Check if numa has no PMDs or no non-isolated PMDs. */ > > + if (!numa || !sched_numa_noniso_pmd_count(numa)) { > > + /* Unable to use this numa to find a PMD. */ > > numa = NULL; > > + /* Find any numa with available PMDs. */ > > + for (int j = 0; j < n_numa; j++) { > > + numa = sched_numa_list_next(numa_list, > > last_cross_numa); > > + last_cross_numa = numa; > > + if (sched_numa_noniso_pmd_count(numa)) { > > + break; > > + } > > + numa = NULL; > > + } > > } > > - } > > > > - if (numa) { > > - /* Select the PMD that should be used for this rxq. */ > > - sched_pmd = sched_pmd_next(numa, algo, > > - proc_cycles ? true : false); > > + if (numa) { > > + /* Select the PMD that should be used for this rxq. */ > > + sched_pmd = sched_pmd_next(numa, algo, > > + proc_cycles ? true : false); > > + } > > } > > > > /* Check that a pmd has been selected. */ @@ -6036,12 > > +6087,23 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list, > > pmd_numa_id = sched_pmd->numa->numa_id; > > /* Check if selected pmd numa matches port numa. */ > > if (pmd_numa_id != port_numa_id) { > > - VLOG(level, "There's no available (non-isolated) pmd > > thread " > > - "on numa node %d. Port \'%s\' rx queue %d will > > " > > - "be assigned to a pmd on numa node %d. 
" > > - "This may lead to reduced performance.", > > - port_numa_id, netdev_rxq_get_name(rxq->rx), > > - netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id); > > + if (cross_numa) { > > + VLOG(level, "Cross-numa polling has been selected for " > > + "Port \'%s\' rx queue %d on numa node %d. " > > + "It will be assigned to a pmd on numa node > > " > > + "%d. This may lead to reduced > > performance.", > > + netdev_rxq_get_name(rxq->rx), > > + netdev_rxq_get_queue_id(rxq->rx), > > port_numa_id, > > + pmd_numa_id); > > + > > + } else { > > + VLOG(level, "There's no available (non-isolated) pmd " > > + "thread on numa node %d. Port \'%s\' rx > > queue " > > + "%d will be assigned to a pmd on numa node > > " > > + "%d. This may lead to reduced > > performance.", > > + port_numa_id, netdev_rxq_get_name(rxq->rx), > > + netdev_rxq_get_queue_id(rxq->rx), > > pmd_numa_id); > > + } > > } > > VLOG(level, "Core %2u on numa node %d assigned port \'%s\' " > > "rx queue %d%s.", diff --git a/tests/pmd.at > > b/tests/pmd.at index e6b173dab..5d2ec1382 100644 > > --- a/tests/pmd.at > > +++ b/tests/pmd.at > > @@ -579,6 +579,44 @@ > > icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a > > ,nw_src=10 > > OVS_VSWITCHD_STOP > > AT_CLEANUP > > > > +AT_SETUP([PMD - enable cross numa polling]) > > +OVS_VSWITCHD_START([add-port br0 p0 \ > > + -- set Interface p0 type=dummy-pmd options:n_rxq=4 \ > > + -- set Interface p0 options:numa_id=0 \ > > + -- set Open_vSwitch . other_config:pmd-cpu-mask=0x3 \ > > + -- set open_vswitch . 
> > other_config:pmd-rxq-assign=group], > > + [], [], [--dummy-numa 0,1]) > > + > > +AT_CHECK([ovs-ofctl add-flow br0 action=controller]) > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | > cut -f 3 -d ' ' | sort | uniq], [0], [dnl > > +0 > > +]) > > + > > +dnl Enable cross numa polling and check numa ids TMP=$(($(cat > > +ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1)) > > +PATTERN="cross-numa-polling has been enabled" > > +AT_CHECK([ovs-vsctl set Interface p0 > > +other_config:cross-numa-polling=true]) > > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"]) > > + > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | > cut -f 3 -d ' ' | sort | uniq], [0], [dnl > > +0 > > +1 > > +]) > > + > > +dnl Disable cross numa polling and check numa ids TMP=$(($(cat > > +ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1)) > > +PATTERN="cross-numa-polling has been disabled" > > +AT_CHECK([ovs-vsctl set Interface p0 > > +other_config:cross-numa-polling=false]) > > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"]) > > + > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | > cut -f 3 -d ' ' | sort | uniq], [0], [dnl > > +0 > > +]) > > + > > +OVS_VSWITCHD_STOP(["/|WARN|/d"]) > > +AT_CLEANUP > > + > > + > > AT_SETUP([PMD - change numa node]) > > OVS_VSWITCHD_START( > > [add-port br0 p1 -- set Interface p1 type=dummy-pmd > > ofport_request=1 options:n_rxq=2 -- \ diff --git > > a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index > > cc1dd77ec..c2e948bd9 100644 > > --- a/vswitchd/vswitch.xml > > +++ b/vswitchd/vswitch.xml > > @@ -3286,6 +3286,26 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 > type=patch options:peer=p1 \ > > </p> > > </column> > > > > + <column name="other_config" key="cross-numa-polling" > > + type='{"type": "boolean"}'> > > + <p> > > + Specifies if the RX queues of the port can be automatically > > assigned > > + to PMD threads on any NUMA node or only on the local 
NUMA node of > > + the port. > > + </p> > > + <p> > > + Polling of physical ports from a non-local PMD thread incurs some > > + performance penalty due to the access to packet data across the > > NUMA > > + barrier. This option can still increase the overall performance > > if > > + it allows better utilization of those non-local PMD threads. > > + It is most useful together with the auto load-balancing of RX > > queues > > + <ref table="Open_vSwitch" column="other_config" key="pmd-auto-lb"/>. > > + </p> > > + <p> > > + Defaults to false. > > + </p> > > + </column> > > + > > <column name="options" key="xdp-mode" > > type='{"type": "string", > > "enum": ["set", ["best-effort", > > "native-with-zerocopy", > > -- > > 2.25.1 > > > > _______________________________________________ > > dev mailing list > > [email protected] > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
