On Wed, Jun 22, 2022 at 09:32:29AM +0000, Anurag Agarwal wrote: > Thanks for your feedback. Please find my comments inline. > > Regards, > Anurag > > > -----Original Message----- > > From: lic...@chinatelecom.cn <lic...@chinatelecom.cn> > > Sent: Friday, June 17, 2022 2:29 PM > > To: Anurag Agarwal <anura...@gmail.com> > > Cc: ovs-dev@openvswitch.org > > Subject: Re: [ovs-dev] [PATCH v4] dpif-netdev: Allow cross-NUMA polling on > > selected ports > > > > On Wed, Jun 08, 2022 at 03:58:31PM +0530, Anurag Agarwal wrote: > > > From: Jan Scheurich <jan.scheur...@ericsson.com> > > > > > > Today dpif-netdev considers PMD threads on a non-local NUMA node for > > > automatic assignment of the rxqs of a port only if there are no local,non- > > isolated PMDs. > > > > > > On typical servers with both physical ports on one NUMA node, this > > > often leaves the PMDs on the other NUMA node under-utilized, wasting CPU > > resources. > > > The alternative, to manually pin the rxqs to PMDs on remote NUMA > > > nodes, also has drawbacks as it limits OVS' ability to auto load-balance > > > the > > rxqs. > > > > > > This patch introduces a new interface configuration option to allow > > > ports to be automatically polled by PMDs on any NUMA node: > > > > > > ovs-vsctl set interface <Name> other_config:cross-numa-polling=true > > > > > > The group assignment algorithm now has the ability to select lowest > > > loaded PMD on any NUMA, and not just the local NUMA on which the rxq > > > of the port resides > > > > > > If this option is not present or set to false, legacy behaviour applies. 
> > > > > > Co-authored-by: Anurag Agarwal <anurag.agar...@ericsson.com> > > > Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> > > > Signed-off-by: Anurag Agarwal <anurag.agar...@ericsson.com> > > > --- > > > > > > Changes in this patch: > > > - Addressed comments from Kevin Traynor > > > > > > Please refer to this thread for an earlier discussion on this topic: > > > https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html > > > --- > > > Documentation/topics/dpdk/pmd.rst | 23 ++++++ > > > lib/dpif-netdev.c | 130 ++++++++++++++++++++++-------- > > > tests/pmd.at | 38 +++++++++ > > > vswitchd/vswitch.xml | 20 +++++ > > > 4 files changed, 177 insertions(+), 34 deletions(-) > > > > > > diff --git a/Documentation/topics/dpdk/pmd.rst > > > b/Documentation/topics/dpdk/pmd.rst > > > index b259cc8b3..387f962d1 100644 > > > --- a/Documentation/topics/dpdk/pmd.rst > > > +++ b/Documentation/topics/dpdk/pmd.rst > > > @@ -99,6 +99,25 @@ core cycles for each Rx queue:: > > > > > > $ ovs-appctl dpif-netdev/pmd-rxq-show > > > > > > +Normally, Rx queues are assigned to PMD threads automatically. By > > > +default OVS only assigns Rx queues to PMD threads executing on the > > > +same NUMA node in order to avoid unnecessary latency for accessing > > > +packet buffers across the NUMA boundary. Typically this overhead is > > > +higher for vhostuser ports than for physical ports due to the packet > > > +copy that is done for all rx packets. > > > + > > > +On NUMA servers with physical ports only on one NUMA node, the > > > +NUMA-local polling policy can lead to an under-utilization of the PMD > > > +threads on the remote NUMA node.
For the overall OVS performance it > > > +may in such cases be beneficial to utilize the spare capacity and > > > +allow polling of a physical port's rxqs across NUMA nodes despite the > > overhead involved. > > > +The policy can be set per port with the following configuration option:: > > > + > > > + $ ovs-vsctl set Interface <iface> \ > > > + other_config:cross-numa-polling=true|false > > > + > > > +The default value is false. > > > + > > > .. note:: > > > > > > A history of one minute is recorded and shown for each Rx queue to > > > allow for @@ -115,6 +134,10 @@ core cycles for each Rx queue:: > > > A ``overhead`` statistics is shown per PMD: it represents the number > > > of > > > cycles inherently consumed by the OVS PMD processing loop. > > > > > > +.. versionchanged:: 2.18.0 > > > + > > > + Added the interface parameter ``other_config:cross-numa-polling`` > > > + > > > Rx queue to PMD assignment takes place whenever there are > > > configuration changes or can be triggered by using:: > > > > > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index > > > ff57b3961..86f88964b 100644 > > > --- a/lib/dpif-netdev.c > > > +++ b/lib/dpif-netdev.c > > > @@ -467,6 +467,7 @@ struct dp_netdev_port { > > > char *type; /* Port type as requested by user. */ > > > char *rxq_affinity_list; /* Requested affinity of rx queues. */ > > > enum txq_req_mode txq_requested_mode; > > > + bool cross_numa_polling; /* If true cross polling will be > > > enabled. 
*/ > > > }; > > > > > > static bool dp_netdev_flow_ref(struct dp_netdev_flow *); @@ -2101,6 > > > +2102,7 @@ port_create(const char *devname, const char *type, > > > port->sf = NULL; > > > port->emc_enabled = true; > > > port->need_reconfigure = true; > > > + port->cross_numa_polling = false; > > > ovs_mutex_init(&port->txq_used_mutex); > > > > > > *portp = port; > > > @@ -5013,6 +5015,7 @@ dpif_netdev_port_set_config(struct dpif *dpif, > > odp_port_t port_no, > > > bool emc_enabled = smap_get_bool(cfg, "emc-enable", true); > > > const char *tx_steering_mode = smap_get(cfg, "tx-steering"); > > > enum txq_req_mode txq_mode; > > > + bool cross_numa_polling = smap_get_bool(cfg, > > > + "cross-numa-polling", false); > > > > > > ovs_rwlock_wrlock(&dp->port_rwlock); > > > error = get_port_by_number(dp, port_no, &port); @@ -5020,6 > > > +5023,14 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t > > port_no, > > > goto unlock; > > > } > > > > > > + if (cross_numa_polling != port->cross_numa_polling) { > > > + port->cross_numa_polling = cross_numa_polling; > > > + VLOG_INFO("%s:cross-numa-polling has been %s.", > > > + netdev_get_name(port->netdev), > > > + cross_numa_polling? 
"enabled" : "disabled"); > > > + dp_netdev_request_reconfigure(dp); > > > + } > > > + > > > if (emc_enabled != port->emc_enabled) { > > > struct dp_netdev_pmd_thread *pmd; > > > struct ds ds = DS_EMPTY_INITIALIZER; @@ -5751,7 +5762,7 @@ > > > compare_rxq_cycles(const void *a, const void *b) > > > > > > static bool > > > sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct > > sched_pmd *pmd, > > > - bool has_proc) > > > + int port_numa_id, bool has_proc) > > > { > > > uint64_t current_num, pmd_num; > > > > > > @@ -5767,16 +5778,22 @@ sched_pmd_new_lowest(struct sched_pmd > > *current_lowest, struct sched_pmd *pmd, > > > pmd_num = pmd->n_rxq; > > > } > > > > > > - if (pmd_num < current_num) { > > > - return true; > > > + if (pmd_num != current_num) { > > > + return (pmd_num < current_num) ? true : false; > > > } > > > - return false; > > > + > > > + /* In case of a tie, select the new PMD only if the existing PMD > > > + * assigned is from non-local NUMA > > > + */ > > > + return (current_lowest->numa->numa_id != port_numa_id && > > > + pmd->numa->numa_id == port_numa_id) ? 
true : false; > > > + > > > } > > > > > > static struct sched_pmd * > > > sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc) { > > > - struct sched_pmd *lowest_sched_pmd = NULL; > > > + struct sched_pmd *lowest_pmd = NULL; > > > > > > for (unsigned i = 0; i < numa->n_pmds; i++) { > > > struct sched_pmd *sched_pmd; > > > @@ -5785,11 +5802,11 @@ sched_pmd_get_lowest(struct sched_numa > > *numa, bool has_cyc) > > > if (sched_pmd->isolated) { > > > continue; > > > } > > > - if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) { > > > - lowest_sched_pmd = sched_pmd; > > > + if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX, > > has_cyc)) { > > > + lowest_pmd = sched_pmd; > > > } > > > } > > > - return lowest_sched_pmd; > > > + return lowest_pmd; > > > } > > > > > > /* > > > @@ -5891,6 +5908,32 @@ get_rxq_cyc_log(char *a, enum > > sched_assignment_type algo, uint64_t cycles) > > > return a; > > > } > > > > > > +static struct sched_pmd * > > > +sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list, > > > + int port_numa_id, bool has_proc) { > > > + int n_numa; > > > + struct sched_numa *numa = NULL; > > > + struct sched_numa *last_numa = NULL; > > > + struct sched_pmd *lowest_pmd = NULL; > > > + struct sched_pmd *pmd; > > > + > > > + n_numa = sched_numa_list_count(numa_list); > > > + /* For all numas. */ > > > + for (int i = 0; i < n_numa; i++) { > > > + last_numa = numa; > > > + numa = sched_numa_list_next(numa_list, last_numa); > > > + > > > + /* Get the lowest pmd per numa. */ > > > + pmd = sched_pmd_get_lowest(numa, has_proc); > > > + > > > + /* Check if it's the lowest pmd for all numas. 
*/ > > > if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, > > > has_proc)) > > { > > > + lowest_pmd = pmd; > > > + } > > > + } > > > + return lowest_pmd; > > > +} > > > + > > > static void > > > sched_numa_list_schedule(struct sched_numa_list *numa_list, > > > struct dp_netdev *dp, @@ -5991,6 +6034,7 @@ > > > sched_numa_list_schedule(struct sched_numa_list *numa_list, > > > struct sched_pmd *sched_pmd = NULL; > > > struct sched_numa *numa; > > > int port_numa_id; > > > + bool cross_numa = rxq->port->cross_numa_polling;
> >
> > It may be better to schedule non-cross-numa rxqs first, and then cross-numa rxqs. Otherwise, one numa may end up holding much more load because of the later-scheduled non-cross-numa rxqs.
>
> For an RxQ assignment the least-loaded PMD is chosen; it could be from NUMA0 or NUMA1, depending on where the least-loaded PMD is present. The RxQs themselves are sorted and assigned one after the other; they are not assigned in any specific NUMA order.
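To make the ordering question concrete, here is a small Python sketch (not OVS code; the helper names and data layout are mine) of the greedy assignment behaviour being discussed: rxqs are sorted by measured load and each is handed to the least-loaded eligible PMD, with ties preferring the port's local NUMA. The last two calls show what happens if the NUMA-pinned rxqs are placed before the cross-NUMA ones, as suggested above:

```python
# Toy model of group-mode rxq scheduling with per-port cross-NUMA polling.
# One PMD per NUMA node; pmds maps numa_id -> accumulated load.

def assign(rxqs, pmds):
    """rxqs: list of (name, load, port_numa, cross_numa); mutates pmds."""
    placement = {}
    for name, load, port_numa, cross in sorted(rxqs, key=lambda r: -r[1]):
        # A cross-NUMA rxq may go to any NUMA; otherwise only its local one.
        numas = list(pmds) if cross else [port_numa]
        # Least-loaded PMD wins; on a tie, prefer the port's local NUMA.
        best = min(numas, key=lambda n: (pmds[n], n != port_numa))
        pmds[best] += load
        placement[name] = best
    return placement

# 2 NUMA nodes, p0 (cross-numa) with loads 30/30, p1 (local to numa0) 20/20.
pmds = {0: 0, 1: 0}
rxqs = [("p0q0", 30, 0, True), ("p0q1", 30, 0, True),
        ("p1q0", 20, 0, False), ("p1q1", 20, 0, False)]
print(assign(rxqs, pmds))  # {'p0q0': 0, 'p0q1': 1, 'p1q0': 0, 'p1q1': 0}
print(pmds)                # {0: 70, 1: 30} -- numa0 ends up with 70% of load

# Scheduling the NUMA-pinned rxqs first, cross-numa ones last:
pmds2 = {0: 0, 1: 0}
assign([r for r in rxqs if not r[3]], pmds2)
assign([r for r in rxqs if r[3]], pmds2)
print(pmds2)               # {0: 40, 1: 60} -- noticeably better balanced
```

This is only a sketch of the greedy order under the stated assumptions; the real scheduler also handles isolated PMDs, multiple PMDs per NUMA, and the non-group algorithms.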
Consider the following situation: we have 2 NUMA nodes, each with 1 PMD, and 2 ports (p0 and p1), each with 2 queues. p0 is configured for cross-NUMA polling, p1 is not. Each queue's workload:

  rxq   p0q0  p0q1  p1q0  p1q1
  load    30    30    20    20

Based on your current implementation, the assignment will be:

  p0q0 -> numa0 (30)
  p0q1 -> numa1 (30)
  p1q0 -> numa0 (30+20=50)
  p1q1 -> numa0 (50+20=70)

As a result, numa0 holds 70% of the workload while numa1 holds only 30%, because the later-assigned queues have NUMA affinity.

> > > > > > uint64_t proc_cycles; > > > char rxq_cyc_log[MAX_RXQ_CYC_STRLEN]; > > > > > > @@ -6000,33 +6044,40 @@ sched_numa_list_schedule(struct > > sched_numa_list *numa_list, > > > start_logged = true; > > > } > > > > > > - /* Store the cycles for this rxq as we will log these later. */ > > > + /* Store the cycles for this rxq as we will need these later. > > > + */ > > > proc_cycles = dp_netdev_rxq_get_cycles(rxq, > > > RXQ_CYCLES_PROC_HIST); > > > > > > port_numa_id = netdev_get_numa_id(rxq->port->netdev); > > > > > > - /* Select numa. */ > > > - numa = sched_numa_list_lookup(numa_list, port_numa_id); > > > - > > > - /* Check if numa has no PMDs or no non-isolated PMDs. */ > > > - if (!numa || !sched_numa_noniso_pmd_count(numa)) { > > > - /* Unable to use this numa to find a PMD. */ > > > - numa = NULL; > > > - /* Find any numa with available PMDs. */ > > > - for (int j = 0; j < n_numa; j++) { > > > - numa = sched_numa_list_next(numa_list, last_cross_numa); > > > - last_cross_numa = numa; > > > - if (sched_numa_noniso_pmd_count(numa)) { > > > - break; > > > - } > > > + if (cross_numa && algo == SCHED_GROUP) { > > > + /* cross_numa polling enabled so find lowest loaded pmd > > > across > > > + * all numas. */ > > > + sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, > > port_numa_id, > > > + proc_cycles); > > > + } else { > > > + /* Select numa.
*/ > > > + numa = sched_numa_list_lookup(numa_list, port_numa_id); > > > + > > > + /* Check if numa has no PMDs or no non-isolated PMDs. */ > > > + if (!numa || !sched_numa_noniso_pmd_count(numa)) { > > > + /* Unable to use this numa to find a PMD. */ > > > numa = NULL; > > > + /* Find any numa with available PMDs. */ > > > + for (int j = 0; j < n_numa; j++) { > > > + numa = sched_numa_list_next(numa_list, > > > last_cross_numa); > > > + last_cross_numa = numa; > > > + if (sched_numa_noniso_pmd_count(numa)) { > > > + break; > > > + } > > > + numa = NULL; > > > + } > > > } > > > - } > > > > > > - if (numa) { > > > - /* Select the PMD that should be used for this rxq. */ > > > - sched_pmd = sched_pmd_next(numa, algo, > > > - proc_cycles ? true : false); > > > + if (numa) { > > > + /* Select the PMD that should be used for this rxq. */ > > > + sched_pmd = sched_pmd_next(numa, algo, > > > + proc_cycles ? true : false); > > > + } > > > } > > > > > > /* Check that a pmd has been selected. */ @@ -6036,12 > > > +6087,23 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list, > > > pmd_numa_id = sched_pmd->numa->numa_id; > > > /* Check if selected pmd numa matches port numa. */ > > > if (pmd_numa_id != port_numa_id) { > > > - VLOG(level, "There's no available (non-isolated) pmd > > > thread " > > > - "on numa node %d. Port \'%s\' rx queue %d > > > will " > > > - "be assigned to a pmd on numa node %d. " > > > - "This may lead to reduced performance.", > > > - port_numa_id, netdev_rxq_get_name(rxq->rx), > > > - netdev_rxq_get_queue_id(rxq->rx), > > > pmd_numa_id); > > > + if (cross_numa) { > > > + VLOG(level, "Cross-numa polling has been selected > > > for " > > > + "Port \'%s\' rx queue %d on numa node > > > %d. " > > > + "It will be assigned to a pmd on numa > > > node " > > > + "%d. 
This may lead to reduced > > > performance.", > > > + netdev_rxq_get_name(rxq->rx), > > > + netdev_rxq_get_queue_id(rxq->rx), > > > port_numa_id, > > > + pmd_numa_id); > > > + > > > + } else { > > > + VLOG(level, "There's no available (non-isolated) pmd > > > " > > > + "thread on numa node %d. Port \'%s\' rx > > > queue " > > > + "%d will be assigned to a pmd on numa > > > node " > > > + "%d. This may lead to reduced > > > performance.", > > > + port_numa_id, > > > netdev_rxq_get_name(rxq->rx), > > > + netdev_rxq_get_queue_id(rxq->rx), > > > pmd_numa_id); > > > + } > > > } > > > VLOG(level, "Core %2u on numa node %d assigned port \'%s\' " > > > "rx queue %d%s.", diff --git a/tests/pmd.at > > > b/tests/pmd.at index e6b173dab..5d2ec1382 100644 > > > --- a/tests/pmd.at > > > +++ b/tests/pmd.at > > > @@ -579,6 +579,44 @@ > > > icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a > > > ,nw_src=10 > > > OVS_VSWITCHD_STOP > > > AT_CLEANUP > > > > > > +AT_SETUP([PMD - enable cross numa polling]) > > > +OVS_VSWITCHD_START([add-port br0 p0 \ > > > + -- set Interface p0 type=dummy-pmd options:n_rxq=4 \ > > > + -- set Interface p0 options:numa_id=0 \ > > > + -- set Open_vSwitch . other_config:pmd-cpu-mask=0x3 \ > > > + -- set open_vswitch . 
> > > other_config:pmd-rxq-assign=group], > > > + [], [], [--dummy-numa 0,1]) > > > + > > > +AT_CHECK([ovs-ofctl add-flow br0 action=controller]) > > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | > > cut -f 3 -d ' ' | sort | uniq], [0], [dnl > > > +0 > > > +]) > > > + > > > +dnl Enable cross numa polling and check numa ids TMP=$(($(cat > > > +ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1)) > > > +PATTERN="cross-numa-polling has been enabled" > > > +AT_CHECK([ovs-vsctl set Interface p0 > > > +other_config:cross-numa-polling=true]) > > > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"]) > > > + > > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | > > cut -f 3 -d ' ' | sort | uniq], [0], [dnl > > > +0 > > > +1 > > > +]) > > > + > > > +dnl Disable cross numa polling and check numa ids TMP=$(($(cat > > > +ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1)) > > > +PATTERN="cross-numa-polling has been disabled" > > > +AT_CHECK([ovs-vsctl set Interface p0 > > > +other_config:cross-numa-polling=false]) > > > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"]) > > > + > > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | > > cut -f 3 -d ' ' | sort | uniq], [0], [dnl > > > +0 > > > +]) > > > + > > > +OVS_VSWITCHD_STOP(["/|WARN|/d"]) > > > +AT_CLEANUP > > > + > > > + > > > AT_SETUP([PMD - change numa node]) > > > OVS_VSWITCHD_START( > > > [add-port br0 p1 -- set Interface p1 type=dummy-pmd > > > ofport_request=1 options:n_rxq=2 -- \ diff --git > > > a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index > > > cc1dd77ec..c2e948bd9 100644 > > > --- a/vswitchd/vswitch.xml > > > +++ b/vswitchd/vswitch.xml > > > @@ -3286,6 +3286,26 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 > > type=patch options:peer=p1 \ > > > </p> > > > </column> > > > > > > + <column name="other_config" key="cross-numa-polling" > > > + type='{"type": "boolean"}'> > > > + <p> > > > + Specifies if the RX 
queues of the port can be automatically > > > assigned > > > + to PMD threads on any NUMA node or only on the local NUMA node > > > of > > > + the port. > > > + </p> > > > + <p> > > > + Polling of physical ports from a non-local PMD thread incurs > > > some > > > + performance penalty due to the access to packet data across > > > the NUMA > > > + barrier. This option can still increase the overall > > > performance if > > > + it allows better utilization of those non-local PMDs threads. > > > + It is most useful together with the auto load-balancing of RX > > > queues > > > + <ref table="Open_vSwitch" column="other_config" key="pmd-auto-lb"/> > > > + </p> > > > + <p> > > > + Defaults to false. > > > + </p> > > > + </column> > > > + > > > <column name="options" key="xdp-mode" > > > type='{"type": "string", > > > "enum": ["set", ["best-effort", > > > "native-with-zerocopy", > > > -- > > > 2.25.1 > > > > > > _______________________________________________ > > > dev mailing list > > > d...@openvswitch.org > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev