On Wed, Jun 22, 2022 at 09:32:29AM +0000, Anurag Agarwal wrote:
> Thanks for your feedback. Please find my comments inline.
>
> Regards,
> Anurag
>
> > -----Original Message-----
> > From: [email protected] <[email protected]>
> > Sent: Friday, June 17, 2022 2:29 PM
> > To: Anurag Agarwal <[email protected]>
> > Cc: [email protected]
> > Subject: Re: [ovs-dev] [PATCH v4] dpif-netdev: Allow cross-NUMA polling on
> > selected ports
> >
> > On Wed, Jun 08, 2022 at 03:58:31PM +0530, Anurag Agarwal wrote:
> > > From: Jan Scheurich <[email protected]>
> > >
> > > Today dpif-netdev considers PMD threads on a non-local NUMA node for
> > > automatic assignment of the rxqs of a port only if there are no
> > > local, non-isolated PMDs.
> > >
> > > On typical servers with both physical ports on one NUMA node, this
> > > often leaves the PMDs on the other NUMA node under-utilized, wasting
> > > CPU resources. The alternative, to manually pin the rxqs to PMDs on
> > > remote NUMA nodes, also has drawbacks as it limits OVS' ability to
> > > auto load-balance the rxqs.
> > >
> > > This patch introduces a new interface configuration option to allow
> > > ports to be automatically polled by PMDs on any NUMA node:
> > >
> > > ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
> > >
> > > The group assignment algorithm now has the ability to select the
> > > lowest-loaded PMD on any NUMA node, and not just the local NUMA node
> > > on which the rxq of the port resides.
> > >
> > > If this option is not present or set to false, legacy behaviour applies.
> > >
> > > Co-authored-by: Anurag Agarwal <[email protected]>
> > > Signed-off-by: Jan Scheurich <[email protected]>
> > > Signed-off-by: Anurag Agarwal <[email protected]>
> > > ---
> > >
> > > Changes in this patch:
> > > - Addressed comments from Kevin Traynor
> > >
> > > Please refer to this thread for an earlier discussion on this topic:
> > > https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
> > > ---
> > > Documentation/topics/dpdk/pmd.rst | 23 ++++++
> > > lib/dpif-netdev.c | 130 ++++++++++++++++++++++--------
> > > tests/pmd.at | 38 +++++++++
> > > vswitchd/vswitch.xml | 20 +++++
> > > 4 files changed, 177 insertions(+), 34 deletions(-)
> > >
> > > diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> > > index b259cc8b3..387f962d1 100644
> > > --- a/Documentation/topics/dpdk/pmd.rst
> > > +++ b/Documentation/topics/dpdk/pmd.rst
> > > @@ -99,6 +99,25 @@ core cycles for each Rx queue::
> > >
> > > $ ovs-appctl dpif-netdev/pmd-rxq-show
> > >
> > > +Normally, Rx queues are assigned to PMD threads automatically. By
> > > +default OVS only assigns Rx queues to PMD threads executing on the
> > > +same NUMA node in order to avoid unnecessary latency for accessing
> > > +packet buffers across the NUMA boundary. Typically this overhead is
> > > +higher for vhostuser ports than for physical ports due to the packet
> > > +copy that is done for all rx packets.
> > > +
> > > +On NUMA servers with physical ports only on one NUMA node, the
> > > +NUMA-local polling policy can lead to an under-utilization of the PMD
> > > +threads on the remote NUMA node. For the overall OVS performance it
> > > +may in such cases be beneficial to utilize the spare capacity and
> > > +allow polling of a physical port's rxqs across NUMA nodes despite the
> > > +overhead involved.
> > > +The policy can be set per port with the following configuration option::
> > > +
> > > + $ ovs-vsctl set Interface <iface> \
> > > + other_config:cross-numa-polling=true|false
> > > +
> > > +The default value is false.
> > > +
> > > .. note::
> > >
> > > A history of one minute is recorded and shown for each Rx queue to
> > > allow for
> > > @@ -115,6 +134,10 @@ core cycles for each Rx queue::
> > > A ``overhead`` statistics is shown per PMD: it represents the number
> > > of cycles inherently consumed by the OVS PMD processing loop.
> > >
> > > +.. versionchanged:: 2.18.0
> > > +
> > > + Added the interface parameter ``other_config:cross-numa-polling``
> > > +
> > > Rx queue to PMD assignment takes place whenever there are
> > > configuration changes or can be triggered by using::
> > >
> > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > > index ff57b3961..86f88964b 100644
> > > --- a/lib/dpif-netdev.c
> > > +++ b/lib/dpif-netdev.c
> > > @@ -467,6 +467,7 @@ struct dp_netdev_port {
> > > char *type; /* Port type as requested by user. */
> > > char *rxq_affinity_list; /* Requested affinity of rx queues. */
> > > enum txq_req_mode txq_requested_mode;
> > > + bool cross_numa_polling; /* If true cross polling will be enabled. */
> > > };
> > >
> > > static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
> > > @@ -2101,6 +2102,7 @@ port_create(const char *devname, const char *type,
> > > port->sf = NULL;
> > > port->emc_enabled = true;
> > > port->need_reconfigure = true;
> > > + port->cross_numa_polling = false;
> > > ovs_mutex_init(&port->txq_used_mutex);
> > >
> > > *portp = port;
> > > @@ -5013,6 +5015,7 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
> > > bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
> > > const char *tx_steering_mode = smap_get(cfg, "tx-steering");
> > > enum txq_req_mode txq_mode;
> > > + bool cross_numa_polling = smap_get_bool(cfg, "cross-numa-polling",
> > > + false);
> > >
> > > ovs_rwlock_wrlock(&dp->port_rwlock);
> > > error = get_port_by_number(dp, port_no, &port);
> > > @@ -5020,6 +5023,14 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
> > > goto unlock;
> > > }
> > >
> > > + if (cross_numa_polling != port->cross_numa_polling) {
> > > + port->cross_numa_polling = cross_numa_polling;
> > > + VLOG_INFO("%s:cross-numa-polling has been %s.",
> > > + netdev_get_name(port->netdev),
> > > + cross_numa_polling? "enabled" : "disabled");
> > > + dp_netdev_request_reconfigure(dp);
> > > + }
> > > +
> > > if (emc_enabled != port->emc_enabled) {
> > > struct dp_netdev_pmd_thread *pmd;
> > > struct ds ds = DS_EMPTY_INITIALIZER;
> > > @@ -5751,7 +5762,7 @@ compare_rxq_cycles(const void *a, const void *b)
> > >
> > > static bool
> > > sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
> > > - bool has_proc)
> > > + int port_numa_id, bool has_proc)
> > > {
> > > uint64_t current_num, pmd_num;
> > >
> > > @@ -5767,16 +5778,22 @@ sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
> > > pmd_num = pmd->n_rxq;
> > > }
> > >
> > > - if (pmd_num < current_num) {
> > > - return true;
> > > + if (pmd_num != current_num) {
> > > + return (pmd_num < current_num) ? true : false;
> > > }
> > > - return false;
> > > +
> > > + /* In case of a tie, select the new PMD only if the existing PMD
> > > + * assigned is from a non-local NUMA node. */
> > > + return (current_lowest->numa->numa_id != port_numa_id &&
> > > + pmd->numa->numa_id == port_numa_id) ? true : false;
> > > +
> > > }
> > >
> > > static struct sched_pmd *
> > > sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
> > > {
> > > - struct sched_pmd *lowest_sched_pmd = NULL;
> > > + struct sched_pmd *lowest_pmd = NULL;
> > >
> > > for (unsigned i = 0; i < numa->n_pmds; i++) {
> > > struct sched_pmd *sched_pmd;
> > > @@ -5785,11 +5802,11 @@ sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
> > > if (sched_pmd->isolated) {
> > > continue;
> > > }
> > > - if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) {
> > > - lowest_sched_pmd = sched_pmd;
> > > + if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX, has_cyc)) {
> > > + lowest_pmd = sched_pmd;
> > > }
> > > }
> > > - return lowest_sched_pmd;
> > > + return lowest_pmd;
> > > }
> > >
> > > /*
> > > @@ -5891,6 +5908,32 @@ get_rxq_cyc_log(char *a, enum sched_assignment_type algo, uint64_t cycles)
> > > return a;
> > > }
> > >
> > > +static struct sched_pmd *
> > > +sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list,
> > > + int port_numa_id, bool has_proc)
> > > +{
> > > + int n_numa;
> > > + struct sched_numa *numa = NULL;
> > > + struct sched_numa *last_numa = NULL;
> > > + struct sched_pmd *lowest_pmd = NULL;
> > > + struct sched_pmd *pmd;
> > > +
> > > + n_numa = sched_numa_list_count(numa_list);
> > > + /* For all numas. */
> > > + for (int i = 0; i < n_numa; i++) {
> > > + last_numa = numa;
> > > + numa = sched_numa_list_next(numa_list, last_numa);
> > > +
> > > + /* Get the lowest pmd per numa. */
> > > + pmd = sched_pmd_get_lowest(numa, has_proc);
> > > +
> > > + /* Check if it's the lowest pmd for all numas. */
> > > + if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, has_proc)) {
> > > + lowest_pmd = pmd;
> > > + }
> > > + }
> > > + return lowest_pmd;
> > > +}
> > > +
> > > static void
> > > sched_numa_list_schedule(struct sched_numa_list *numa_list,
> > > struct dp_netdev *dp,
> > > @@ -5991,6 +6034,7 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
> > > struct sched_pmd *sched_pmd = NULL;
> > > struct sched_numa *numa;
> > > int port_numa_id;
> > > + bool cross_numa = rxq->port->cross_numa_polling;
> >
> > It may be better to schedule non-cross-numa rxqs first, and then
> > cross-numa rxqs. Otherwise, one numa may hold much more load because
> > of the later-scheduled non-cross-numa rxqs.
>
>
> For an RxQ assignment, the least-loaded PMD is chosen; it could be from
> NUMA0 or NUMA1, depending on where the least-loaded PMD is present.
> The RxQs themselves are sorted by load and assigned one after the other.
> They are not assigned in any specific NUMA order.
Consider the following situation:
We have 2 NUMA nodes; each NUMA node has 1 PMD.
And we have 2 ports (p0, p1), both local to numa0; each port has 2 queues.
p0 is configured with cross-numa-polling enabled, p1 is not.
Each queue's workload:
rxq   p0q0  p0q1  p1q0  p1q1
load    30    30    20    20
Based on your current implementation, the assignment will be:
p0q0 -> numa0 (30)
p0q1 -> numa1 (30)
p1q0 -> numa0 (30+20=50)
p1q1 -> numa0 (50+20=70)
As a result, numa0 holds 70% of the workload but numa1 holds only 30%,
because the queues assigned later have NUMA affinity.
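
To make this concrete, below is a small standalone sketch of the two-pass
ordering suggested above: place the NUMA-pinned rxqs first, then let the
cross-numa rxqs fill the remaining capacity on the globally least-loaded
PMD. The types and helpers are simplified stand-ins for illustration, not
the actual dpif-netdev code:

    #include <stdbool.h>
    #include <stdio.h>

    struct rxq {
        const char *name;
        int numa_id;       /* local NUMA node of the port */
        bool cross_numa;   /* other_config:cross-numa-polling */
        unsigned load;     /* measured cycles, pre-sorted descending */
    };

    int
    main(void)
    {
        unsigned numa_load[2] = {0, 0};   /* one PMD per NUMA node */
        struct rxq rxqs[] = {             /* both ports local to numa0 */
            {"p0q0", 0, true, 30}, {"p0q1", 0, true, 30},
            {"p1q0", 0, false, 20}, {"p1q1", 0, false, 20},
        };

        /* Pass 0: pinned rxqs go to their local NUMA node.
         * Pass 1: cross-numa rxqs go to the least-loaded PMD overall. */
        for (int pass = 0; pass < 2; pass++) {
            for (size_t i = 0; i < sizeof rxqs / sizeof rxqs[0]; i++) {
                struct rxq *q = &rxqs[i];

                if (q->cross_numa != (pass == 1)) {
                    continue;
                }
                int numa = q->cross_numa
                           ? (numa_load[0] <= numa_load[1] ? 0 : 1)
                           : q->numa_id;
                numa_load[numa] += q->load;
                printf("%s -> numa%d (%u)\n", q->name, numa,
                       numa_load[numa]);
            }
        }
        return 0;
    }

With the pinned queues placed first, the same loads end up as numa0=40
and numa1=60 instead of 70/30.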
>
> >
> > > uint64_t proc_cycles;
> > > char rxq_cyc_log[MAX_RXQ_CYC_STRLEN];
> > >
> > > @@ -6000,33 +6044,40 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
> > > start_logged = true;
> > > }
> > >
> > > - /* Store the cycles for this rxq as we will log these later. */
> > > + /* Store the cycles for this rxq as we will need these later. */
> > > proc_cycles = dp_netdev_rxq_get_cycles(rxq,
> > > RXQ_CYCLES_PROC_HIST);
> > >
> > > port_numa_id = netdev_get_numa_id(rxq->port->netdev);
> > >
> > > - /* Select numa. */
> > > - numa = sched_numa_list_lookup(numa_list, port_numa_id);
> > > -
> > > - /* Check if numa has no PMDs or no non-isolated PMDs. */
> > > - if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> > > - /* Unable to use this numa to find a PMD. */
> > > - numa = NULL;
> > > - /* Find any numa with available PMDs. */
> > > - for (int j = 0; j < n_numa; j++) {
> > > - numa = sched_numa_list_next(numa_list, last_cross_numa);
> > > - last_cross_numa = numa;
> > > - if (sched_numa_noniso_pmd_count(numa)) {
> > > - break;
> > > - }
> > > + if (cross_numa && algo == SCHED_GROUP) {
> > > + /* cross_numa polling enabled so find lowest loaded pmd across
> > > + * all numas. */
> > > + sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, port_numa_id,
> > > + proc_cycles);
> > > + } else {
> > > + /* Select numa. */
> > > + numa = sched_numa_list_lookup(numa_list, port_numa_id);
> > > +
> > > + /* Check if numa has no PMDs or no non-isolated PMDs. */
> > > + if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> > > + /* Unable to use this numa to find a PMD. */
> > > numa = NULL;
> > > + /* Find any numa with available PMDs. */
> > > + for (int j = 0; j < n_numa; j++) {
> > > + numa = sched_numa_list_next(numa_list, last_cross_numa);
> > > + last_cross_numa = numa;
> > > + if (sched_numa_noniso_pmd_count(numa)) {
> > > + break;
> > > + }
> > > + numa = NULL;
> > > + }
> > > }
> > > - }
> > >
> > > - if (numa) {
> > > - /* Select the PMD that should be used for this rxq. */
> > > - sched_pmd = sched_pmd_next(numa, algo,
> > > - proc_cycles ? true : false);
> > > + if (numa) {
> > > + /* Select the PMD that should be used for this rxq. */
> > > + sched_pmd = sched_pmd_next(numa, algo,
> > > + proc_cycles ? true : false);
> > > + }
> > > }
> > >
> > > /* Check that a pmd has been selected. */
> > > @@ -6036,12 +6087,23 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
> > > pmd_numa_id = sched_pmd->numa->numa_id;
> > > /* Check if selected pmd numa matches port numa. */
> > > if (pmd_numa_id != port_numa_id) {
> > > - VLOG(level, "There's no available (non-isolated) pmd
> > > thread "
> > > - "on numa node %d. Port \'%s\' rx queue %d
> > > will "
> > > - "be assigned to a pmd on numa node %d. "
> > > - "This may lead to reduced performance.",
> > > - port_numa_id, netdev_rxq_get_name(rxq->rx),
> > > - netdev_rxq_get_queue_id(rxq->rx),
> > > pmd_numa_id);
> > > + if (cross_numa) {
> > > + VLOG(level, "Cross-numa polling has been selected
> > > for "
> > > + "Port \'%s\' rx queue %d on numa node
> > > %d. "
> > > + "It will be assigned to a pmd on numa
> > > node "
> > > + "%d. This may lead to reduced
> > > performance.",
> > > + netdev_rxq_get_name(rxq->rx),
> > > + netdev_rxq_get_queue_id(rxq->rx),
> > > port_numa_id,
> > > + pmd_numa_id);
> > > +
> > > + } else {
> > > + VLOG(level, "There's no available (non-isolated) pmd
> > > "
> > > + "thread on numa node %d. Port \'%s\' rx
> > > queue "
> > > + "%d will be assigned to a pmd on numa
> > > node "
> > > + "%d. This may lead to reduced
> > > performance.",
> > > + port_numa_id,
> > > netdev_rxq_get_name(rxq->rx),
> > > + netdev_rxq_get_queue_id(rxq->rx),
> > > pmd_numa_id);
> > > + }
> > > }
> > > VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
> > > "rx queue %d%s.", diff --git a/tests/pmd.at
> > > b/tests/pmd.at index e6b173dab..5d2ec1382 100644
> > > --- a/tests/pmd.at
> > > +++ b/tests/pmd.at
> > > @@ -579,6 +579,44 @@ icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10
> > > OVS_VSWITCHD_STOP
> > > AT_CLEANUP
> > >
> > > +AT_SETUP([PMD - enable cross numa polling])
> > > +OVS_VSWITCHD_START([add-port br0 p0 \
> > > + -- set Interface p0 type=dummy-pmd options:n_rxq=4 \
> > > + -- set Interface p0 options:numa_id=0 \
> > > + -- set Open_vSwitch . other_config:pmd-cpu-mask=0x3 \
> > > + -- set open_vswitch . other_config:pmd-rxq-assign=group],
> > > + [], [], [--dummy-numa 0,1])
> > > +
> > > +AT_CHECK([ovs-ofctl add-flow br0 action=controller])
> > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> > > +0
> > > +])
> > > +
> > > +dnl Enable cross numa polling and check numa ids
> > > +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> > > +PATTERN="cross-numa-polling has been enabled"
> > > +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=true])
> > > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> > > +
> > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> > > +0
> > > +1
> > > +])
> > > +
> > > +dnl Disable cross numa polling and check numa ids
> > > +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> > > +PATTERN="cross-numa-polling has been disabled"
> > > +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=false])
> > > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> > > +
> > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> > > +0
> > > +])
> > > +
> > > +OVS_VSWITCHD_STOP(["/|WARN|/d"])
> > > +AT_CLEANUP
> > > +
> > > +
> > > AT_SETUP([PMD - change numa node])
> > > OVS_VSWITCHD_START(
> > > [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=2 -- \
> > > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > > index cc1dd77ec..c2e948bd9 100644
> > > --- a/vswitchd/vswitch.xml
> > > +++ b/vswitchd/vswitch.xml
> > > @@ -3286,6 +3286,26 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
> > > </p>
> > > </column>
> > >
> > > + <column name="other_config" key="cross-numa-polling"
> > > + type='{"type": "boolean"}'>
> > > + <p>
> > > + Specifies if the RX queues of the port can be automatically
> > > assigned
> > > + to PMD threads on any NUMA node or only on the local NUMA node
> > > of
> > > + the port.
> > > + </p>
> > > + <p>
> > > + Polling of physical ports from a non-local PMD thread incurs
> > > some
> > > + performance penalty due to the access to packet data across
> > > the NUMA
> > > + barrier. This option can still increase the overall
> > > performance if
> > > + it allows better utilization of those non-local PMDs threads.
> > > + It is most useful together with the auto load-balancing of RX
> > > queues
> > > + <ref table="Open_vSwitch" column="other_config" key="pmd-auto-
> > lb"/>
> > > + </p>
> > > + <p>
> > > + Defaults to false.
> > > + </p>
> > > + </column>
> > > +
> > > <column name="options" key="xdp-mode"
> > > type='{"type": "string",
> > > "enum": ["set", ["best-effort",
> > > "native-with-zerocopy",
> > > --
> > > 2.25.1
> > >
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev