From: Jan Scheurich <jan.scheur...@ericsson.com>

Today dpif-netdev considers PMD threads on a non-local NUMA node for automatic
assignment of a port's rxqs only if there are no local, non-isolated PMDs.

On typical servers where all physical ports are attached to one NUMA node, this
often leaves the PMDs on the other NUMA node under-utilized, wasting CPU
resources. The alternative, manually pinning the rxqs to PMDs on remote NUMA
nodes, also has drawbacks as it limits OVS' ability to auto load-balance the
rxqs.

This patch introduces a new interface configuration option to allow ports to
be automatically polled by PMDs on any NUMA node:

ovs-vsctl set interface <Name> other_config:cross-numa-polling=true

The group assignment algorithm now has the ability to select the lowest loaded
PMD on any NUMA node, not just the local NUMA node on which the port's rxq
resides.

If this option is not present or set to false, legacy behaviour applies.

Co-authored-by: Anurag Agarwal <anurag.agar...@ericsson.com>
Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Anurag Agarwal <anurag.agar...@ericsson.com>
---

Changes in this patch:
- Addressed comments from Kevin Traynor

Please refer to this thread for an earlier discussion on this topic:
https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
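
For reviewers, the selection logic can be summarized as: scan the non-isolated
PMDs on every NUMA node, keep the one with the lowest load, and on a load tie
prefer a PMD on the port's local NUMA node. Below is a minimal standalone
sketch of that tie-break; the types and function names are simplified
stand-ins for illustration only (not the actual dpif-netdev structures), and
PMD isolation is ignored:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a PMD thread: its NUMA node and current load. */
struct pmd {
    int numa_id;
    uint64_t load;      /* e.g. assigned processing cycles */
};

/* Return true if 'cand' should replace 'best' as the target for an rxq
 * whose port is local to 'port_numa': lower load wins; on a tie, a PMD
 * on the port's local NUMA node is preferred. */
static bool
replaces(const struct pmd *best, const struct pmd *cand, int port_numa)
{
    if (!best) {
        return true;
    }
    if (cand->load != best->load) {
        return cand->load < best->load;
    }
    return best->numa_id != port_numa && cand->numa_id == port_numa;
}

/* Scan PMDs on all NUMA nodes and pick the assignment target. */
static const struct pmd *
pick_lowest(const struct pmd *pmds, size_t n, int port_numa)
{
    const struct pmd *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (replaces(best, &pmds[i], port_numa)) {
            best = &pmds[i];
        }
    }
    return best;
}
```

With cross-numa-polling=false the scan is restricted to the port's local NUMA
node, which reduces to the legacy behaviour.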
---
 Documentation/topics/dpdk/pmd.rst |  23 ++++++
 lib/dpif-netdev.c                 | 130 ++++++++++++++++++++++--------
 tests/pmd.at                      |  38 +++++++++
 vswitchd/vswitch.xml              |  20 +++++
 4 files changed, 177 insertions(+), 34 deletions(-)

diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index b259cc8b3..387f962d1 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -99,6 +99,25 @@ core cycles for each Rx queue::
 
     $ ovs-appctl dpif-netdev/pmd-rxq-show
 
+Normally, Rx queues are assigned to PMD threads automatically.  By default
+OVS only assigns Rx queues to PMD threads executing on the same NUMA
+node in order to avoid unnecessary latency when accessing packet buffers
+across the NUMA boundary.  Typically this overhead is higher for vhostuser
+ports than for physical ports because of the packet copy performed for all
+Rx packets.
+
+On NUMA servers with physical ports on only one NUMA node, the NUMA-local
+polling policy can lead to under-utilization of the PMD threads on the
+remote NUMA node.  In such cases it may be beneficial for overall OVS
+performance to utilize the spare capacity and allow polling of a physical
+port's rxqs across NUMA nodes despite the overhead involved.
+The policy can be set per port with the following configuration option::
+
+    $ ovs-vsctl set Interface <iface> \
+        other_config:cross-numa-polling=true|false
+
+The default value is false.
+
 .. note::
 
    A history of one minute is recorded and shown for each Rx queue to allow for
@@ -115,6 +134,10 @@ core cycles for each Rx queue::
    A ``overhead`` statistics is shown per PMD: it represents the number of
    cycles inherently consumed by the OVS PMD processing loop.
 
+.. versionchanged:: 2.18.0
+
+   Added the interface parameter ``other_config:cross-numa-polling``.
+
 Rx queue to PMD assignment takes place whenever there are configuration changes
 or can be triggered by using::
 
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index ff57b3961..86f88964b 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -467,6 +467,7 @@ struct dp_netdev_port {
     char *type;                 /* Port type as requested by user. */
     char *rxq_affinity_list;    /* Requested affinity of rx queues. */
     enum txq_req_mode txq_requested_mode;
+    bool cross_numa_polling;    /* If true cross polling will be enabled. */
 };
 
 static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
@@ -2101,6 +2102,7 @@ port_create(const char *devname, const char *type,
     port->sf = NULL;
     port->emc_enabled = true;
     port->need_reconfigure = true;
+    port->cross_numa_polling = false;
     ovs_mutex_init(&port->txq_used_mutex);
 
     *portp = port;
@@ -5013,6 +5015,7 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
     bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
     const char *tx_steering_mode = smap_get(cfg, "tx-steering");
     enum txq_req_mode txq_mode;
+    bool cross_numa_polling = smap_get_bool(cfg, "cross-numa-polling", false);
 
     ovs_rwlock_wrlock(&dp->port_rwlock);
     error = get_port_by_number(dp, port_no, &port);
@@ -5020,6 +5023,14 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
         goto unlock;
     }
 
+    if (cross_numa_polling != port->cross_numa_polling) {
+        port->cross_numa_polling = cross_numa_polling;
+        VLOG_INFO("%s: cross-numa-polling has been %s.",
+                  netdev_get_name(port->netdev),
+                  cross_numa_polling ? "enabled" : "disabled");
+        dp_netdev_request_reconfigure(dp);
+    }
+
     if (emc_enabled != port->emc_enabled) {
         struct dp_netdev_pmd_thread *pmd;
         struct ds ds = DS_EMPTY_INITIALIZER;
@@ -5751,7 +5762,7 @@ compare_rxq_cycles(const void *a, const void *b)
 
 static bool
 sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
-                     bool has_proc)
+                     int port_numa_id, bool has_proc)
 {
     uint64_t current_num, pmd_num;
 
@@ -5767,16 +5778,22 @@ sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
         pmd_num = pmd->n_rxq;
     }
 
-    if (pmd_num < current_num) {
-        return true;
+    if (pmd_num != current_num) {
+        return pmd_num < current_num;
     }
-    return false;
+
+    /* In case of a tie, prefer the new PMD only if the currently
+     * selected PMD is on a non-local NUMA node and the new PMD is
+     * on the port's local NUMA node. */
+    return current_lowest->numa->numa_id != port_numa_id &&
+           pmd->numa->numa_id == port_numa_id;
+
 }
 
 static struct sched_pmd *
 sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
 {
-    struct sched_pmd *lowest_sched_pmd = NULL;
+    struct sched_pmd *lowest_pmd = NULL;
 
     for (unsigned i = 0; i < numa->n_pmds; i++) {
         struct sched_pmd *sched_pmd;
@@ -5785,11 +5802,11 @@ sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
         if (sched_pmd->isolated) {
             continue;
         }
-        if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) {
-            lowest_sched_pmd = sched_pmd;
+        if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX, has_cyc)) {
+            lowest_pmd = sched_pmd;
         }
     }
-    return lowest_sched_pmd;
+    return lowest_pmd;
 }
 
 /*
@@ -5891,6 +5908,32 @@ get_rxq_cyc_log(char *a, enum sched_assignment_type algo, uint64_t cycles)
     return a;
 }
 
+static struct sched_pmd *
+sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list,
+                              int port_numa_id, bool has_proc)
+{
+    int n_numa;
+    struct sched_numa *numa = NULL;
+    struct sched_numa *last_numa = NULL;
+    struct sched_pmd *lowest_pmd = NULL;
+    struct sched_pmd *pmd;
+
+    n_numa = sched_numa_list_count(numa_list);
+    /* For all numas. */
+    for (int i = 0; i < n_numa; i++) {
+        last_numa = numa;
+        numa = sched_numa_list_next(numa_list, last_numa);
+
+        /* Get the lowest loaded pmd on this numa, if any. */
+        pmd = sched_pmd_get_lowest(numa, has_proc);
+        if (!pmd) {
+            continue;
+        }
+
+        /* Check if it is the lowest loaded pmd across all numas. */
+        if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, has_proc)) {
+            lowest_pmd = pmd;
+        }
+    }
+    return lowest_pmd;
+}
+
 static void
 sched_numa_list_schedule(struct sched_numa_list *numa_list,
                          struct dp_netdev *dp,
@@ -5991,6 +6034,7 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
         struct sched_pmd *sched_pmd = NULL;
         struct sched_numa *numa;
         int port_numa_id;
+        bool cross_numa = rxq->port->cross_numa_polling;
         uint64_t proc_cycles;
         char rxq_cyc_log[MAX_RXQ_CYC_STRLEN];
 
@@ -6000,33 +6044,40 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
             start_logged = true;
         }
 
-        /* Store the cycles for this rxq as we will log these later. */
+        /* Store the cycles for this rxq as we will need these later. */
         proc_cycles = dp_netdev_rxq_get_cycles(rxq, RXQ_CYCLES_PROC_HIST);
 
         port_numa_id = netdev_get_numa_id(rxq->port->netdev);
 
-        /* Select numa. */
-        numa = sched_numa_list_lookup(numa_list, port_numa_id);
-
-        /* Check if numa has no PMDs or no non-isolated PMDs. */
-        if (!numa || !sched_numa_noniso_pmd_count(numa)) {
-            /* Unable to use this numa to find a PMD. */
-            numa = NULL;
-            /* Find any numa with available PMDs. */
-            for (int j = 0; j < n_numa; j++) {
-                numa = sched_numa_list_next(numa_list, last_cross_numa);
-                last_cross_numa = numa;
-                if (sched_numa_noniso_pmd_count(numa)) {
-                    break;
-                }
+        if (cross_numa && algo == SCHED_GROUP) {
+            /* Cross-numa polling is enabled, so find the lowest loaded pmd
+             * across all numas. */
+            sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, port_numa_id,
+                                                      proc_cycles ? true : false);
+        } else {
+            /* Select numa. */
+            numa = sched_numa_list_lookup(numa_list, port_numa_id);
+
+            /* Check if numa has no PMDs or no non-isolated PMDs. */
+            if (!numa || !sched_numa_noniso_pmd_count(numa)) {
+                /* Unable to use this numa to find a PMD. */
                 numa = NULL;
+                /* Find any numa with available PMDs. */
+                for (int j = 0; j < n_numa; j++) {
+                    numa = sched_numa_list_next(numa_list, last_cross_numa);
+                    last_cross_numa = numa;
+                    if (sched_numa_noniso_pmd_count(numa)) {
+                        break;
+                    }
+                    numa = NULL;
+                }
             }
-        }
 
-        if (numa) {
-            /* Select the PMD that should be used for this rxq. */
-            sched_pmd = sched_pmd_next(numa, algo,
-                                       proc_cycles ? true : false);
+            if (numa) {
+                /* Select the PMD that should be used for this rxq. */
+                sched_pmd = sched_pmd_next(numa, algo,
+                                           proc_cycles ? true : false);
+            }
         }
 
         /* Check that a pmd has been selected. */
@@ -6036,12 +6087,23 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
             pmd_numa_id = sched_pmd->numa->numa_id;
             /* Check if selected pmd numa matches port numa. */
             if (pmd_numa_id != port_numa_id) {
-                VLOG(level, "There's no available (non-isolated) pmd thread "
-                            "on numa node %d. Port \'%s\' rx queue %d will "
-                            "be assigned to a pmd on numa node %d. "
-                            "This may lead to reduced performance.",
-                            port_numa_id, netdev_rxq_get_name(rxq->rx),
-                            netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
+                if (cross_numa) {
+                    VLOG(level, "Cross-numa polling has been selected for "
+                                "port \'%s\' rx queue %d on numa node %d. "
+                                "It will be assigned to a pmd on numa node "
+                                "%d. This may lead to reduced performance.",
+                                netdev_rxq_get_name(rxq->rx),
+                                netdev_rxq_get_queue_id(rxq->rx), port_numa_id,
+                                pmd_numa_id);
+
+                } else {
+                    VLOG(level, "There's no available (non-isolated) pmd "
+                                "thread on numa node %d. Port \'%s\' rx queue "
+                                "%d will be assigned to a pmd on numa node "
+                                "%d. This may lead to reduced performance.",
+                                port_numa_id, netdev_rxq_get_name(rxq->rx),
+                                netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
+                }
             }
             VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
                         "rx queue %d%s.",
diff --git a/tests/pmd.at b/tests/pmd.at
index e6b173dab..5d2ec1382 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -579,6 +579,44 @@ icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10
 OVS_VSWITCHD_STOP
 AT_CLEANUP
 
+AT_SETUP([PMD - enable cross numa polling])
+OVS_VSWITCHD_START([add-port br0 p0 \
+                    -- set Interface p0 type=dummy-pmd options:n_rxq=4 \
+                    -- set Interface p0 options:numa_id=0 \
+                    -- set Open_vSwitch . other_config:pmd-cpu-mask=0x3 \
+                    -- set open_vswitch . other_config:pmd-rxq-assign=group],
+                   [], [], [--dummy-numa 0,1])
+
+AT_CHECK([ovs-ofctl add-flow br0 action=controller])
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
+0
+])
+
+dnl Enable cross numa polling and check numa ids
+TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
+PATTERN="cross-numa-polling has been enabled"
+AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=true])
+OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
+
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
+0
+1
+])
+
+dnl Disable cross numa polling and check numa ids
+TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
+PATTERN="cross-numa-polling has been disabled"
+AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=false])
+OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
+
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
+0
+])
+
+OVS_VSWITCHD_STOP(["/|WARN|/d"])
+AT_CLEANUP
+
+
 AT_SETUP([PMD - change numa node])
 OVS_VSWITCHD_START(
  [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=2 -- \
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index cc1dd77ec..c2e948bd9 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -3286,6 +3286,26 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         </p>
       </column>
 
+      <column name="other_config" key="cross-numa-polling"
+                                  type='{"type": "boolean"}'>
+        <p>
+          Specifies whether the Rx queues of the port can be automatically
+          assigned to PMD threads on any NUMA node or only on the port's
+          local NUMA node.
+        </p>
+        <p>
+          Polling of physical ports from a non-local PMD thread incurs a
+          performance penalty due to accessing packet data across the NUMA
+          boundary.  This option can still increase overall performance if
+          it allows better utilization of otherwise idle non-local PMD
+          threads.  It is most useful together with the automatic
+          load-balancing of Rx queues, see
+          <ref table="Open_vSwitch" column="other_config" key="pmd-auto-lb"/>.
+        </p>
+        <p>
+          Defaults to false.
+        </p>
+      </column>
+
       <column name="options" key="xdp-mode"
               type='{"type": "string",
                      "enum": ["set", ["best-effort", "native-with-zerocopy",
-- 
2.25.1
