On Wed, Jun 22, 2022 at 09:32:29AM +0000, Anurag Agarwal wrote:
> Thanks for your feedback. Please find my comments inline. 
> 
> Regards,
> Anurag
> 
> > -----Original Message-----
> > From: lic...@chinatelecom.cn <lic...@chinatelecom.cn>
> > Sent: Friday, June 17, 2022 2:29 PM
> > To: Anurag Agarwal <anura...@gmail.com>
> > Cc: ovs-dev@openvswitch.org
> > Subject: Re: [ovs-dev] [PATCH v4] dpif-netdev: Allow cross-NUMA polling on
> > selected ports
> > 
> > On Wed, Jun 08, 2022 at 03:58:31PM +0530, Anurag Agarwal wrote:
> > > From: Jan Scheurich <jan.scheur...@ericsson.com>
> > >
> > > Today dpif-netdev considers PMD threads on a non-local NUMA node for
> > > automatic assignment of the rxqs of a port only if there are no local,
> > > non-isolated PMDs.
> > >
> > > On typical servers with both physical ports on one NUMA node, this
> > > often leaves the PMDs on the other NUMA node under-utilized, wasting
> > > CPU resources.
> > > The alternative, to manually pin the rxqs to PMDs on remote NUMA
> > > nodes, also has drawbacks as it limits OVS' ability to auto load-balance
> > > the rxqs.
> > >
> > > This patch introduces a new interface configuration option to allow
> > > ports to be automatically polled by PMDs on any NUMA node:
> > >
> > > ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
> > >
> > > The group assignment algorithm now has the ability to select the lowest
> > > loaded PMD on any NUMA node, and not just the local NUMA node on which
> > > the rxq of the port resides.
> > >
> > > If this option is not present or set to false, legacy behaviour applies.
> > >
> > > Co-authored-by: Anurag Agarwal <anurag.agar...@ericsson.com>
> > > Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> > > Signed-off-by: Anurag Agarwal <anurag.agar...@ericsson.com>
> > > ---
> > >
> > > Changes in this patch:
> > > - Addressed comments from Kevin Traynor
> > >
> > > Please refer to this thread for an earlier discussion on this topic:
> > > https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
> > > ---
> > >  Documentation/topics/dpdk/pmd.rst |  23 ++++++
> > >  lib/dpif-netdev.c                 | 130 ++++++++++++++++++++++--------
> > >  tests/pmd.at                      |  38 +++++++++
> > >  vswitchd/vswitch.xml              |  20 +++++
> > >  4 files changed, 177 insertions(+), 34 deletions(-)
> > >
> > > diff --git a/Documentation/topics/dpdk/pmd.rst
> > > b/Documentation/topics/dpdk/pmd.rst
> > > index b259cc8b3..387f962d1 100644
> > > --- a/Documentation/topics/dpdk/pmd.rst
> > > +++ b/Documentation/topics/dpdk/pmd.rst
> > > @@ -99,6 +99,25 @@ core cycles for each Rx queue::
> > >
> > >      $ ovs-appctl dpif-netdev/pmd-rxq-show
> > >
> > > +Normally, Rx queues are assigned to PMD threads automatically.  By
> > > +default OVS only assigns Rx queues to PMD threads executing on the
> > > +same NUMA node in order to avoid unnecessary latency for accessing
> > > +packet buffers across the NUMA boundary.  Typically this overhead is
> > > +higher for vhostuser ports than for physical ports due to the packet
> > > +copy that is done for all rx packets.
> > > +
> > > +On NUMA servers with physical ports only on one NUMA node, the
> > > +NUMA-local polling policy can lead to an under-utilization of the PMD
> > > +threads on the remote NUMA node.  For the overall OVS performance it
> > > +may in such cases be beneficial to utilize the spare capacity and
> > > +allow polling of a physical port's rxqs across NUMA nodes despite the
> > overhead involved.
> > > +The policy can be set per port with the following configuration option::
> > > +
> > > +    $ ovs-vsctl set Interface <iface> \
> > > +        other_config:cross-numa-polling=true|false
> > > +
> > > +The default value is false.
> > > +
> > >  .. note::
> > >
> > >     A history of one minute is recorded and shown for each Rx queue to
> > > allow for @@ -115,6 +134,10 @@ core cycles for each Rx queue::
> > >     A ``overhead`` statistics is shown per PMD: it represents the number 
> > > of
> > >     cycles inherently consumed by the OVS PMD processing loop.
> > >
> > > +.. versionchanged:: 2.18.0
> > > +
> > > +   Added the interface parameter ``other_config:cross-numa-polling``
> > > +
> > >  Rx queue to PMD assignment takes place whenever there are
> > > configuration changes  or can be triggered by using::
> > >
> > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > > ff57b3961..86f88964b 100644
> > > --- a/lib/dpif-netdev.c
> > > +++ b/lib/dpif-netdev.c
> > > @@ -467,6 +467,7 @@ struct dp_netdev_port {
> > >      char *type;                 /* Port type as requested by user. */
> > >      char *rxq_affinity_list;    /* Requested affinity of rx queues. */
> > >      enum txq_req_mode txq_requested_mode;
> > > +    bool cross_numa_polling;    /* If true cross polling will be 
> > > enabled. */
> > >  };
> > >
> > >  static bool dp_netdev_flow_ref(struct dp_netdev_flow *); @@ -2101,6
> > > +2102,7 @@ port_create(const char *devname, const char *type,
> > >      port->sf = NULL;
> > >      port->emc_enabled = true;
> > >      port->need_reconfigure = true;
> > > +    port->cross_numa_polling = false;
> > >      ovs_mutex_init(&port->txq_used_mutex);
> > >
> > >      *portp = port;
> > > @@ -5013,6 +5015,7 @@ dpif_netdev_port_set_config(struct dpif *dpif,
> > odp_port_t port_no,
> > >      bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
> > >      const char *tx_steering_mode = smap_get(cfg, "tx-steering");
> > >      enum txq_req_mode txq_mode;
> > > +    bool cross_numa_polling = smap_get_bool(cfg,
> > > + "cross-numa-polling", false);
> > >
> > >      ovs_rwlock_wrlock(&dp->port_rwlock);
> > >      error = get_port_by_number(dp, port_no, &port); @@ -5020,6
> > > +5023,14 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t
> > port_no,
> > >          goto unlock;
> > >      }
> > >
> > > +    if (cross_numa_polling != port->cross_numa_polling) {
> > > +        port->cross_numa_polling = cross_numa_polling;
> > > +        VLOG_INFO("%s:cross-numa-polling has been %s.",
> > > +                  netdev_get_name(port->netdev),
> > > +                  cross_numa_polling? "enabled" : "disabled");
> > > +        dp_netdev_request_reconfigure(dp);
> > > +    }
> > > +
> > >      if (emc_enabled != port->emc_enabled) {
> > >          struct dp_netdev_pmd_thread *pmd;
> > >          struct ds ds = DS_EMPTY_INITIALIZER; @@ -5751,7 +5762,7 @@
> > > compare_rxq_cycles(const void *a, const void *b)
> > >
> > >  static bool
> > >  sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct
> > sched_pmd *pmd,
> > > -                     bool has_proc)
> > > +                     int port_numa_id, bool has_proc)
> > >  {
> > >      uint64_t current_num, pmd_num;
> > >
> > > @@ -5767,16 +5778,22 @@ sched_pmd_new_lowest(struct sched_pmd
> > *current_lowest, struct sched_pmd *pmd,
> > >          pmd_num = pmd->n_rxq;
> > >      }
> > >
> > > -    if (pmd_num < current_num) {
> > > -        return true;
> > > +    if (pmd_num != current_num) {
> > > +        return (pmd_num < current_num) ? true : false;
> > >      }
> > > -    return false;
> > > +
> > > +    /* In case of a tie, select the new PMD only if the existing PMD
> > > +     * assigned is from non-local NUMA
> > > +     */
> > > +    return (current_lowest->numa->numa_id != port_numa_id &&
> > > +            pmd->numa->numa_id == port_numa_id) ? true : false;
> > > +
> > >  }
> > >
> > >  static struct sched_pmd *
> > >  sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)  {
> > > -    struct sched_pmd *lowest_sched_pmd = NULL;
> > > +    struct sched_pmd *lowest_pmd = NULL;
> > >
> > >      for (unsigned i = 0; i < numa->n_pmds; i++) {
> > >          struct sched_pmd *sched_pmd;
> > > @@ -5785,11 +5802,11 @@ sched_pmd_get_lowest(struct sched_numa
> > *numa, bool has_cyc)
> > >          if (sched_pmd->isolated) {
> > >              continue;
> > >          }
> > > -        if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) {
> > > -            lowest_sched_pmd = sched_pmd;
> > > +        if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX,
> > has_cyc)) {
> > > +            lowest_pmd = sched_pmd;
> > >          }
> > >      }
> > > -    return lowest_sched_pmd;
> > > +    return lowest_pmd;
> > >  }
> > >
> > >  /*
> > > @@ -5891,6 +5908,32 @@ get_rxq_cyc_log(char *a, enum
> > sched_assignment_type algo, uint64_t cycles)
> > >      return a;
> > >  }
> > >
> > > +static struct sched_pmd *
> > > +sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list,
> > > +                              int port_numa_id, bool has_proc) {
> > > +    int n_numa;
> > > +    struct sched_numa *numa = NULL;
> > > +    struct sched_numa *last_numa = NULL;
> > > +    struct sched_pmd *lowest_pmd = NULL;
> > > +    struct sched_pmd *pmd;
> > > +
> > > +    n_numa = sched_numa_list_count(numa_list);
> > > +    /* For all numas. */
> > > +    for (int i = 0; i < n_numa; i++) {
> > > +        last_numa = numa;
> > > +        numa = sched_numa_list_next(numa_list, last_numa);
> > > +
> > > +        /* Get the lowest pmd per numa. */
> > > +        pmd = sched_pmd_get_lowest(numa, has_proc);
> > > +
> > > +        /* Check if it's the lowest pmd for all numas. */
> > > +        if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, 
> > > has_proc))
> > {
> > > +            lowest_pmd = pmd;
> > > +        }
> > > +    }
> > > +    return lowest_pmd;
> > > +}
> > > +
> > >  static void
> > >  sched_numa_list_schedule(struct sched_numa_list *numa_list,
> > >                           struct dp_netdev *dp, @@ -5991,6 +6034,7 @@
> > > sched_numa_list_schedule(struct sched_numa_list *numa_list,
> > >          struct sched_pmd *sched_pmd = NULL;
> > >          struct sched_numa *numa;
> > >          int port_numa_id;
> > > +        bool cross_numa = rxq->port->cross_numa_polling;
> > 
> > It may be better to schedule non-cross-numa rxqs first, and then cross-numa
> > rxqs.  Otherwise, one numa may hold much more load because of the later
> > scheduled non-cross-numa rxqs.
> 
> 
> For an RxQ assignment, the least loaded PMD is chosen; it could be from NUMA0
> or NUMA1, depending on where the least loaded PMD is present.
> The RxQs themselves are sorted and assigned one after the other. They are not
> assigned in any specific NUMA order.

Consider the following situation:

We have 2 NUMA nodes, and each NUMA node has 1 PMD.
We also have 2 ports (p0, p1), and each port has 2 queues.
p0 is configured with cross-numa polling, p1 is not; p1 is local to numa0.

Each queue's workload:

rxq   p0q0 p0q1 p1q0 p1q1 
load  30   30   20   20

Based on your current implementation, the assignment will be:

p0q0 -> numa0 (30)
p0q1 -> numa1 (30)
p1q0 -> numa0 (30+20=50)
p1q1 -> numa0 (50+20=70)

As a result, numa0 holds 70% of the workload while numa1 holds only 30%,
because the queues assigned later are restricted to their local NUMA node.
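
To make this concrete, here is a small standalone sketch (plain C, not OVS
code; the queue names, loads and the one-PMD-per-NUMA setup are just the
hypothetical values above) that runs the same greedy least-loaded assignment
in both orders:

#include <stdio.h>

struct rxq {
    const char *name;
    int load;
    int local_numa;   /* NUMA node of the port. */
    int cross_numa;   /* 1 = may be polled from any NUMA node. */
};

static void
schedule(const struct rxq *rxqs, int n_rxqs, const char *label)
{
    int pmd_load[2] = { 0, 0 };   /* One PMD per NUMA node. */

    for (int i = 0; i < n_rxqs; i++) {
        int numa;

        if (rxqs[i].cross_numa) {
            /* Pick the least loaded PMD across both NUMA nodes. */
            numa = pmd_load[0] <= pmd_load[1] ? 0 : 1;
        } else {
            /* Only the port-local NUMA node is considered. */
            numa = rxqs[i].local_numa;
        }
        pmd_load[numa] += rxqs[i].load;
        printf("%s: %s -> numa%d (%d)\n",
               label, rxqs[i].name, numa, pmd_load[numa]);
    }
    /* Total load is 100, so the totals double as percentages. */
    printf("%s: numa0=%d, numa1=%d\n\n", label, pmd_load[0], pmd_load[1]);
}

int
main(void)
{
    /* Current order: rxqs sorted by load only. */
    const struct rxq current[] = {
        { "p0q0", 30, 0, 1 }, { "p0q1", 30, 0, 1 },
        { "p1q0", 20, 0, 0 }, { "p1q1", 20, 0, 0 },
    };
    /* Suggested order: non-cross-numa rxqs scheduled first. */
    const struct rxq suggested[] = {
        { "p1q0", 20, 0, 0 }, { "p1q1", 20, 0, 0 },
        { "p0q0", 30, 0, 1 }, { "p0q1", 30, 0, 1 },
    };

    schedule(current, 4, "current");      /* numa0=70, numa1=30 */
    schedule(suggested, 4, "suggested");  /* numa0=40, numa1=60 */
    return 0;
}

With the non-cross-numa rxqs scheduled first, the cross-numa rxqs fill in
around them and the split ends up around 40/60 instead of 70/30.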

> 
> > 
> > >          uint64_t proc_cycles;
> > >          char rxq_cyc_log[MAX_RXQ_CYC_STRLEN];
> > >
> > > @@ -6000,33 +6044,40 @@ sched_numa_list_schedule(struct
> > sched_numa_list *numa_list,
> > >              start_logged = true;
> > >          }
> > >
> > > -        /* Store the cycles for this rxq as we will log these later. */
> > > +        /* Store the cycles for this rxq as we will need these later.
> > > + */
> > >          proc_cycles = dp_netdev_rxq_get_cycles(rxq,
> > > RXQ_CYCLES_PROC_HIST);
> > >
> > >          port_numa_id = netdev_get_numa_id(rxq->port->netdev);
> > >
> > > -        /* Select numa. */
> > > -        numa = sched_numa_list_lookup(numa_list, port_numa_id);
> > > -
> > > -        /* Check if numa has no PMDs or no non-isolated PMDs. */
> > > -        if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> > > -            /* Unable to use this numa to find a PMD. */
> > > -            numa = NULL;
> > > -            /* Find any numa with available PMDs. */
> > > -            for (int j = 0; j < n_numa; j++) {
> > > -                numa = sched_numa_list_next(numa_list, last_cross_numa);
> > > -                last_cross_numa = numa;
> > > -                if (sched_numa_noniso_pmd_count(numa)) {
> > > -                    break;
> > > -                }
> > > +        if (cross_numa && algo == SCHED_GROUP) {
> > > +            /* cross_numa polling enabled so find lowest loaded pmd 
> > > across
> > > +             * all numas. */
> > > +            sched_pmd = sched_pmd_all_numa_get_lowest(numa_list,
> > port_numa_id,
> > > +                                                      proc_cycles);
> > > +        } else {
> > > +            /* Select numa. */
> > > +            numa = sched_numa_list_lookup(numa_list, port_numa_id);
> > > +
> > > +            /* Check if numa has no PMDs or no non-isolated PMDs. */
> > > +            if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> > > +                /* Unable to use this numa to find a PMD. */
> > >                  numa = NULL;
> > > +                /* Find any numa with available PMDs. */
> > > +                for (int j = 0; j < n_numa; j++) {
> > > +                    numa = sched_numa_list_next(numa_list, 
> > > last_cross_numa);
> > > +                    last_cross_numa = numa;
> > > +                    if (sched_numa_noniso_pmd_count(numa)) {
> > > +                        break;
> > > +                    }
> > > +                    numa = NULL;
> > > +                }
> > >              }
> > > -        }
> > >
> > > -        if (numa) {
> > > -            /* Select the PMD that should be used for this rxq. */
> > > -            sched_pmd = sched_pmd_next(numa, algo,
> > > -                                       proc_cycles ? true : false);
> > > +            if (numa) {
> > > +                /* Select the PMD that should be used for this rxq. */
> > > +                sched_pmd = sched_pmd_next(numa, algo,
> > > +                                           proc_cycles ? true : false);
> > > +            }
> > >          }
> > >
> > >          /* Check that a pmd has been selected. */ @@ -6036,12
> > > +6087,23 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
> > >              pmd_numa_id = sched_pmd->numa->numa_id;
> > >              /* Check if selected pmd numa matches port numa. */
> > >              if (pmd_numa_id != port_numa_id) {
> > > -                VLOG(level, "There's no available (non-isolated) pmd 
> > > thread "
> > > -                            "on numa node %d. Port \'%s\' rx queue %d 
> > > will "
> > > -                            "be assigned to a pmd on numa node %d. "
> > > -                            "This may lead to reduced performance.",
> > > -                            port_numa_id, netdev_rxq_get_name(rxq->rx),
> > > -                            netdev_rxq_get_queue_id(rxq->rx), 
> > > pmd_numa_id);
> > > +                if (cross_numa) {
> > > +                    VLOG(level, "Cross-numa polling has been selected 
> > > for "
> > > +                                "Port \'%s\' rx queue %d on numa node 
> > > %d. "
> > > +                                "It will be assigned to a pmd on numa 
> > > node "
> > > +                                "%d. This may lead to reduced 
> > > performance.",
> > > +                                netdev_rxq_get_name(rxq->rx),
> > > +                                netdev_rxq_get_queue_id(rxq->rx), 
> > > port_numa_id,
> > > +                                pmd_numa_id);
> > > +
> > > +                } else {
> > > +                    VLOG(level, "There's no available (non-isolated) pmd 
> > > "
> > > +                                "thread on numa node %d. Port \'%s\' rx 
> > > queue "
> > > +                                "%d will be assigned to a pmd on numa 
> > > node "
> > > +                                "%d. This may lead to reduced 
> > > performance.",
> > > +                                port_numa_id, 
> > > netdev_rxq_get_name(rxq->rx),
> > > +                                netdev_rxq_get_queue_id(rxq->rx), 
> > > pmd_numa_id);
> > > +                }
> > >              }
> > >              VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
> > >                          "rx queue %d%s.", diff --git a/tests/pmd.at
> > > b/tests/pmd.at index e6b173dab..5d2ec1382 100644
> > > --- a/tests/pmd.at
> > > +++ b/tests/pmd.at
> > > @@ -579,6 +579,44 @@
> > > icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a
> > > ,nw_src=10
> > >  OVS_VSWITCHD_STOP
> > >  AT_CLEANUP
> > >
> > > +AT_SETUP([PMD - enable cross numa polling])
> > > +OVS_VSWITCHD_START([add-port br0 p0 \
> > > +                    -- set Interface p0 type=dummy-pmd options:n_rxq=4 \
> > > +                    -- set Interface p0 options:numa_id=0 \
> > > +                    -- set Open_vSwitch . other_config:pmd-cpu-mask=0x3 \
> > > +                    -- set open_vswitch . 
> > > other_config:pmd-rxq-assign=group],
> > > +                   [], [], [--dummy-numa 0,1])
> > > +
> > > +AT_CHECK([ovs-ofctl add-flow br0 action=controller])
> > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show |
> > cut -f 3 -d ' ' | sort |      uniq], [0], [dnl
> > > +0
> > > +])
> > > +
> > > +dnl Enable cross numa polling and check numa ids TMP=$(($(cat
> > > +ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> > > +PATTERN="cross-numa-polling has been enabled"
> > > +AT_CHECK([ovs-vsctl set Interface p0
> > > +other_config:cross-numa-polling=true])
> > > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> > > +
> > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show |
> > cut -f 3 -d ' ' | sort |      uniq], [0], [dnl
> > > +0
> > > +1
> > > +])
> > > +
> > > +dnl Disable cross numa polling and check numa ids TMP=$(($(cat
> > > +ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> > > +PATTERN="cross-numa-polling has been disabled"
> > > +AT_CHECK([ovs-vsctl set Interface p0
> > > +other_config:cross-numa-polling=false])
> > > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> > > +
> > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show |
> > cut -f 3 -d ' ' | sort |      uniq], [0], [dnl
> > > +0
> > > +])
> > > +
> > > +OVS_VSWITCHD_STOP(["/|WARN|/d"])
> > > +AT_CLEANUP
> > > +
> > > +
> > >  AT_SETUP([PMD - change numa node])
> > >  OVS_VSWITCHD_START(
> > >    [add-port br0 p1 -- set Interface p1 type=dummy-pmd
> > > ofport_request=1 options:n_rxq=2 -- \ diff --git
> > > a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index
> > > cc1dd77ec..c2e948bd9 100644
> > > --- a/vswitchd/vswitch.xml
> > > +++ b/vswitchd/vswitch.xml
> > > @@ -3286,6 +3286,26 @@ ovs-vsctl add-port br0 p0 -- set Interface p0
> > type=patch options:peer=p1 \
> > >          </p>
> > >        </column>
> > >
> > > +      <column name="other_config" key="cross-numa-polling"
> > > +                                  type='{"type": "boolean"}'>
> > > +        <p>
> > > +          Specifies if the RX queues of the port can be automatically 
> > > assigned
> > > +          to PMD threads on any NUMA node or only on the local NUMA node 
> > > of
> > > +          the port.
> > > +        </p>
> > > +        <p>
> > > +          Polling of physical ports from a non-local PMD thread incurs 
> > > some
> > > +          performance penalty due to the access to packet data across 
> > > the NUMA
> > > +          barrier. This option can still increase the overall 
> > > performance if
> > > +          it allows better utilization of those non-local PMDs threads.
> > > +          It is most useful together with the auto load-balancing of RX 
> > > queues
> > > +          <ref table="Open_vSwitch" column="other_config" key="pmd-auto-
> > lb"/>
> > > +        </p>
> > > +        <p>
> > > +          Defaults to false.
> > > +        </p>
> > > +      </column>
> > > +
> > >        <column name="options" key="xdp-mode"
> > >                type='{"type": "string",
> > >                       "enum": ["set", ["best-effort",
> > > "native-with-zerocopy",
> > > --
> > > 2.25.1
> > >
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
