Hello Kevin,

> -----Original Message-----
> From: Kevin Traynor <[email protected]>
> Sent: Thursday, June 23, 2022 9:07 PM
> To: Anurag Agarwal <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on
> selected ports
> 
> Hi Anurag,
> 
> On 23/06/2022 11:18, Anurag Agarwal wrote:
> > From: Jan Scheurich<[email protected]>
> >
> > Today dpif-netdev considers PMD threads on a non-local NUMA node for
> > automatic assignment of the rxqs of a port only if there are no
> > local, non-isolated PMDs.
> >
> > On typical servers with both physical ports on one NUMA node, this
> > often leaves the PMDs on the other NUMA node under-utilized, wasting
> > CPU resources. The alternative, to manually pin the rxqs to PMDs on
> > remote NUMA nodes, also has drawbacks as it limits OVS' ability to
> > auto load-balance the rxqs.
> >
> > This patch introduces a new interface configuration option to allow
> > ports to be automatically polled by PMDs on any NUMA node:
> >
> > ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
> >
> > The group assignment algorithm now has the ability to select the
> > lowest-loaded PMD on any NUMA, not just the local NUMA on which the
> > rxq of the port resides.
> >
> > If this option is not present or set to false, legacy behaviour applies.
> >
> > Co-authored-by: Anurag Agarwal <[email protected]>
> > Signed-off-by: Jan Scheurich <[email protected]>
> > Signed-off-by: Anurag Agarwal <[email protected]>
> > ---
> > Changes in v5:
> > - Addressed comments from<[email protected]>
> > - First schedule rxqs that are not enabled for cross-numa scheduling
> > - Follow this with rxqs that are enabled for cross-numa scheduling
> >
> 
> I don't think this is a correct fix for the issue reported. The root problem
> reported is not really that rxqs with cross-numa=true are assigned first, but
> that the pool of pmd resources is changing/overlapping during the
> assignments, i.e. in the reported case from a full pool to a fixed per-numa
> pool.
> 
> With the change you have now, you could have something like:
> 3 rxqs, (1 cross-numa) and 2 pmds.
> 
> cross-numa=true rxq load: 80%
> per-numa rxq loads: 45%, 40%
> 
> rxq assignment rounds
> 1.
> pmd0 = 45
> pmd1 = 0
> 
> 2.
> pmd0 = 45
> pmd1 = 40
> 
> 3.
> pmd0 = 45      = 40%
> pmd1 = 40 + 80 = 120%
> 
> when clearly the best way is:
> pmd0 = 80
> pmd1 = 45 + 40
> 

Could you help elaborate on this a bit more? Is PMD0 on NUMA0 and PMD1 on 
NUMA1? And the two per-numa rxqs (45%, 40%): which NUMA do they belong to? 
I need some more details to understand this scenario. 

> So it's not about which one gets assigned first; as shown, that can cause an
> issue whichever one is assigned first. The problem is that the algorithm for
> assignment is not designed for changing and overlapping ranges of pmds to
> assign rxqs to.

Here is the comment from [email protected]: 
> > > It may be better to schedule non-cross-numa rxqs first, and then 
> > > cross-numa rxqs.
> > > Otherwise, one numa may hold much more load because of later 
> > > scheduled non- cross-numa rxqs.

And here is the example shared:

> Considering the following situation:
> 
> We have 2 numa nodes, each numa node has 1 pmd.
> And we have 2 port(p0, p1), each port has 2 queues.
> p0 is configured as cross_numa, p1 is not.
> 
> each queue's workload,
> 
> rxq   p0q0 p0q1 p1q0 p1q1
> load  30   30   20   20
> 
> Based on your current implementation, the assignment will be:
> 
> p0q0 -> numa0 (30)
> p0q1 -> numa1 (30)
> p1q0 -> numa0 (30+20=50)
> p1q1 -> numa0 (50+20=70)
> 
> As the result, numa0 holds 70% workload but numa1 holds only 30%,
> because the later-assigned queues have numa affinity.
>
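
To make sure I follow, here is a small Python toy model of the greedy 
assignment in this example (my own sketch, not the actual lib/dpif-netdev.c 
code; the function name and tie-breaking are simplifications):

```python
# Toy model of greedy rxq-to-PMD assignment: 2 NUMA nodes, 1 PMD each.
# This is an illustrative sketch only, not the real dpif-netdev algorithm.

def assign(queues, n_pmds=2):
    """Greedily place each (name, load, numa) rxq on the least-loaded
    eligible PMD. numa=None means cross-numa polling is allowed."""
    pmd_load = [0] * n_pmds
    for name, load, numa in queues:
        eligible = range(n_pmds) if numa is None else [numa]
        pmd = min(eligible, key=lambda p: pmd_load[p])
        pmd_load[pmd] += load
    return pmd_load

cross = [("p0q0", 30, None), ("p0q1", 30, None)]   # cross-numa-polling=true
local = [("p1q0", 20, 0), ("p1q1", 20, 0)]         # affinity to NUMA 0

print(assign(cross + local))   # cross-numa first -> [70, 30]
print(assign(local + cross))   # per-numa first   -> [40, 60]
```

With the cross-numa queues scheduled first, the model reproduces the 70/30 
split quoted above; scheduling the per-numa queues first gives 40/60, which 
is the ordering v5 now uses.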

> 
> To fix this, you would probably need to do all the assignments first as per
> v4 and then do another round of checking and possibly moving some
> cross-numa=true rxqs. But that is further relying on estimates which you are
> making potentially inaccurate. If you are writing something to "move"
> individual rxqs after initial assignment, maybe it's better to rethink doing
> it in the ALB with the real loads and not estimates.
> 
> It is worth noting again that, while less flexible, if the rxq load was
> distributed on an interface with RSS etc., some pinning of phy rxqs can
> allow cross-numa and remove any inaccurate estimates from changing numa.
Do you mean that pinning of phy rxqs could alternatively be used to achieve 
equivalent cross-numa functionality, albeit less flexibly?
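
For context, I assume the pinning you refer to is via pmd-rxq-affinity, 
e.g. something like the below (interface name and core IDs are purely 
illustrative):

```shell
# Pin rxq 0 of a physical port to core 3 (on NUMA 0) and rxq 1 to core 27
# (on NUMA 1). Interface name and core IDs here are illustrative only.
ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:3,1:27"
```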

If scheduling cross-numa queues together with per-numa queues leads to 
inaccurate assignments, should we revisit this and consider enabling/supporting 
cross-numa polling at a global level to begin with?

> 
> thanks,
> Kevin.
> 
> > Changes in v4:
> > - Addressed comments from Kevin Traynor <[email protected]>
> >
> >   Please refer this thread for an earlier discussion on this topic
> >
> > https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
> >
> >   Documentation/topics/dpdk/pmd.rst |  23 +++++
> >   lib/dpif-netdev.c                 | 156 +++++++++++++++++++++++-------
> >   tests/pmd.at                      |  38 ++++++++
> >   vswitchd/vswitch.xml              |  20 ++++
> >   4 files changed, 201 insertions(+), 36 deletions(-)
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev