Hello Kevin,

> -----Original Message-----
> From: Kevin Traynor <[email protected]>
> Sent: Thursday, June 23, 2022 9:07 PM
> To: Anurag Agarwal <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling
> on selected ports
>
> Hi Anurag,
>
> On 23/06/2022 11:18, Anurag Agarwal wrote:
> > From: Jan Scheurich <[email protected]>
> >
> > Today dpif-netdev considers PMD threads on a non-local NUMA node for
> > automatic assignment of the rxqs of a port only if there are no local,
> > non-isolated PMDs.
> >
> > On typical servers with both physical ports on one NUMA node, this
> > often leaves the PMDs on the other NUMA node under-utilized, wasting
> > CPU resources. The alternative, to manually pin the rxqs to PMDs on
> > remote NUMA nodes, also has drawbacks as it limits OVS' ability to
> > auto load-balance the rxqs.
> >
> > This patch introduces a new interface configuration option to allow
> > ports to be automatically polled by PMDs on any NUMA node:
> >
> >     ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
> >
> > The group assignment algorithm now has the ability to select the
> > lowest loaded PMD on any NUMA node, not just the local NUMA node on
> > which the rxq of the port resides.
> >
> > If this option is not present or set to false, legacy behaviour
> > applies.
> >
> > Co-authored-by: Anurag Agarwal <[email protected]>
> > Signed-off-by: Jan Scheurich <[email protected]>
> > Signed-off-by: Anurag Agarwal <[email protected]>
> > ---
> > Changes in v5:
> > - Addressed comments from <[email protected]>
> > - First schedule rxqs that are not enabled for cross-numa scheduling
> > - Follow this with rxqs that are enabled for cross-numa scheduling
>
> I don't think this is a correct fix for the issue reported. The root
> problem reported is not really that rxqs with cross-numa=true are
> assigned first, but that the pool of pmd resources is changing/overlapping
> during the assignments, i.e. in the reported case going from a full pool
> to a fixed per-numa pool.
>
> With the change you have now, you could have something like:
> 3 rxqs (1 cross-numa) and 2 pmds.
>
> cross-numa=true rxq load: 80%
> per-numa rxq loads: 45%, 40%
>
> rxq assignment rounds
> 1.
> pmd0 = 45
> pmd1 = 0
>
> 2.
> pmd0 = 45
> pmd1 = 40
>
> 3.
> pmd0 = 45 = 45%
> pmd1 = 40 + 80 = 120%
>
> when clearly the best way is:
> pmd0 = 80
> pmd1 = 45 + 40
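To make sure I follow the arithmetic, below is a small standalone C sketch of
the greedy "lowest loaded PMD" assignment as I read your example. This is my
reading, not the actual dpif-netdev code, and I had to assume a placement:
pmd0 on NUMA0, pmd1 on NUMA1, the 45% rxq local to NUMA0 and the 40% rxq
local to NUMA1.

    #include <stdio.h>

    /* Return the index of the least loaded PMD. */
    static int
    lowest_loaded(const int *pmd, int n)
    {
        int best = 0;

        for (int i = 1; i < n; i++) {
            if (pmd[i] < pmd[best]) {
                best = i;
            }
        }
        return best;
    }

    int
    main(void)
    {
        int pmd[2] = { 0, 0 };   /* pmd[0] on NUMA0, pmd[1] on NUMA1. */

        /* v5 order: per-numa rxqs first, each restricted to its local PMD. */
        pmd[0] += 45;            /* 45% rxq, assumed local to NUMA0. */
        pmd[1] += 40;            /* 40% rxq, assumed local to NUMA1. */

        /* Cross-numa rxq last: every PMD is eligible, pick lowest loaded. */
        pmd[lowest_loaded(pmd, 2)] += 80;

        printf("pmd0 = %d%%, pmd1 = %d%%\n", pmd[0], pmd[1]);
        /* Prints "pmd0 = 45%, pmd1 = 120%", matching the rounds above,
         * vs. the ideal split of pmd0 = 80%, pmd1 = 85%. */
        return 0;
    }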
Could you help elaborate on this a bit more? Is PMD0 on NUMA0 and PMD1 on
NUMA1? And the two per-numa rxqs (45%, 40%), which NUMA do they belong to?
I need some more details to understand this scenario.

> So it's not about which one gets assigned first; as shown, that can cause
> an issue whichever one is assigned first. The problem is that the
> algorithm for assignment is not designed for changing and overlapping
> ranges of pmds to assign rxqs to.

Here is the comment from [email protected]:

> > It may be better to schedule non-cross-numa rxqs first, and then
> > cross-numa rxqs.
> > Otherwise, one numa may hold much more load because of later
> > scheduled non-cross-numa rxqs.

And here is the example shared:

> Considering the following situation:
>
> We have 2 numa nodes, each numa node has 1 pmd.
> And we have 2 ports (p0, p1), each port has 2 queues.
> p0 is configured as cross_numa, p1 is not.
>
> Each queue's workload:
>
> rxq   p0q0  p0q1  p1q0  p1q1
> load   30    30    20    20
>
> Based on your current implementation, the assignment will be:
>
> p0q0 -> numa0 (30)
> p0q1 -> numa1 (30)
> p1q0 -> numa0 (30+20=50)
> p1q1 -> numa0 (50+20=70)
>
> As a result, numa0 holds 70% of the workload but numa1 holds only 30%,
> because the later assigned queues have numa affinity.

> To fix this, you would probably need to do all the assignments first as
> per v4 and then do another round of checking and possibly moving some
> cross-numa=true rxqs. But that is further relying on estimates which you
> are making potentially inaccurate. If you are writing something to "move"
> individual rxqs after initial assignment, maybe it's better to rethink
> doing it in the ALB with the real loads and not estimates.
>
> It is worth noting again that while less flexible, if the rxq load was
> distributed on an interface with RSS etc., some pinning of phy rxqs can
> allow cross-numa and remove any inaccurate estimates from changing numa.

Do you mean that pinning of phy rxqs can also be used as an alternative way
to achieve cross-NUMA polling, although it is less flexible? (See the sketch
at the end of this mail for how I read this.)

If scheduling cross-numa queues along with per-numa queues leads to
inaccurate assignment, should we revisit and think about enabling/supporting
cross-numa polling at a global level to begin with?

> thanks,
> Kevin.

> > Changes in v4:
> > - Addressed comments from Kevin Traynor <[email protected]>
> >
> > Please refer to this thread for an earlier discussion on this topic:
> >
> > https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
> >
> >  Documentation/topics/dpdk/pmd.rst |  23 +++++
> >  lib/dpif-netdev.c                 | 156 +++++++++++++++++++++++-------
> >  tests/pmd.at                      |  38 ++++++++
> >  vswitchd/vswitch.xml              |  20 ++++
> >  4 files changed, 201 insertions(+), 36 deletions(-)
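PS: Regarding the pinning alternative above, here is roughly what I
understand it would look like, using the existing pmd-rxq-affinity option.
This is only a sketch; the interface name dpdk0 and the core IDs are
invented for illustration, assuming core 3 is on NUMA0 and core 7 on NUMA1:

    # Pin rxq 0 of dpdk0 to the PMD on core 3 (local NUMA) and rxq 1 to
    # the PMD on core 7 (remote NUMA); core IDs are hypothetical.
    ovs-vsctl set interface dpdk0 other_config:pmd-rxq-affinity="0:3,1:7"

With RSS spreading the traffic over the two rxqs, the cross-NUMA split is
then fixed up front and no longer depends on load estimates, at the cost of
the flexibility you mention. Is that the idea?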
