Hi,

True, the cost of polling a packet from a physical port on a remote NUMA node is slightly higher than from a port on the local NUMA node, so cross-NUMA polling of rx queues has some overhead. However, the packet processing cost is influenced much more by the location of the target vhostuser ports. If the majority of the rx queue traffic is going to a VM on the other NUMA node, it is actually *better* to poll the packets in a PMD on the VM's NUMA node.
Long story short, OVS doesn't have sufficient data to correctly predict the actual rxq load when the rxq is assigned to another PMD in a different queue configuration. The rxq processing cycles measured on the current PMD are the best estimate we have for balancing the overall load on the PMDs. We need to live with the inevitable inaccuracies.

My main point is: these inaccuracies don't matter. The purpose of balancing the load over PMDs is *not* to minimize the total cycles spent by PMDs on processing packets. The PMDs run in a busy loop anyway and burn all cycles of the CPU. The purpose is to prevent a situation where some PMD unnecessarily gets congested (i.e. load > 95%) while others have a lot of spare capacity and could take over some rxqs.

Cross-NUMA polling of physical port rxqs has proven to be an extremely valuable tool to help OVS's cycle-based rxq-balancing algorithm do its job, and I strongly suggest we allow the proposed per-port opt-in option.

BR, Jan

From: Anurag Agarwal <[email protected]>
Sent: Thursday, 21 July 2022 07:15
To: [email protected]
Cc: Jan Scheurich <[email protected]>
Subject: RE: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

Hello Cheng,

With cross-numa enabled, we flatten the PMD list across NUMAs and select the least loaded PMD. Thus I would not like to consider the case below.

Regards,
Anurag

From: [email protected] <[email protected]>
Sent: Thursday, July 21, 2022 8:19 AM
To: Anurag Agarwal <[email protected]>
Cc: Jan Scheurich <[email protected]>
Subject: Re: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

Hi Anurag,

"If local numa has bandwidth for rxq, we are not supposed to assign a rxq to remote pmd."

Would you like to consider this case? If not, I think we don't have to resolve the cycles measurement issue for the cross-numa case.

________________________________
李成

From: Anurag Agarwal <[email protected]>
Date: 2022-07-21 10:21
To: [email protected]
CC: Jan Scheurich <[email protected]>
Subject: RE: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

+ Jan

Hello Cheng,

Thanks for your insightful comments. Please find my inputs inline.

Regards,
Anurag

From: [email protected] <[email protected]>
Sent: Monday, July 11, 2022 7:51 AM
To: Anurag Agarwal <[email protected]>
Subject: Re: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

Hi Anurag,

Sorry for the late reply, I was busy on a task for the last two weeks.

I think your proposal can cover the case I reported. It looks good to me.

>> Thanks for your review and positive feedback

However, to enable cross-numa rxq polling, we may have another problem to address. From my test, cross-numa polling has worse performance than NUMA-affinity polling (at least 10%). So if the local NUMA has bandwidth for an rxq, we are not supposed to assign the rxq to a remote pmd. Unfortunately, we can't tell whether a pmd is out of bandwidth from its assigned rxq cycles, because the rx batch size impacts the rxq cycles a lot in my test:

rx batch   cycles per pkt
1.00       5738
5.00       2353
12.15      1770
32.00      1533

The faster packets come in, the larger the rx batch size. And the more rxqs a pmd is assigned, the larger the rx batch size. Imagine that pmd pA has only one rxq assigned. Packets arrive at 1.00 pkt per 5738 cycles, so the rxq's rx batch size is 1.00. Now pA has 2 rxqs assigned, and each rxq has packets arriving at 1.00 pkt per 5738 cycles. The pmd spends 5738 cycles processing the first rxq, and then the second. When it comes back to the first rxq, that rxq has 2 packets ready (because 2*5738 cycles have passed), so the rx batch size becomes 2.
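For illustration only: the numbers above behave roughly like a model in which every poll of an rxq pays a fixed per-poll cost that is amortized over the returned batch, on top of a per-packet cost. The following standalone toy program (not OVS code; the two constants are simply fitted to the table above) reproduces the trend:

#include <stdio.h>

/*
 * Toy model, NOT OVS code: assume each poll of an rxq pays a fixed
 * per-poll cost that is amortized over the returned batch, plus a
 * roughly constant per-packet processing cost.  The two constants are
 * simply fitted to the measurements quoted above, for illustration.
 */
#define PER_POLL_CYCLES 4340.0   /* fixed cost per rx poll (fitted) */
#define PER_PKT_CYCLES  1395.0   /* cost per packet (fitted) */

static double
cycles_per_pkt(double batch)
{
    return PER_POLL_CYCLES / batch + PER_PKT_CYCLES;
}

int
main(void)
{
    const double batches[] = { 1.00, 5.00, 12.15, 32.00 };

    printf("rx batch   cycles per pkt (model)\n");
    for (size_t i = 0; i < sizeof batches / sizeof batches[0]; i++) {
        printf("%8.2f   %14.0f\n", batches[i], cycles_per_pkt(batches[i]));
    }
    return 0;
}

The practical consequence is the one discussed in this thread: the same rxq reports fewer cycles per packet when it is polled in large batches on a busy PMD than when it is polled alone, so cycles measured in the current placement are only an estimate for any other placement.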
>> Ok. Do you think this is a more generic problem with cycles measurement and
>> PMD utilization? Not specific to the cross-numa feature.

So it's hard to say whether a pmd is overloaded from the rxq cycles.

At last, I think the cross-numa feature is very nice. I will make an effort on this as well to cover cases in our company. Let's keep in sync on progress :)

>> Thanks

________________________________
李成

From: Anurag Agarwal <[email protected]>
Date: 2022-06-29 14:12
To: [email protected]
Subject: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

[Minor corrections in bullet numbering]

Hello [email protected],

I have the following proposal to consider on top of patch v4.

Initial step:
1) Sort rxqs in order of load from high to low.
2) Calculate and predict the total load on each NUMA for pinned rxqs as well as non-pinned, non-cross-numa rxqs that will be assigned to this NUMA in the future.
3) For each given rxq do the following:
4) If the rxq is not cross-numa enabled, select the local NUMA and the least loaded PMD on the local NUMA. Add the rxq load to the PMD.
5) If the rxq is cross-numa enabled, select the least loaded NUMA, then select the least loaded PMD from that NUMA. Add the cross-numa rxq load to the assigned NUMA.

We now go greedy in two steps: first greedy with the NUMA, followed by a PMD on the given NUMA during cross-numa assignments. This addresses the issue reported in the thread due to the flat PMD list, without compromising on the greedy principle.

Let me know if you think this approach handles the issue you reported. Based on your feedback I can follow up further.

Regards,
Anurag
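To make the two-level selection concrete, here is a minimal standalone sketch (illustration only, not the actual dpif-netdev code; all names and data structures are invented). It replays the 2-NUMA example reported earlier in the thread, where p0 is cross-numa enabled with two 30% rxqs and p1 is local to NUMA 0 with two 20% rxqs:

/*
 * Standalone sketch of the two-level greedy selection proposed above.
 * This is NOT the dpif-netdev implementation; names and data structures
 * are made up.  rxqs are assumed to be sorted by descending load (step 1).
 */
#include <stdio.h>

#define N_NUMA 2
#define PMDS_PER_NUMA 1

struct pmd { double load; };

struct numa {
    struct pmd pmds[PMDS_PER_NUMA];
    double load;                /* predicted + assigned load on this NUMA */
};

struct rxq {
    const char *name;
    int home_numa;              /* NUMA node of the port */
    int cross_numa;             /* cross-numa-polling enabled? */
    double load;                /* estimated load in % of one PMD */
};

static struct pmd *
least_loaded_pmd(struct numa *n)
{
    struct pmd *best = &n->pmds[0];

    for (int i = 1; i < PMDS_PER_NUMA; i++) {
        if (n->pmds[i].load < best->load) {
            best = &n->pmds[i];
        }
    }
    return best;
}

static int
least_loaded_numa(const struct numa numas[])
{
    int best = 0;

    for (int i = 1; i < N_NUMA; i++) {
        if (numas[i].load < numas[best].load) {
            best = i;
        }
    }
    return best;
}

int
main(void)
{
    static struct numa numas[N_NUMA];
    /* Example reported in this thread: p0 is cross-numa enabled,
     * p1 is local to NUMA 0; loads already sorted high to low. */
    struct rxq rxqs[] = {
        { "p0q0", 0, 1, 30 }, { "p0q1", 0, 1, 30 },
        { "p1q0", 0, 0, 20 }, { "p1q1", 0, 0, 20 },
    };
    size_t n_rxqs = sizeof rxqs / sizeof rxqs[0];

    /* Step 2: predict the per-NUMA load of the rxqs that can only be
     * placed on their local NUMA, before any assignment is made. */
    for (size_t i = 0; i < n_rxqs; i++) {
        if (!rxqs[i].cross_numa) {
            numas[rxqs[i].home_numa].load += rxqs[i].load;
        }
    }

    /* Steps 3-5: place each rxq. */
    for (size_t i = 0; i < n_rxqs; i++) {
        struct rxq *q = &rxqs[i];
        int id = q->cross_numa ? least_loaded_numa(numas) : q->home_numa;
        struct pmd *pmd = least_loaded_pmd(&numas[id]);

        pmd->load += q->load;
        if (q->cross_numa) {
            numas[id].load += q->load;   /* step 5 */
        }
        printf("%s (%g%%) -> numa%d\n", q->name, q->load, id);
    }
    printf("numa0: %g%%, numa1: %g%%\n", numas[0].load, numas[1].load);
    return 0;
}

Thanks to the prediction in step 2, both cross-numa rxqs end up on NUMA 1, giving a 40%/60% split instead of the 70%/30% split reported for v5.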
> P.S. There might still be some issues if the number of PMDs on the two NUMAs is
> asymmetric; that might need further discussion.
>
> > -----Original Message-----
> > From: Kevin Traynor <[email protected]>
> > Sent: Tuesday, June 28, 2022 9:01 PM
> > To: Anurag Agarwal <[email protected]>; Anurag Agarwal <[email protected]>; [email protected]
> > Cc: [email protected]
> > Subject: Re: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports
> >
> > On 27/06/2022 05:21, Anurag Agarwal wrote:
> > > Hello Kevin,
> > >
> > >> -----Original Message-----
> > >> From: Kevin Traynor <[email protected]>
> > >> Sent: Thursday, June 23, 2022 9:07 PM
> > >> To: Anurag Agarwal <[email protected]>; [email protected]
> > >> Cc: [email protected]
> > >> Subject: Re: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports
> > >>
> > >> Hi Anurag,
> > >>
> > >> On 23/06/2022 11:18, Anurag Agarwal wrote:
> > >>> From: Jan Scheurich <[email protected]>
> > >>>
> > >>> Today dpif-netdev considers PMD threads on a non-local NUMA node
> > >>> for automatic assignment of the rxqs of a port only if there are
> > >>> no local, non-isolated PMDs.
> > >>>
> > >>> On typical servers with both physical ports on one NUMA node, this
> > >>> often leaves the PMDs on the other NUMA node under-utilized, wasting
> > >>> CPU resources. The alternative, to manually pin the rxqs to PMDs on
> > >>> remote NUMA nodes, also has drawbacks as it limits OVS' ability to
> > >>> auto load-balance the rxqs.
> > >>>
> > >>> This patch introduces a new interface configuration option to
> > >>> allow ports to be automatically polled by PMDs on any NUMA node:
> > >>>
> > >>> ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
> > >>>
> > >>> The group assignment algorithm now has the ability to select the
> > >>> lowest loaded PMD on any NUMA, and not just on the local NUMA on
> > >>> which the rxq of the port resides.
> > >>>
> > >>> If this option is not present or set to false, legacy behaviour applies.
> > >>>
> > >>> Co-authored-by: Anurag Agarwal <[email protected]>
> > >>> Signed-off-by: Jan Scheurich <[email protected]>
> > >>> Signed-off-by: Anurag Agarwal <[email protected]>
> > >>> ---
> > >>> Changes in v5:
> > >>> - Addressed comments from <[email protected]>
> > >>> - First schedule rxqs that are not enabled for cross-numa scheduling
> > >>> - Follow this with rxqs that are enabled for cross-numa scheduling
> > >>>
> > >>
> > >> I don't think this is a correct fix for the issue reported. The root
> > >> problem reported is not really that rxqs with cross-numa=true are
> > >> assigned first, but that the pool of PMD resources is changing/overlapping
> > >> during the assignments, i.e. in the reported case from a full pool to a
> > >> fixed per-NUMA pool.
> > >>
> > >> With the change you have now, you could have something like:
> > >> 3 rxqs (1 cross-numa) and 2 pmds.
> > >>
> > >> cross-numa=true rxq load: 80%
> > >> per-numa rxq loads: 45%, 40%
> > >>
> > >> rxq assignment rounds
> > >> 1.
> > >> pmd0 = 45
> > >> pmd1 = 0
> > >>
> > >> 2.
> > >> pmd0 = 45
> > >> pmd1 = 40
> > >>
> > >> 3.
> > >> pmd0 = 45 = 45%
> > >> pmd1 = 40 + 80 = 120%
> > >>
> > >> when clearly the best way is:
> > >> pmd0 = 80
> > >> pmd1 = 45 + 40
> > >>
> > >
> > > Could you help elaborate on this a bit more? Is pmd0 on NUMA0 and pmd1 on
> > > NUMA1? Which NUMA do the two per-numa rxqs (45%, 40%) belong to?
> > > I need some more details to understand this scenario.
> > >
> >
> > pmd0 and pmd1 belong to NUMA 0. rxq 45% and rxq 40% belong to NUMA 0.
> >
> > To simplify the above, I've shown 2 rxqs and 2 pmds on NUMA 0, but that could
> > be replicated on other NUMAs too.
> >
> > rxq 80% cross-numa could belong to any NUMA.
>
> Thanks for the explanation and for pointing this out. I think the issue is that
> scheduling the cross-numa rxq later breaks the greedy principle.
>
> > >> So it's not about which one gets assigned first; as shown, that can
> > >> cause an issue whichever one is assigned first. The problem is that the
> > >> algorithm for assignment is not designed for changing and overlapping
> > >> ranges of pmds to assign rxqs to.
> > >
> > > Here is the comment from [email protected]:
> > >>>> It may be better to schedule non-cross-numa rxqs first, and then
> > >>>> cross-numa rxqs. Otherwise, one numa may hold much more load because
> > >>>> of later scheduled non-cross-numa rxqs.
> > >
> > > And here is the example shared:
> > >
> > >> Considering the following situation:
> > >>
> > >> We have 2 numa nodes, each numa node has 1 pmd.
> > >> And we have 2 ports (p0, p1), each port has 2 queues.
> > >> p0 is configured as cross_numa, p1 is not.
> > >>
> > >> Each queue's workload:
> > >>
> > >> rxq   p0q0  p0q1  p1q0  p1q1
> > >> load  30    30    20    20
> > >>
> > >> Based on your current implementation, the assignment will be:
> > >>
> > >> p0q0 -> numa0 (30)
> > >> p0q1 -> numa1 (30)
> > >> p1q0 -> numa0 (30+20=50)
> > >> p1q1 -> numa0 (50+20=70)
> > >>
> > >> As a result, numa0 holds 70% of the workload but numa1 holds only 30%,
> > >> because the later assigned queues have NUMA affinity.
> > >>
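As a side note, the 3-rxq scenario from Kevin's review above is easy to replay with a few lines of standalone code (illustration only, not the dpif-netdev implementation) to make the congested outcome concrete:

/*
 * Replay of the 3-rxq example quoted above (80% cross-numa rxq plus 45%
 * and 40% NUMA0-local rxqs, two PMDs on NUMA0), using a plain greedy
 * "least loaded PMD" pick with the non-cross-numa rxqs scheduled first.
 * Standalone illustration only, not the dpif-netdev code.
 */
#include <stdio.h>

int
main(void)
{
    double pmd[2] = { 0, 0 };              /* both PMDs on NUMA0 */
    double rxq_load[] = { 45, 40, 80 };    /* cross-numa rxq scheduled last */

    for (int i = 0; i < 3; i++) {
        int target = pmd[0] <= pmd[1] ? 0 : 1;   /* least loaded PMD */

        pmd[target] += rxq_load[i];
        printf("round %d: pmd0 = %g%%, pmd1 = %g%%\n", i + 1, pmd[0], pmd[1]);
    }
    /* Prints 45/0, 45/40 and finally 45/120: one PMD is congested even
     * though the split 80 vs. 45+40 (80%/85%) would have fit. */
    return 0;
}

Running the same loop with the 80% cross-numa rxq placed first instead gives 80%/85%, which is fine here but, as the 2-port example above shows, scheduling cross-numa rxqs first can skew the per-NUMA totals; neither fixed ordering covers both cases.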
> > Yes, I understand the comment and it highlights a potential issue. My
> > concern is that your solution only works around that issue, but causes other
> > potential issues (like in the example I gave) because it does not fix the
> > root cause.
> >
> > >> To fix this, you would probably need to do all the assignments first as
> > >> per v4 and then do another round of checking and possibly moving some
> > >> cross-numa=true rxqs. But that is further relying on estimates which you
> > >> are making potentially inaccurate. If you are writing something to "move"
> > >> individual rxqs after initial assignment, maybe it's better to rethink
> > >> doing it in the ALB with the real loads and not estimates.
> > >>
> > >> It is worth noting again that while less flexible, if the rxq load was
> > >> distributed on an interface with RSS etc., some pinning of phy rxqs can
> > >> allow cross-numa and remove any inaccurate estimates from changing numa.
> > > Do you mean pinning of phy rxqs can also be used alternatively to achieve
> > > cross-numa equivalent functionality, although this is less flexible?
> > >
> >
> > Yes, the user can pin rxqs to any NUMA, but OVS will not then reassign them.
> > It will, however, consider the load from them when placing other rxqs it can
> > assign. I think the pros and cons were discussed in previous threads related
> > to these patches.
> >
> > > If scheduling cross-numa queues along with per-numa queues leads to
> > > inaccurate assignment, should we revisit and think about
> > > enabling/supporting cross-numa polling at a global level to begin with?
> > >
> >
> > It would cover the problems that have been highlighted most recently, yes,
> > but it would also mean every rxq is open to moving NUMA, and that would mean
> > more inaccuracies in estimates, leading to inaccurate assignments from that.
> > It also might not be what is wanted by a user. Even in your own case, I think
> > you said you only want this to apply to some interfaces.
>
> Agree there would be inaccuracy, but isn't that because the overhead of
> cross-numa polling is not accounted for, rather than due to the algorithm?
>
> > >>
> > >> thanks,
> > >> Kevin.
> > >>
> > >>> Changes in v4:
> > >>> - Addressed comments from Kevin Traynor <[email protected]>
> > >>>
> > >>> Please refer to this thread for an earlier discussion on this topic:
> > >>> https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
> > >>>
> > >>>  Documentation/topics/dpdk/pmd.rst |  23 +++
> > >>>  lib/dpif-netdev.c                 | 156 +++++++++++++++++++++++-------
> > >>>  tests/pmd.at                      |  38 ++++++++
> > >>>  vswitchd/vswitch.xml              |  20 ++++
> > >>>  4 files changed, 201 insertions(+), 36 deletions(-)

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
