Hi,

True, the cost of polling a packet from a physical port on a remote NUMA node is slightly higher than from a port on the local NUMA node, so cross-NUMA polling of rx queues has some overhead. However, the packet processing cost is influenced much more by the location of the target vhostuser ports. If the majority of the rx queue traffic is going to a VM on the other NUMA node, it is actually *better* to poll the packets in a PMD on the VM's NUMA node.
Long story short, OVS doesn't have sufficient data to correctly predict the actual rxq load when the rxq is assigned to another PMD in a different queue configuration. The rxq processing cycles measured on the current PMD are the best estimate we have for balancing the overall load on the PMDs. We need to live with the inevitable inaccuracies.

My main point is: these inaccuracies don't matter. The purpose of balancing the load over PMDs is *not* to minimize the total cycles spent by PMDs on processing packets. The PMDs run in a busy loop anyway and burn all cycles of the CPU. The purpose is to prevent a situation where some PMD unnecessarily gets congested (i.e. load > 95%) while others have a lot of spare capacity and could take over some rxqs.

Cross-NUMA polling of physical port rxqs has proven to be an extremely valuable tool to help OVS's cycle-based rxq-balancing algorithm do its job, and I strongly suggest we allow the proposed per-port opt-in option.

BR, Jan

From: Anurag Agarwal <[email protected]>
Sent: Thursday, 21 July 2022 07:15
To: [email protected]
Cc: Jan Scheurich <[email protected]>
Subject: RE: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

Hello Cheng,

With cross-numa enabled, we flatten the PMD list across NUMAs and select the least loaded PMD. Thus I would not like to consider the case below.

Regards,
Anurag

From: [email protected] <[email protected]>
Sent: Thursday, July 21, 2022 8:19 AM
To: Anurag Agarwal <[email protected]>
Cc: Jan Scheurich <[email protected]>
Subject: Re: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

Hi Anurag,

"If local numa has bandwidth for rxq, we are not supposed to assign a rxq to remote pmd."

Would you like to consider this case? If not, I think we don't have to resolve the cycles measurement issue for the cross-numa case.

________________________________
李成

From: Anurag Agarwal <[email protected]>
Date: 2022-07-21 10:21
To: [email protected]
CC: Jan Scheurich <[email protected]>
Subject: RE: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

+ Jan

Hello Cheng,

Thanks for your insightful comments. Please find my inputs inline.

Regards,
Anurag

From: [email protected] <[email protected]>
Sent: Monday, July 11, 2022 7:51 AM
To: Anurag Agarwal <[email protected]>
Subject: Re: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

Hi Anurag,

Sorry for the late reply, I was busy on a task for the last two weeks.

I think your proposal can cover the case I reported. It looks good to me.

>> Thanks for your review and positive feedback

However, to enable cross-numa rxq polling, we may have another problem to address. From my test, cross-numa polling has worse performance than NUMA-affinity polling (at least 10%). So if the local NUMA has bandwidth for an rxq, we are not supposed to assign the rxq to a remote pmd. Unfortunately, we can't tell whether a pmd is out of bandwidth from its assigned rxq cycles, because the rx batch size impacts the rxq cycles a lot in my test:

rx batch   cycles per pkt
1.00       5738
5.00       2353
12.15      1770
32.00      1533

The faster packets come in, the larger the rx batch size. And the more rxqs a pmd is assigned, the larger the rx batch size. Imagine that pmd pA has only one rxq assigned. Packets arrive at 1.00 pkt per 5738 cycles, so the rxq's rx batch size is 1.00. Now pA has 2 rxqs assigned, and each rxq has packets arriving at 1.00 pkt per 5738 cycles. The pmd spends 5738 cycles processing the first rxq, and then the second. When it comes back to the first rxq, that rxq has 2 packets ready (because 2*5738 cycles have passed), so the rx batch size becomes 2.
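For illustration only: the numbers above behave roughly like a model in which every poll of an rxq pays a fixed per-poll cost that is amortized over the returned batch, on top of a per-packet cost. The following standalone toy program (not OVS code; the two constants are simply fitted to the table above) reproduces the trend:

#include <stdio.h>

/*
 * Toy model, NOT OVS code: assume each poll of an rxq pays a fixed
 * per-poll cost that is amortized over the returned batch, plus a
 * roughly constant per-packet processing cost.  The two constants are
 * simply fitted to the measurements quoted above, for illustration.
 */
#define PER_POLL_CYCLES 4340.0   /* fixed cost per rx poll (fitted) */
#define PER_PKT_CYCLES  1395.0   /* cost per packet (fitted) */

static double
cycles_per_pkt(double batch)
{
    return PER_POLL_CYCLES / batch + PER_PKT_CYCLES;
}

int
main(void)
{
    const double batches[] = { 1.00, 5.00, 12.15, 32.00 };

    printf("rx batch   cycles per pkt (model)\n");
    for (size_t i = 0; i < sizeof batches / sizeof batches[0]; i++) {
        printf("%8.2f   %14.0f\n", batches[i], cycles_per_pkt(batches[i]));
    }
    return 0;
}

The practical consequence is the one discussed in this thread: the same rxq reports fewer cycles per packet when it is polled in large batches on a busy PMD than when it is polled alone, so cycles measured in the current placement are only an estimate for any other placement.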
>> Ok. Do you think this is a more generic problem with cycles measurement and
>> PMD utilization? Not specific to the cross-numa feature.

So it's hard to say whether a pmd is overloaded from the rxq cycles.

At last, I think the cross-numa feature is very nice. I will make an effort on this as well to cover cases in our company. Let's keep in sync on progress :)

>> Thanks

________________________________
李成

From: Anurag Agarwal <[email protected]>
Date: 2022-06-29 14:12
To: [email protected]
Subject: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

[Minor corrections in bullet numbering]

Hello [email protected],

I have the following proposal to consider on top of patch v4.

Initial step:
1) Sort rxqs in order of load from high to low.
2) Calculate and predict the total load on each NUMA for pinned rxqs as well as non-pinned, non-cross-numa rxqs that will be assigned to this NUMA in the future.
3) For each given rxq do the following:
4) If the rxq is not cross-numa enabled, select the local NUMA and the least loaded PMD on the local NUMA. Add the rxq load to the PMD.
5) If the rxq is cross-numa enabled, select the least loaded NUMA, then select the least loaded PMD from that NUMA. Add the cross-numa rxq load to the assigned NUMA.

We now go greedy in two steps: first greedy with the NUMA, followed by a PMD on the given NUMA during cross-numa assignments. This addresses the issue reported in the thread due to the flat PMD list, without compromising on the greedy principle.

Let me know if you think this approach handles the issue you reported. Based on your feedback I can follow up further.

Regards,
Anurag
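To make the two-level selection concrete, here is a minimal standalone sketch (illustration only, not the actual dpif-netdev code; all names and data structures are invented). It replays the 2-NUMA example reported earlier in the thread, where p0 is cross-numa enabled with two 30% rxqs and p1 is local to NUMA 0 with two 20% rxqs:

/*
 * Standalone sketch of the two-level greedy selection proposed above.
 * This is NOT the dpif-netdev implementation; names and data structures
 * are made up.  rxqs are assumed to be sorted by descending load (step 1).
 */
#include <stdio.h>

#define N_NUMA 2
#define PMDS_PER_NUMA 1

struct pmd { double load; };

struct numa {
    struct pmd pmds[PMDS_PER_NUMA];
    double load;                /* predicted + assigned load on this NUMA */
};

struct rxq {
    const char *name;
    int home_numa;              /* NUMA node of the port */
    int cross_numa;             /* cross-numa-polling enabled? */
    double load;                /* estimated load in % of one PMD */
};

static struct pmd *
least_loaded_pmd(struct numa *n)
{
    struct pmd *best = &n->pmds[0];

    for (int i = 1; i < PMDS_PER_NUMA; i++) {
        if (n->pmds[i].load < best->load) {
            best = &n->pmds[i];
        }
    }
    return best;
}

static int
least_loaded_numa(const struct numa numas[])
{
    int best = 0;

    for (int i = 1; i < N_NUMA; i++) {
        if (numas[i].load < numas[best].load) {
            best = i;
        }
    }
    return best;
}

int
main(void)
{
    static struct numa numas[N_NUMA];
    /* Example reported in this thread: p0 is cross-numa enabled,
     * p1 is local to NUMA 0; loads already sorted high to low. */
    struct rxq rxqs[] = {
        { "p0q0", 0, 1, 30 }, { "p0q1", 0, 1, 30 },
        { "p1q0", 0, 0, 20 }, { "p1q1", 0, 0, 20 },
    };
    size_t n_rxqs = sizeof rxqs / sizeof rxqs[0];

    /* Step 2: predict the per-NUMA load of the rxqs that can only be
     * placed on their local NUMA, before any assignment is made. */
    for (size_t i = 0; i < n_rxqs; i++) {
        if (!rxqs[i].cross_numa) {
            numas[rxqs[i].home_numa].load += rxqs[i].load;
        }
    }

    /* Steps 3-5: place each rxq. */
    for (size_t i = 0; i < n_rxqs; i++) {
        struct rxq *q = &rxqs[i];
        int id = q->cross_numa ? least_loaded_numa(numas) : q->home_numa;
        struct pmd *pmd = least_loaded_pmd(&numas[id]);

        pmd->load += q->load;
        if (q->cross_numa) {
            numas[id].load += q->load;   /* step 5 */
        }
        printf("%s (%g%%) -> numa%d\n", q->name, q->load, id);
    }
    printf("numa0: %g%%, numa1: %g%%\n", numas[0].load, numas[1].load);
    return 0;
}

Thanks to the prediction in step 2, both cross-numa rxqs end up on NUMA 1, giving a 40%/60% split instead of the 70%/30% split reported for v5.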
> P.S. There might still be some issues if the number of PMDs on the two NUMAs is
> asymmetric; that might need further discussion.
>
> > -----Original Message-----
> > From: Kevin Traynor <[email protected]>
> > Sent: Tuesday, June 28, 2022 9:01 PM
> > To: Anurag Agarwal <[email protected]>; Anurag Agarwal <[email protected]>; [email protected]
> > Cc: [email protected]
> > Subject: Re: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports
> >
> > On 27/06/2022 05:21, Anurag Agarwal wrote:
> > > Hello Kevin,
> > >
> > >> -----Original Message-----
> > >> From: Kevin Traynor <[email protected]>
> > >> Sent: Thursday, June 23, 2022 9:07 PM
> > >> To: Anurag Agarwal <[email protected]>; [email protected]
> > >> Cc: [email protected]
> > >> Subject: Re: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports
> > >>
> > >> Hi Anurag,
> > >>
> > >> On 23/06/2022 11:18, Anurag Agarwal wrote:
> > >>> From: Jan Scheurich <[email protected]>
> > >>>
> > >>> Today dpif-netdev considers PMD threads on a non-local NUMA node
> > >>> for automatic assignment of the rxqs of a port only if there are
> > >>> no local, non-isolated PMDs.
> > >>>
> > >>> On typical servers with both physical ports on one NUMA node, this
> > >>> often leaves the PMDs on the other NUMA node under-utilized, wasting
> > >>> CPU resources. The alternative, to manually pin the rxqs to PMDs on
> > >>> remote NUMA nodes, also has drawbacks as it limits OVS' ability to
> > >>> auto load-balance the rxqs.
> > >>>
> > >>> This patch introduces a new interface configuration option to
> > >>> allow ports to be automatically polled by PMDs on any NUMA node:
> > >>>
> > >>> ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
> > >>>
> > >>> The group assignment algorithm now has the ability to select the
> > >>> lowest loaded PMD on any NUMA, and not just on the local NUMA on
> > >>> which the rxq of the port resides.
> > >>>
> > >>> If this option is not present or set to false, legacy behaviour applies.
> > >>>
> > >>> Co-authored-by: Anurag Agarwal <[email protected]>
> > >>> Signed-off-by: Jan Scheurich <[email protected]>
> > >>> Signed-off-by: Anurag Agarwal <[email protected]>
> > >>> ---
> > >>> Changes in v5:
> > >>> - Addressed comments from <[email protected]>
> > >>> - First schedule rxqs that are not enabled for cross-numa scheduling
> > >>> - Follow this with rxqs that are enabled for cross-numa scheduling
> > >>>
> > >>
> > >> I don't think this is a correct fix for the issue reported. The root
> > >> problem reported is not really that rxqs with cross-numa=true are
> > >> assigned first, but that the pool of PMD resources is changing/overlapping
> > >> during the assignments, i.e. in the reported case from a full pool to a
> > >> fixed per-NUMA pool.
> > >>
> > >> With the change you have now, you could have something like:
> > >> 3 rxqs (1 cross-numa) and 2 pmds.
> > >>
> > >> cross-numa=true rxq load: 80%
> > >> per-numa rxq loads: 45%, 40%
> > >>
> > >> rxq assignment rounds
> > >> 1.
> > >> pmd0 = 45
> > >> pmd1 = 0
> > >>
> > >> 2.
> > >> pmd0 = 45
> > >> pmd1 = 40
> > >>
> > >> 3.
> > >> pmd0 = 45 = 45%
> > >> pmd1 = 40 + 80 = 120%
> > >>
> > >> when clearly the best way is:
> > >> pmd0 = 80
> > >> pmd1 = 45 + 40
> > >>
> > >
> > > Could you help elaborate on this a bit more? Is pmd0 on NUMA0 and pmd1 on
> > > NUMA1? Which NUMA do the two per-numa rxqs (45%, 40%) belong to?
> > > I need some more details to understand this scenario.
> > >
> >
> > pmd0 and pmd1 belong to NUMA 0. rxq 45% and rxq 40% belong to NUMA 0.
> >
> > To simplify the above, I've shown 2 rxqs and 2 pmds on NUMA 0, but that could
> > be replicated on other NUMAs too.
> >
> > rxq 80% cross-numa could belong to any NUMA.
>
> Thanks for the explanation and for pointing this out. I think the issue is that
> scheduling the cross-numa rxq later breaks the greedy principle.
>
> > >> So it's not about which one gets assigned first; as shown, that can
> > >> cause an issue whichever one is assigned first. The problem is that the
> > >> algorithm for assignment is not designed for changing and overlapping
> > >> ranges of pmds to assign rxqs to.
> > >
> > > Here is the comment from [email protected]:
> > >>>> It may be better to schedule non-cross-numa rxqs first, and then
> > >>>> cross-numa rxqs. Otherwise, one numa may hold much more load because
> > >>>> of later scheduled non-cross-numa rxqs.
> > >
> > > And here is the example shared:
> > >
> > >> Considering the following situation:
> > >>
> > >> We have 2 numa nodes, each numa node has 1 pmd.
> > >> And we have 2 ports (p0, p1), each port has 2 queues.
> > >> p0 is configured as cross_numa, p1 is not.
> > >>
> > >> Each queue's workload:
> > >>
> > >> rxq   p0q0  p0q1  p1q0  p1q1
> > >> load  30    30    20    20
> > >>
> > >> Based on your current implementation, the assignment will be:
> > >>
> > >> p0q0 -> numa0 (30)
> > >> p0q1 -> numa1 (30)
> > >> p1q0 -> numa0 (30+20=50)
> > >> p1q1 -> numa0 (50+20=70)
> > >>
> > >> As a result, numa0 holds 70% of the workload but numa1 holds only 30%,
> > >> because the later assigned queues have NUMA affinity.
> > >>
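As a side note, the 3-rxq scenario from Kevin's review above is easy to replay with a few lines of standalone code (illustration only, not the dpif-netdev implementation) to make the congested outcome concrete:

/*
 * Replay of the 3-rxq example quoted above (80% cross-numa rxq plus 45%
 * and 40% NUMA0-local rxqs, two PMDs on NUMA0), using a plain greedy
 * "least loaded PMD" pick with the non-cross-numa rxqs scheduled first.
 * Standalone illustration only, not the dpif-netdev code.
 */
#include <stdio.h>

int
main(void)
{
    double pmd[2] = { 0, 0 };              /* both PMDs on NUMA0 */
    double rxq_load[] = { 45, 40, 80 };    /* cross-numa rxq scheduled last */

    for (int i = 0; i < 3; i++) {
        int target = pmd[0] <= pmd[1] ? 0 : 1;   /* least loaded PMD */

        pmd[target] += rxq_load[i];
        printf("round %d: pmd0 = %g%%, pmd1 = %g%%\n", i + 1, pmd[0], pmd[1]);
    }
    /* Prints 45/0, 45/40 and finally 45/120: one PMD is congested even
     * though the split 80 vs. 45+40 (80%/85%) would have fit. */
    return 0;
}

Running the same loop with the 80% cross-numa rxq placed first instead gives 80%/85%, which is fine here but, as the 2-port example above shows, scheduling cross-numa rxqs first can skew the per-NUMA totals; neither fixed ordering covers both cases.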
> > Yes, I understand the comment and it highlights a potential issue. My
> > concern is that your solution only works around that issue, but causes other
> > potential issues (like in the example I gave) because it does not fix the
> > root cause.
> >
> > >> To fix this, you would probably need to do all the assignments first as
> > >> per v4 and then do another round of checking and possibly moving some
> > >> cross-numa=true rxqs. But that is further relying on estimates which you
> > >> are making potentially inaccurate. If you are writing something to "move"
> > >> individual rxqs after initial assignment, maybe it's better to rethink
> > >> doing it in the ALB with the real loads and not estimates.
> > >>
> > >> It is worth noting again that while less flexible, if the rxq load was
> > >> distributed on an interface with RSS etc., some pinning of phy rxqs can
> > >> allow cross-numa and remove any inaccurate estimates from changing numa.
> > > Do you mean pinning of phy rxqs can also be used alternatively to achieve
> > > cross-numa equivalent functionality, although this is less flexible?
> > >
> >
> > Yes, the user can pin rxqs to any NUMA, but OVS will not then reassign them.
> > It will, however, consider the load from them when placing other rxqs it can
> > assign. I think the pros and cons were discussed in previous threads related
> > to these patches.
> >
> > > If scheduling cross-numa queues along with per-numa queues leads to
> > > inaccurate assignment, should we revisit and think about
> > > enabling/supporting cross-numa polling at a global level to begin with?
> > >
> >
> > It would cover the problems that have been highlighted most recently, yes,
> > but it would also mean every rxq is open to moving NUMA, and that would mean
> > more inaccuracies in estimates, leading to inaccurate assignments from that.
> > It also might not be what is wanted by a user. Even in your own case, I think
> > you said you only want this to apply to some interfaces.
>
> Agree there would be inaccuracy, but isn't that because the overhead of
> cross-numa polling is not accounted for, rather than due to the algorithm?
>
> > >>
> > >> thanks,
> > >> Kevin.
> > >>
> > >>> Changes in v4:
> > >>> - Addressed comments from Kevin Traynor <[email protected]>
> > >>>
> > >>> Please refer to this thread for an earlier discussion on this topic:
> > >>> https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
> > >>>
> > >>>  Documentation/topics/dpdk/pmd.rst |  23 +++
> > >>>  lib/dpif-netdev.c                 | 156 +++++++++++++++++++++++-------
> > >>>  tests/pmd.at                      |  38 ++++++++
> > >>>  vswitchd/vswitch.xml              |  20 ++++
> > >>>  4 files changed, 201 insertions(+), 36 deletions(-)

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
