>-----Original Message-----
>From: Kevin Traynor <[email protected]>
>Sent: Friday, 24 October 2025 18:40
>To: Eli Britstein <[email protected]>; Ilya Maximets <[email protected]>;
>Eelco Chaudron <[email protected]>; [email protected]
>Cc: Simon Horman <[email protected]>; Maor Dickman
><[email protected]>
>Subject: Re: [ovs-dev] PMD Scheduling: Grouping RX Queues from Related Ports
>
>External email: Use caution opening links or attachments
>
>
>On 23/10/2025 15:08, Eli Britstein wrote:
>>
>>
>>> -----Original Message-----
>>> From: Ilya Maximets <[email protected]>
>>> Sent: Thursday, 23 October 2025 15:30
>>> To: Eelco Chaudron <[email protected]>; [email protected]; Kevin
>>> Traynor <[email protected]>
>>> Cc: Eli Britstein <[email protected]>; Simon Horman
>>> <[email protected]>; Maor Dickman <[email protected]>;
>>> [email protected]
>>> Subject: Re: [ovs-dev] PMD Scheduling: Grouping RX Queues from
>>> Related Ports
>>>
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On 10/23/25 1:48 PM, Eelco Chaudron via dev wrote:
>>>> Hi all,
>>>>
>>>> We'd like to bring a design discussion to the community regarding a
>>>> requirement for RX queues from different ports to be grouped on the
>>>> same PMD.
>>>> We've had some initial talks with the NVIDIA team (who are CC'd),
>>>> and I think this discussion will benefit from upstream feedback and
>>>> involvement.
>>>>
>>>> Here is the background and context:
>>>>
>>>> The goal is to automatically (i.e., without user configuration)
>>>> group together the same queue IDs from different, but related,
>>>> ports.  A key use case is an E-Switch manager (e.g., p0) and its VF
>>>> representors (e.g., pf0vf0, pf0vf1).
>>>
>>> Could you explain why it is a requirement to poll the same queue ID
>>> of different, though related, ports from the same thread?  It's not
>>> obvious.  I suspect that, in a typical setup with hardware offload,
>>> most of the ports will be related this way.
>> [Eli Britstein]
>>
>> With DOCA ports, we call rx-burst only for the ESW manager port (for
>> a specific queue).  In the same burst we get packets from this port
>> (e.g. p0) as well as from all of its representors (pf0vf0, pf0vf1,
>> etc.).
>> The HW is configured to set the mark field as metadata carrying the
>> port ID of each packet.
>> We then go over the burst and classify the packets into a per-port
>> (for that queue number) data structure.
>> The OVS model calls "input" per port, so we return the burst from
>> that per-port data structure.
>> Since this data structure is not thread safe, it works for us if we
>> force a given queue of all those ports to be processed by the same
>> PMD thread.
>
>Hi All. IIUC, the lack of thread safety in netdev_rxq_recv() calls to
>these related rxqs is the root of the issue.
>
>You mentioned the packets are already in a per-port/per-queue data
>structure, which seems good, but it's not thread safe.  Can it be made
>thread safe?
>
>> That PMD thread will loop over all of them (via its poll_list).  For
>> each it calls netdev_rxq_recv().
>> Under the hood, we do the above (reading a burst from HW only for the
>> ESW manager, classifying it, and returning the classified burst).
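
For illustration, a rough C sketch of that receive path, with made-up
names (this is not the actual DOCA netdev code):

#include <stdint.h>

#define MAX_BURST 32
#define MAX_PORTS 128

struct pkt;                          /* opaque packet handle */

struct per_port_queue {
    struct pkt *pkts[MAX_BURST];     /* packets classified to one port */
    int n;                           /* NOT thread safe                */
};

/* One bucket per related port for a given HW queue (bounds/overflow
 * handling omitted for brevity). */
static struct per_port_queue buckets[MAX_PORTS];

/* Hypothetical HW helpers: one rx-burst on the ESW manager queue; each
 * packet carries a HW-set mark that encodes its source port id. */
extern int hw_rx_burst(int esw_qid, struct pkt **pkts, int max);
extern uint32_t pkt_mark(const struct pkt *pkt);

static void
classify_burst(int esw_qid)
{
    struct pkt *pkts[MAX_BURST];
    int n = hw_rx_burst(esw_qid, pkts, MAX_BURST);

    for (int i = 0; i < n; i++) {
        uint32_t port_id = pkt_mark(pkts[i]);    /* p0, pf0vf0, ... */

        buckets[port_id].pkts[buckets[port_id].n++] = pkts[i];
    }
}

/* netdev_rxq_recv() for a given port and queue then simply drains
 * buckets[port_id].  Because the buckets are unlocked, every rxq of
 * the group must be polled by the same PMD thread. */
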
>>
>> For the scheduling (just for reference, in our downstream code) we
>> did the scheduling in two phases (a change in
>> sched_numa_list_schedule()):
>> The first iteration skips representor ports.  Only ESW manager ports
>> are scheduled, summing up the cycles of the port itself and of its
>> representors where needed.  The scheduled rxqs are kept in a list.
>> The second iteration schedules the representor ports.  They are not
>> scheduled by any algorithm; they simply get the PMD already chosen
>> for their ESW manager (with the help of the list from the first
>> iteration).
>>
>> This is tailored for DOCA mode.  As part of the effort to upstream
>> DOCA support, we wanted something more generic.
>>
>
>rxq_scheduling() seems like the wrong layer to be trying to consider
>netdev-specific thread safety requirements.  In terms of the actual
>grouping scheme, I don't think it's something really generic that
>other netdevs would likely use, so baking a specific scheme into
>rxq_scheduling() feels a bit dubious.
>
>Another issue worth mentioning is that the 'cycles' and 'roundrobin'
>algorithms spread the rxqs as evenly as possible over the available
>cores, so that would conflict with this type of grouping from a user
>perspective.

[Eli Britstein] Making the data structure thread safe still leaves
gaps.  Yes, it would allow scheduling each rxq independently (at the
cost of complexity, memory consumption and perhaps performance), but
bad scheduling can still occur.  Consider, for example, a scenario in
which the ESW port itself doesn't receive packets, but its representors
do.  An ESW rxq can then be scheduled on a PMD together with another
very busy port.  Since the representors depend on the RX of the ESW
port to get packets to process, the result is their starvation.  In
turn they consume fewer cycles and end up scheduled even worse.
If they are grouped together, and the cycles used for scheduling (for
example) are the sum of the group, this is avoided.
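
To make the grouping idea concrete, here is a rough sketch of a
group-aware 'cycles' pass, in the spirit of the two-phase change to
sched_numa_list_schedule() described above.  The types and the
pick_least_loaded_pmd() helper are hypothetical placeholders, not the
real dpif-netdev structures:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_rxq {
    int group_id;          /* same for p0, pf0vf0, pf0vf1 (same qid)  */
    bool is_group_leader;  /* true for the ESW manager rxq            */
    uint64_t cycles;       /* measured processing cycles of this rxq  */
    int pmd;               /* chosen PMD core, -1 if unassigned       */
};

/* Hypothetical helper: pick the least-loaded PMD assuming an extra
 * load of 'cycles'.  This is where the existing 'cycles' logic would
 * plug in. */
extern int pick_least_loaded_pmd(uint64_t cycles);

static void
schedule_grouped(struct sketch_rxq *rxqs, size_t n)
{
    /* Pass 1: schedule only group leaders, using the summed cycles of
     * the whole group, so the load estimate matches reality even when
     * the ESW rxq itself is idle. */
    for (size_t i = 0; i < n; i++) {
        if (!rxqs[i].is_group_leader) {
            continue;
        }
        uint64_t group_cycles = 0;
        for (size_t j = 0; j < n; j++) {
            if (rxqs[j].group_id == rxqs[i].group_id) {
                group_cycles += rxqs[j].cycles;
            }
        }
        rxqs[i].pmd = pick_least_loaded_pmd(group_cycles);
    }

    /* Pass 2: members do not go through the algorithm at all; they
     * inherit the PMD of their group leader. */
    for (size_t i = 0; i < n; i++) {
        if (rxqs[i].is_group_leader) {
            continue;
        }
        for (size_t j = 0; j < n; j++) {
            if (rxqs[j].is_group_leader
                && rxqs[j].group_id == rxqs[i].group_id) {
                rxqs[i].pmd = rxqs[j].pmd;
                break;
            }
        }
    }
}
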
There is another note: n_rxq is only meaningful to configure on the
ESW port.  Representors cannot be configured independently; they
inherit the n_rxq of their ESW port.  However, this policy can be
enforced in the netdev layer, so no DPIF support is required for it.

>
>thanks,
>Kevin.
>
>>>
>>>>
>>>> This new grouping logic must also respect existing scheduling
>>>> algorithms like 'cycles'.  For example, if 'cycles' is used, the
>>>> scheduler would need to base its decision on the sum of cycles for
>>>> all RX queues within that group.
>>>>
>>>> For this, we think we need some kind of netdev API that tells the
>>>> rxq_scheduling() function which port-queues belong to a group.
>>>> Once this group is known, the algorithm can perform the proper
>>>> calculation on the aggregated group.
>>>>
>>>> Does this approach sound reasonable?  We are very open to other
>>>> ideas on how to discover these related queues.
>>>>
>>>> Kevin, I've copied you in, as you did most of the existing
>>>> implementation, so any feedback is appreciated.
>>>>
>>>> Cheers,
>>>> Eelco
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
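
PS: regarding the netdev API idea quoted above, a very rough sketch of
what such a hook might look like.  The struct and callback names below
are made up for illustration only; an actual change would presumably
extend struct netdev_class in lib/netdev-provider.h:

#include <stdint.h>

struct netdev;   /* the real one lives in lib/netdev-provider.h */

/* Hypothetical extension, shown as a standalone struct rather than a
 * patch to the real struct netdev_class. */
struct netdev_class_group_sketch {
    /* Return a stable, non-zero group id for (netdev, qid), or 0 if
     * this rxq has no grouping constraint.  For the DOCA case, p0,
     * pf0vf0 and pf0vf1 would all return the same id for a given qid,
     * derived from their ESW manager port.  Netdevs that don't care
     * leave the callback NULL. */
    uint32_t (*rxq_group_id)(const struct netdev *netdev, int qid);
};

/* rxq_scheduling() / sched_numa_list_schedule() could then bucket rxqs
 * by this id, sum the measured cycles per bucket, place each bucket
 * with the existing 'cycles' algorithm, and pin all members of the
 * bucket to the chosen PMD. */
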
