Re: [ovs-dev] PMD Scheduling: Grouping RX Queues from Related Ports

Kevin Traynor via dev Fri, 24 Oct 2025 08:40:41 -0700

On 23/10/2025 15:08, Eli Britstein wrote:
> 
> 
>> -----Original Message-----
>> From: Ilya Maximets <[email protected]>
>> Sent: Thursday, 23 October 2025 15:30
>> To: Eelco Chaudron <[email protected]>; [email protected]; Kevin
>> Traynor <[email protected]>
>> Cc: Eli Britstein <[email protected]>; Simon Horman <[email protected]>;
>> Maor Dickman <[email protected]>; [email protected]
>> Subject: Re: [ovs-dev] PMD Scheduling: Grouping RX Queues from Related Ports
>>
>> External email: Use caution opening links or attachments
>>
>>
>> On 10/23/25 1:48 PM, Eelco Chaudron via dev wrote:
>>> Hi all,
>>>
>>> We’d like to bring a design discussion to the community regarding a
>>> requirement for RX queues from different ports to be grouped on the same
>> PMD.
>>> We’ve had some initial talks with the NVIDIA team (who are CC’d), and
>>> I think this discussion will benefit from upstream feedback and involvement.
>>>
>>> Here is the background and context:
>>>
>>> The goal is to automatically (i.e., without user configuration) group
>>> together the same queue IDs from different, but related, ports. A key
>>> use case is an E-Switch manager (e.g., p0) and its VF representatives (e.g.,
>> pf0vf0, pf0vf1).
>>
>> Could you explain why this is a requirement to poll the same queue ID of
>> different, though related, ports by the same thread?  It's not obvious.
>> I suspect, in a typical setup with hardware offload most of the ports will be
>> related this way.
> [Eli Britstein] 
> 
> With DOCA ports, we call rx-burst only for the ESW manager port (of a 
> specific queue). In the same burst we get packets from this port (e.g. p0) as 
> well as of all its representors (pf0vf0, pf0vf1 etc).
> The HW is configured to set the mark field as a metadata with the port-id of 
> that packet.
> Then, we go over this burst and classify the packets, to a per-port (of that 
> queue #) data structure.
> OVS model is calling "input" per port. We then return the burst of that data 
> structure.
> Since this data structure is not thread safe, it works for us if we force the 
> processing of a specific queue for all those ports to be processed in the 
> same PMD thread.


Hi All. IIUC, the lack of thread safety in netdev_rxq_recv calls to
these related rxqs is the root of the issue.

You mentioned the packets are already on a per-port/per-queue data
structure which seems good, but it's not thread safe. Can it be made
thread safe ?

> That PMD thread will loop over all of them (by its poll_list). For each it 
> calls netdev_rxq_recv().
> Under the hood, we do the above (reading a burst from HW only for the ESW 
> manager, classifying and returning the classified burst).
> 
> For the scheduling (just for a reference, in our downstream code) we did the 
> scheduling in 2 phases (change in sched_numa_list_schedule):
> The first iteration skips representor ports. Only ESW manager ports are 
> scheduled (summing up cycles if needed for itself and its representors). The 
> scheduled RXQs are kept in a list.
> The 2nd iteration schedules the representor ports. They are not scheduled 
> according to any algorithm but only get the scheduled PMD from the one of 
> their ESW manager (with the help of the list from the first iteration).
> 
> This is tailored for DOCA mode. As part of the effort, we want to upstream 
> DOCA support we wanted a more generic support.
> 

rxq_scheduling() seems like the wrong layer to be trying to consider
netdev specific thread safety requirements. In terms of the actual
grouping scheme, I don't think it's something really generic that other
netdevs would likely use. So baking a specific scheme into
rxq_scheduling feels a bit dubious.

Another issue worth to mention, is that the 'cycles' and 'roundrobin'
algorithms spread the number of rxqs evenly (as possible) to the
available cores, so that would conflict with this type of grouping from
a user perspective.

thanks,
Kevin.

>>
>>>
>>> This new grouping logic must also respect existing scheduling
>>> algorithms like ‘cycles’. For example, if ‘cycles’ is used, the
>>> scheduler would need to base its decision on the sum of cycles for all RX
>> queues within that group.
>>>
>>> For this, we think we need some kind of netdev API that tells the
>>> rxq_scheduling() function which port-queues belong to a group. Once
>>> this group is known, the algorithm can perform the proper calculation on the
>> aggregated group.
>>>
>>> Does this approach sound reasonable? We are very open to other ideas
>>> on how to discover these related queues.
>>>
>>> Kevin, I’ve copied you in, as you did most of the existing
>>> implementation, so any feedback is appreciated.
>>>
>>> Cheers,
>>> Eelco

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] PMD Scheduling: Grouping RX Queues from Related Ports

Reply via email to