On 30/10/2025 18:26, Eli Britstein wrote:
>
>
>> -----Original Message-----
>> From: Kevin Traynor <[email protected]>
>> Sent: Thursday, 30 October 2025 18:17
>> To: Eli Britstein <[email protected]>; Ilya Maximets <[email protected]>;
>> Eelco Chaudron <[email protected]>; [email protected]
>> Cc: Simon Horman <[email protected]>; Maor Dickman
>> <[email protected]>; Gaetan Rivet <[email protected]>
>> Subject: Re: [ovs-dev] PMD Scheduling: Grouping RX Queues from Related Ports
>>
>> On 29/10/2025 16:44, Eli Britstein wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Kevin Traynor <[email protected]>
>>>> Sent: Wednesday, 29 October 2025 18:28
>>>> To: Eli Britstein <[email protected]>; Ilya Maximets
>>>> <[email protected]>; Eelco Chaudron <[email protected]>;
>>>> [email protected]
>>>> Cc: Simon Horman <[email protected]>; Maor Dickman
>>>> <[email protected]>; Gaetan Rivet <[email protected]>
>>>> Subject: Re: [ovs-dev] PMD Scheduling: Grouping RX Queues from
>>>> Related Ports
>>>>
>>>> On 26/10/2025 08:24, Eli Britstein wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Kevin Traynor <[email protected]>
>>>>>> Sent: Friday, 24 October 2025 18:40
>>>>>> To: Eli Britstein <[email protected]>; Ilya Maximets
>>>>>> <[email protected]>; Eelco Chaudron <[email protected]>;
>>>>>> [email protected]
>>>>>> Cc: Simon Horman <[email protected]>; Maor Dickman
>>>>>> <[email protected]>
>>>>>> Subject: Re: [ovs-dev] PMD Scheduling: Grouping RX Queues from
>>>>>> Related Ports
>>>>>>
>>>>>> On 23/10/2025 15:08, Eli Britstein wrote:
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Ilya Maximets <[email protected]>
>>>>>>>> Sent: Thursday, 23 October 2025 15:30
>>>>>>>> To: Eelco Chaudron <[email protected]>; [email protected];
>>>>>>>> Kevin Traynor <[email protected]>
>>>>>>>> Cc: Eli Britstein <[email protected]>; Simon Horman
>>>>>>>> <[email protected]>; Maor Dickman <[email protected]>;
>>>>>>>> [email protected]
>>>>>>>> Subject: Re: [ovs-dev] PMD Scheduling: Grouping RX Queues from
>>>>>>>> Related Ports
>>>>>>>>
>>>>>>>> On 10/23/25 1:48 PM, Eelco Chaudron via dev wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> We’d like to bring a design discussion to the community
>>>>>>>>> regarding a requirement for RX queues from different ports to be
>>>>>>>>> grouped on the same PMD.
>>>>>>>>> We’ve had some initial talks with the NVIDIA team (who are
>>>>>>>>> CC’d), and I think this discussion will benefit from upstream
>>>>>>>>> feedback and involvement.
>>>>>>>>>
>>>>>>>>> Here is the background and context:
>>>>>>>>>
>>>>>>>>> The goal is to automatically (i.e., without user configuration)
>>>>>>>>> group together the same queue IDs from different, but related,
>>>>>>>>> ports. A key use case is an E-Switch manager (e.g., p0) and its
>>>>>>>>> VF representatives (e.g., pf0vf0, pf0vf1).
>>>>>>>>
>>>>>>>> Could you explain why this is a requirement to poll the same
>>>>>>>> queue ID of different, though related, ports by the same thread?
>>>>>>>> It's not obvious.
>>>>>>>> I suspect, in a typical setup with hardware offload most of the
>>>>>>>> ports will be related this way.
>>>>>>> [Eli Britstein]
>>>>>>>
>>>>>>> With DOCA ports, we call rx-burst only for the ESW manager port
>>>>>>> (of a specific queue). In the same burst we get packets from this
>>>>>>> port (e.g. p0) as well as from all its representors (pf0vf0,
>>>>>>> pf0vf1, etc.).
>>>>>>> The HW is configured to set the mark field as metadata carrying the
>>>>>>> port-id of that packet.
>>>>>>> Then we go over this burst and classify the packets into a per-port
>>>>>>> (of that queue #) data structure.
>>>>>>> The OVS model is to call "input" per port. We then return the burst
>>>>>>> from that data structure.
>>>>>>> Since this data structure is not thread safe, it works for us if we
>>>>>>> force a specific queue ID of all those ports to be processed in the
>>>>>>> same PMD thread.
>>>>>>
>>>>>> Hi All. IIUC, the lack of thread safety in netdev_rxq_recv calls to
>>>>>> these related rxqs is the root of the issue.
>>>>>>
>>>>>> You mentioned the packets are already on a per-port/per-queue data
>>>>>> structure which seems good, but it's not thread safe. Can it be
>>>>>> made thread safe?
>>>>>>
>>>>>>> That PMD thread will loop over all of them (by its poll_list). For
>>>>>>> each it calls netdev_rxq_recv().
>>>>>>> Under the hood, we do the above (reading a burst from HW only for
>>>>>>> the ESW manager, classifying and returning the classified burst).
>>>>>>>
>>>>>>> For the scheduling (just for a reference, in our downstream code) we
>>>>>>> did the scheduling in 2 phases (change in sched_numa_list_schedule):
>>>>>>> The first iteration skips representor ports. Only ESW manager ports
>>>>>>> are scheduled (summing up cycles, if needed, for the ESW port itself
>>>>>>> and its representors). The scheduled RXQs are kept in a list.
>>>>>>> The 2nd iteration schedules the representor ports. They are not
>>>>>>> scheduled according to any algorithm but simply get the PMD of their
>>>>>>> ESW manager (with the help of the list from the first iteration).
>>>>>>>
>>>>>>> This is tailored for DOCA mode. As part of the effort to upstream
>>>>>>> DOCA support, we wanted a more generic approach.
>>>>>>>
>>>>>>
>>>>>> rxq_scheduling() seems like the wrong layer to be trying to
>>>>>> consider netdev specific thread safety requirements. In terms of
>>>>>> the actual grouping scheme, I don't think it's something really
>>>>>> generic that other netdevs would likely use. So baking a specific
>>>>>> scheme into rxq_scheduling feels a bit dubious.
>>>>>>
>>>>>> Another issue worth mentioning is that the 'cycles' and 'roundrobin'
>>>>>> algorithms spread the rxqs as evenly as possible across the available
>>>>>> cores, so that would conflict with this type of grouping from a user
>>>>>> perspective.
>>>>> [Eli Britstein]
>>>>> Making the data structure thread safe still has gaps.
>>>>> Yes, it will allow scheduling each rxq independently (at the cost of
>>>>> complexity, memory consumption and perhaps performance).
>>>>> However, bad scheduling can occur.
>>>>> For example, consider a scenario in which the ESW port itself doesn't
>>>>> get packets, but its representors do. An ESW rxq can be scheduled on a
>>>>> PMD with another very busy port.
>>>>> Since the representors depend on RX of the ESW port to get packets to
>>>>> process, the result is their starvation. In turn, they will consume
>>>>> fewer cycles and could be scheduled even worse.
>>>>>
>>>>
>>>> Unless an rxq is pinned and the core is isolated, this can happen for
>>>> an rxq on any PMD core at any time. Even if the representor queues are
>>>> on the same PMD core, there is nothing to stop rxqs from the same or
>>>> different ports delaying processing of the ESW rxq. That is just the
>>>> nature of multiple rxqs sharing a PMD core.
>>> [Eli Britstein]
>>> If they are scheduled as a group, it means the same thread is doing
>>> their RX and processing. All packets read from the HW are handled in
>>> the same thread loop.
>>> There won't be any new packets read from HW until all already-read
>>> packets (from all related ports) are processed.
>>> The point is that the thread that does the RX from HW and classifies
>>> the batch into the data structure is the bottleneck for the other
>>> ports. If that RXQ gets fewer cycles, the related RXQs of the
>>> representors are affected, even if they are scheduled on a less busy
>>> thread.
>>> I can't see how such starvation can happen this way.
>>> Am I missing something?
>>>
>>
>> It might very well be me that's missing something. I'll try and explain
>> better so we can identify any differences in understanding.
>>
>> The only issue I see with having the representor rxqs on different or the
>> same pmd cores as the hw rxq is the need for thread safety when polling
>> those rxqs.
>>
>> If the dependency is that hw rxq polling is needed to fill packets for
>> the representor rxqs, then this will still be a dependency regardless of
>> which core the rxqs are polled by.
>>
>> That means that if the hw rxq is not polled in time, then it will be a
>> bottleneck for receiving packets on the representor rxqs.
>>
>> The scenario where that bottleneck could occur is if there are other busy
>> rxqs being polled by the same pmd thread core as the hw rxq, so it
>> doesn't get enough cycles to run on time. Without user pinning/isolation
>> etc. any rxq from any port may be polled by the same pmd core as the hw
>> rxq and cause this issue.
>>
>> If that occurs, it is a lack of resources/load balance issue, but I see
>> that the same issue could occur whether the representor rxqs are polled
>> by the same pmd thread core as the hw rxq or not.
> [Eli Britstein]
>
> Thanks for the explanation. I'll try to clarify more:
> Suppose we have the thread safety. The issue I point at is indeed a
> scheduling issue, but one that is caused by the nature of this netdev.
> Normally, for example for netdev-dpdk, the work of each RXQ is done only
> on its scheduled PMD. Its cycles are calculated to include:
Yes, cycle counting starts here
> 1. netdev_rxq_recv(), e.g. rte_eth_rx_burst()
> 2. dp_netdev_input()
>
> In the case of netdev-doca, the work is split between ESW/REP RXQs:
> 1. netdev_rxq_recv()
> if (ESW port) {
> 1.1 rte_eth_rx_burst()
> 1.2 classify packets to data structure
> }
> 1.3 get packets of current port from data structure
> 2. dp_netdev_input()
>
and cycle counting stops here
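
Just to check that I'm reading the recv flow right, below is how I
picture it in rough C terms. To be clear, this is only my own sketch for
discussion; every name in it (doca_shared_rxq, mark_to_port_idx,
N_RELATED_PORTS, etc.) is a placeholder I invented and not your actual
code:

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define N_RELATED_PORTS 8           /* placeholder */
#define DOCA_BURST_SIZE 32          /* placeholder */

/* Placeholder: recover the index of the related port from the HW mark. */
static int mark_to_port_idx(uint32_t mark) { return mark % N_RELATED_PORTS; }

/* Per (ESW port, queue-id) state shared by the ESW rxq and its
 * representor rxqs. */
struct doca_shared_rxq {
    struct rte_mbuf *pkts[N_RELATED_PORTS][DOCA_BURST_SIZE];
    int n_pkts[N_RELATED_PORTS];
};

static int
doca_like_rxq_recv(struct doca_shared_rxq *shared, int port_idx, bool is_esw,
                   uint16_t esw_dpdk_port, uint16_t qid,
                   struct rte_mbuf **out, int *n_out)
{
    if (is_esw) {
        /* 1.1 One burst from HW covers the ESW port and its representors. */
        struct rte_mbuf *burst[DOCA_BURST_SIZE];
        uint16_t n = rte_eth_rx_burst(esw_dpdk_port, qid, burst,
                                      DOCA_BURST_SIZE);

        /* 1.2 Classify by the HW-set mark into the per-port slots. */
        for (uint16_t i = 0; i < n; i++) {
            int src = mark_to_port_idx(burst[i]->hash.fdir.hi);
            if (shared->n_pkts[src] < DOCA_BURST_SIZE) {
                shared->pkts[src][shared->n_pkts[src]++] = burst[i];
            }
        }
    }

    /* 1.3 Hand back whatever was classified for this port on this queue. */
    *n_out = shared->n_pkts[port_idx];
    memcpy(out, shared->pkts[port_idx], *n_out * sizeof out[0]);
    shared->n_pkts[port_idx] = 0;
    return *n_out ? 0 : EAGAIN;
}
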
> The work required to process REP packets is done partially in the ESW
> port RXQ (1.1-1.2), and the other part in the REP RXQ (1.3, 2).
> The cycles calculated for each port don't represent the actual work required.
> This may lead to incorrect scheduling.
The key point about the cycles calculated is that they capture the cycles
used in polling and processing packets from an rxq. That can then be
used in calculations when assigning it to a pmd core.
In this case polling the ESW port rxq will be doing extra work in terms
of classification, filling the data structure etc., but that is still
work that is required when polling that rxq, and it will be accounted for
in the cycle measurements.
The only issue I see is that the cycles are not captured when there are
no packets on that rxq. So, in that case, if there were no packets on
the ESW rxq, it could be doing work that is not accounted for.
Up till now, we have not accounted for polling when there are no packets
returned. In general, rx from hw is fast, but if that was significant
due to the nature of this netdev, it would indeed need to be accounted for.
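
For reference, my mental model of the per-rxq measurement is roughly the
below. This is a simplified paraphrase of dp_netdev_process_rxq_port()
from memory rather than a verbatim quote of lib/dpif-netdev.c, so treat
the exact calls as approximate:

/* Simplified paraphrase of the per-rxq poll/processing measurement
 * in the pmd loop (not verbatim code). */
cycle_timer_start(&pmd->perf_stats, &timer);
error = netdev_rxq_recv(rxq->rx, &batch, qfill);
if (!error) {
    dp_netdev_input(pmd, &batch, port_no);
    /* Successful poll: recv + processing cycles are attributed to
     * this rxq and later used by the 'cycles' assignment. */
    cycles = cycle_timer_stop(&pmd->perf_stats, &timer);
    dp_netdev_rxq_add_cycles(rxq, RXQ_CYCLES_PROC_CURR, cycles);
} else {
    /* EAGAIN/no packets: the polling time is discarded and not
     * attributed to the rxq. */
    cycle_timer_stop(&pmd->perf_stats, &timer);
}
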
Other than that issue, when it comes to assigning the ESW queue to a
pmd core, the extra work it is doing will be accounted for and it will
be scheduled accordingly.
The pmd-rxq-show will continue to show the cycles that are being
measured as a % of the pmd cycles.
The stats may be a little mysterious for the user if they are not aware
of the cooperation between these rxqs, e.g. they may wonder why polling
the ESW rxq is taking so long when there are so few packets on it, but
that is just due to the nature of this netdev.
>
> The scenario I describe is one where the ESW port doesn't have a lot of
> traffic, but its representors do.
> Since it is low on cycles, it may be scheduled on a PMD thread with
> another busy RXQ.
As long as the rx/classification/data structure filling done for the
representor rxqs is insignificant, or significant and accounted for, it
should be scheduled correctly.
> The representors get a lot of traffic from the HW.
> However, since their ESW RXQ is slow, their packets are not processed
> fast enough, leading their processing to be starved.
> In turn, their calculated cycles get falsely low, which leads to further
> incorrect scheduling.
>
> If both RXQs are scheduled on the same PMD, and the cycles are calculated
> for the group, all of this is avoided.
> The scheduling is done correctly, all packets from all ESW/REP ports are
> processed in the same PMD poll-list loop and no starvation can occur.
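
To make sure I understand what the grouped assignment would mean
concretely, I put together the toy model below. All names, types and
numbers in it are invented purely for illustration; it is not a proposed
change to rxq_scheduling():

#include <inttypes.h>
#include <stdio.h>

/* Toy model only: invented types/names, not OVS code. */
struct toy_pmd { int core_id; uint64_t assigned_cycles; };

static struct toy_pmd *
least_loaded(struct toy_pmd *pmds, int n_pmds)
{
    struct toy_pmd *best = &pmds[0];
    for (int i = 1; i < n_pmds; i++) {
        if (pmds[i].assigned_cycles < best->assigned_cycles) {
            best = &pmds[i];
        }
    }
    return best;
}

int main(void)
{
    struct toy_pmd pmds[] = { { .core_id = 2 }, { .core_id = 4 } };
    uint64_t esw_cycles = 1000;              /* quiet ESW rxq */
    uint64_t rep_cycles[] = { 8000, 6000 };  /* busy representor rxqs */

    /* The group (ESW rxq plus its representor rxqs for the same queue
     * id) is sized by the sum of its measured cycles and the whole
     * group lands on one pmd. */
    uint64_t group = esw_cycles + rep_cycles[0] + rep_cycles[1];
    struct toy_pmd *pmd = least_loaded(pmds, 2);
    pmd->assigned_cycles += group;

    printf("group of 3 rxqs (%" PRIu64 " cycles) -> core %d\n",
           group, pmd->core_id);
    return 0;
}

That captures the "sum the cycles and keep them together" idea, but as
mentioned further down the thread, assigning in groups also reduces the
granularity the scheduler has to balance with.
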
>
>>
>>>>
>>>> If that is the case, the best way to avoid an rxq that is too heavily
>>>> loaded is to add more rxqs to spread its load (assuming that rss will
>>>> be effective).
>>>>
>>>>> In case they are grouped together, and the cycles (for example) used
>>>>> are the sum of the group, this is avoided.
>>>>>
>>>>
>>>> Grouping rxqs, summing their cycles and assigning them to PMD cores as
>>>> a group means there is less granularity in the unit that is being
>>>> assigned. That makes balancing the load across available PMDs less
>>>> effective because where OVS could assign a single rxq to a PMD core
>>>> based on its load, now it can only assign in groups.
>>> [Eli Britstein]
>>> Indeed, but that's how this type of netdev works.
>>>>
>>>>> Another note is that n_rxqs is meaningful to configure only on the
>>>>> ESW port. Representors cannot be independently configured but will
>>>>> get their ESW port's n_rxqs.
>>>>> However, this policy can be enforced in the netdev layer, so no DPIF
>>>>> support is required here.
>>>>>>>
>>>>>> thanks,
>>>>>> Kevin.
>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This new grouping logic must also respect existing scheduling
>>>>>>>>> algorithms like ‘cycles’. For example, if ‘cycles’ is used, the
>>>>>>>>> scheduler would need to base its decision on the sum of cycles
>>>>>>>>> for all RX queues within that group.
>>>>>>>>>
>>>>>>>>> For this, we think we need some kind of netdev API that tells the
>>>>>>>>> rxq_scheduling() function which port-queues belong to a group.
>>>>>>>>> Once this group is known, the algorithm can perform the proper
>>>>>>>>> calculation on the aggregated group.
>>>>>>>>>
>>>>>>>>> Does this approach sound reasonable? We are very open to other
>>>>>>>>> ideas on how to discover these related queues.
>>>>>>>>>
>>>>>>>>> Kevin, I’ve copied you in, as you did most of the existing
>>>>>>>>> implementation, so any feedback is appreciated.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Eelco
>>>>>
>>>
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev