On Fri May 29, 2026 at 6:26 PM CEST, Kevin Traynor wrote:
> On 5/28/26 10:29 AM, Eelco Chaudron wrote:
> >
> >
> > On 27 May 2026, at 16:37, Gaetan Rivet wrote:
> >
> >> On Thu Apr 2, 2026 at 12:41 PM CEST, Kevin Traynor via dev wrote:
> >>> On 4/1/26 1:03 PM, Eelco Chaudron via dev wrote:
> >>>>
> >>>>
> >>>> On 1 Apr 2026, at 13:57, Eelco Chaudron via dev wrote:
> >>>>
> >>>>> This patch adds support for specific PMD thread initialization,
> >>>>> deinitialization, and a callback execution to perform work as
> >>>>> part of the PMD thread loop. This allows hardware offload
> >>>>> providers to handle any specific asynchronous or batching work.
> >>>>>
> >>>>> This patch also adds cycle statistics for the provider-specific
> >>>>> callbacks to the 'ovs-appctl dpif-netdev/pmd-perf-show' command.
> >>>>
> >>>> Bringing back the discussion on the earlier patch between Ilya and
> >>>> Gaetan to this revision :)
> >>>>
> >>>> Ilya:
> >>>> Hi, Eelco. As we talked before, this infrastructure resembles the
> >>>> async work infra that was proposed in the past for the use case of
> >>>> async vhost processing. And I don't see any real use case proposed
> >>>> for it here nor in the RFC, where the question was asked, but not
> >>>> replied.
> >>>>
> >>>> Gaetan:
> >>>>
> >>>
> >>> Hi Gaetan,
> >>>
> >>> A few questions below. I'm not so clear on the DOCA threading
> >>> requirements, so questions may be broad.
> >>>
> >>>> Hi Ilya, Eelco,
> >>>>
> >>>> Thanks for the patch and for the review.
> >>>>
> >>>> The use-case on our side is distributed data-structures in DOCA that
> >>>> requires each participating threads to do maintenance work
> >>>> periodically.
> >>>>
> >>>> Specifically, offload threads will insert offload objects.
> >>>> Those will reserve entries in a map that can be resized. The DOCA
> >>>> implementation requires any thread that owns an entry to perform the
> >>>> work of moving it to the new bucket / space after resize is initiated.
> >>>>
> >>>> This is a pervasive design choice in DOCA, they write most of their
> >>>> APIs assuming participating threads are periodically calling into
> >>>> these maintenance functions.
> >>>>
> >>>
> >>> What is a "participating thread" ? IIUC, the pmd thread passes down the
> >>> flow pattern/action and the offload thread inserts the offload into the
> >>> NIC.
> >>>
> >>> In that case, is it the offload thread that owns the entry ?
> >>>
> >>
> >> Participating threads are any threads that registered to DOCA-flow as
> >> offloading threads. In our case, it means:
> >>
> >>   * The main thread
> >>     --> When probing a port, starting it requires installing
> >>         DOCA offloads to execute RSS in particular, and a few other
> >>         'admin' offloads (optional rate-limiting on VF to avoid
> >>         noisy-neighbors, etc).
> >>
> >>   * The offload thread(s) (in the OVS sense)
> >>     A thread in OVS managing dp-flow offloads asynchronously.
> >>
> >>   * The polling thread(s)
> >>     CT-offload is much simpler and faster than dp-flow offload.
> >>     Executing offload insertion synchronously from the fastpath
> >>     is beneficial.
> >>
> >> In our case, 'participating threads' are any thread owning an offload
> >> queue in DOCA-flow.
> >>
> >> We have a few exceptions for the main thread, mainly that we force all
> >> offload operations to be fully synchronous there: we do not want to
> >> publish a new netdev if its 'admin' offloads have not yet been received
> >> and successfully acknowledged by the hardware, so we force waiting
> >> operations for it: it does not need to do regular upkeep etc.
> >>
> >>>> Some of such work is also time-sensitive, for example the current
> >>>> implementation requires a CT offload thread to receive completions
> >>>> after some hardware initialization. Until this completion is done,
> >>>> the CT offload entry is not fully usable (cannot be queried for
> >>>> activity / counters).
> >>>> We cannot leave batches of CT offload entry waiting for
> >>>> completion, assuming that at some later point, we will eventually
> >>>> re-execute something in our offload provider: it leaves a few stranded
> >>>> connection objects incomplete.
> >>>>
> >>>> This has the result of having hardware execution of a flow with CT
> >>>> actions, but no activity counters: the software datapath then deletes
> >>>> the connection and/or flow due to inactivity.
> >>>>
> >>>
> >>> Can this periodic work be done by the offload thread ? If it is fast
> >>> enough for inserting the offload, then maybe it is fast enough for this.
> >>>
> >>
> >> The PMD thread owns the offload queue. If another thread has to execute
> >> its upkeep work, it means sharing the queue between threads.
> >>
> >>> Some DPDK PMDs use alarms for periodic maintenance work, could they be
> >>> used inside DOCA for this?
> >>>
> >>
> >> Those upkeep functions are exposed by DOCA and part of the DOCA-flow
> >> API. DOCA does not expose an event framework to schedule this kind of
> >> work, it requires DOCA applications to explicitly call those functions.
> >>
> >>> If it needs to be on the PMD thread, is the work significant (i.e. more
> >>> than a few % cpu) and how variable is it ? Could it be added inside the
> >>> call to rte_eth_rx_burst polling ?
> >>>
> >>
> >> It can be significant.
> >> The work is anything requiring the use of the offload queue owned by
> >> this thread. The principle is that the owning thread must execute it.
> >>
> >> Currently, with CT offloads we have:
> >>
> >>   * offload queue polling for HW completion (requests have been
> >>     executed: add / mod / del were executed)
> >>
> >>   * CT-del: A conn was offloaded by PMD 1. The connection either expired
> >>     or another PMD 2 closed it: ct-clean or PMD-2 send a CT-del
> >>     request to PMD-1: PMD-1 must poll for CT-del requests and
> >>     execute them locally.
> >>
> >>   * Offload flush: when a port is deleted, all owning threads must
> >>     process a blocking flush request from the main thread. The main
> >>     thread only proceeds once all participating threads have completed
> >>     their flush.
> >>
> >> Completion is a very lightweight work, but we must execute it.
> >> Generally we do only completion polling as needed: we only clear enough
> >> room in the offload queue for the current batch of requests we want to
> >> enqueue, but we have an issue on idle: some stray completion can
> >> be left in the queue and won't be processed if we rely only on activity.
> >> Currently DOCA-flow does not support leaving the completions until the
> >> port is deleted: they need to be processed.
> >>
> >> CT-del can be significant in some cases. We have a 'rolling-window' case
> >> of constant open + close of short connections, and in this worst case,
> >> CT-del takes ~30% (both local and distant). Some portion of it comes from
> >> CT-del messages, in particular in case of multiple PMDs.
> >>
> >> Offload flush is generally quick, but we must answer the flush message
> >> quickly to block the main thread as little as possible.
> >>
> >> Some of the messages must be handled even if there is no RX-burst: a PMD
> >> that is waiting for reload will need to execute a flush message that it
> >> has received.
> >
> > Hi Gaetan,
> >
> > I guess Kevin is suggesting to hide this work in netdev_doca_rxq_recv(),
> > as it will always be called as long as DOCA ports are present on the
> > PMD. Or are there cases where this is not the case?
> >
> >   dp_netdev_process_rxq_port()
> >     netdev_rxq_recv()
> >       netdev_doca_rxq_recv()
> >
> > Kevin, please confirm.
>
> Yes, that's what I was suggesting. The work is rxq specific and we
> already have an rxq specific call that is called in a loop so why not do
> it there and include the cycles needed for the maintenance work in the
> measured cycles needed for that rxq.
>
> >
> >> I think completions and flushes would be the main issues with the
> >> rx-burst approach.

Hi,

We had an issue with this kind of approach with flush commands.

A PMD can be registered as a DOCA offload thread, in which case it will
receive a blocking flush request on port deletion. This happens even if
that port is not scheduled on that PMD.

The issue arises when the PMD has no netdev-doca rxq scheduled: it is
registered as a DOCA offload thread, but will never process its flush
requests.

A typical example might be on multi-NUMA, where by default 1 PMD is
created per NUMA, and ports are configured with 1 rxq. With a single
NIC, its rxq is configured on the closest PMD, leaving the other one
idle. The idle PMD is still registered as a DOCA offload thread, as
nothing forbids the user from adding a port on its NUMA at a future
time.

In this case, the idle PMD would never enter the right rxq-burst command
to process its offload messages.

All other cases would seem fine, however. I think it almost works,
I just don't have a solid approach for this flush issue.
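
To make the idle-PMD case concrete, here is a toy model in C. It is only
a sketch, not OVS or DOCA code: 'struct toy_pmd', toy_upkeep() and both
toy_iteration_*() functions are made-up names, the rx burst is reduced to
a comment, and upkeep is assumed to consist solely of answering a pending
flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are OVS or DOCA symbols. */
struct toy_pmd {
    bool offload_registered;  /* Registered as a DOCA offload thread. */
    size_t n_doca_rxqs;       /* netdev-doca rxqs scheduled on this PMD. */
    bool flush_pending;       /* Blocking flush request from main thread. */
};

/* Upkeep owned by the PMD: here, only answering a pending flush. */
static void
toy_upkeep(struct toy_pmd *pmd)
{
    assert(pmd != NULL);
    if (pmd->offload_registered && pmd->flush_pending) {
        pmd->flush_pending = false;
    }
}

/* Variant A: upkeep hidden inside the per-rxq receive call. */
static void
toy_iteration_rx_only(struct toy_pmd *pmd)
{
    assert(pmd != NULL);
    for (size_t i = 0; i < pmd->n_doca_rxqs; i++) {
        /* Stand-in for the rx burst on rxq i, followed by upkeep. */
        toy_upkeep(pmd);
    }
}

/* Variant B: upkeep run once per loop iteration, independent of rxqs. */
static void
toy_iteration_with_callback(struct toy_pmd *pmd)
{
    assert(pmd != NULL);
    for (size_t i = 0; i < pmd->n_doca_rxqs; i++) {
        /* Stand-in for the rx burst on rxq i. */
    }
    toy_upkeep(pmd);
}
```

With n_doca_rxqs == 0 and a flush pending, toy_iteration_rx_only()
returns with the flush still pending, while toy_iteration_with_callback()
clears it. With at least one rxq scheduled, both variants clear it.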
