On 1/28/26 08:03, Jason Wang wrote:
> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
> <[email protected]> wrote:
>>
>> On 1/23/26 10:54, Simon Schippers wrote:
>>> On 1/23/26 04:05, Jason Wang wrote:
>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <[email protected]> wrote:
>>>>>
>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> On 1/9/26 07:02, Jason Wang wrote:
>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>>>>>>>> space in the underlying ptr_ring.
>>>>>>>>>>
>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>>>>>>>> in an upcoming commit.
>>>>>>>>>>
>>>>>>>>>> Co-developed-by: Tim Gebauer <[email protected]>
>>>>>>>>>> Signed-off-by: Tim Gebauer <[email protected]>
>>>>>>>>>> Signed-off-by: Simon Schippers <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
>>>>>>>>>> --- a/drivers/net/tap.c
>>>>>>>>>> +++ b/drivers/net/tap.c
>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>>>>>         return ret ? ret : total;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>>>>>>>> +{
>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>> +       void *ptr;
>>>>>>>>>> +
>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>> +
>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
>>>>>>>>>> +               rcu_read_unlock();
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
>>>>>>>>>> +
>>>>>>>>>> +       return ptr;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>                            struct iov_iter *to,
>>>>>>>>>>                            int noblock, struct sk_buff *skb)
>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>                                 TASK_INTERRUPTIBLE);
>>>>>>>>>>
>>>>>>>>>>         /* Read frames from the queue */
>>>>>>>>>> -       skb = ptr_ring_consume(&q->ring);
>>>>>>>>>> +       skb = tap_ring_consume(q);
>>>>>>>>>>         if (skb)
>>>>>>>>>>                 break;
>>>>>>>>>>         if (noblock) {
>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
>>>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>>>>>         return total;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>>>>>>>> +{
>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>> +       void *ptr;
>>>>>>>>>> +
>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>> +
>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>
>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>>>>>>>> another call to tweak the current API.
>>>>>>>>>
>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
>>>>>>>>>
>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>>>>>>>> I'm not sure is what we want.
>>>>>>>>
>>>>>>>> What else would you suggest calling to wake the queue?
>>>>>>>
>>>>>>> I don't have a good method in my mind, just want to point out its
>>>>>>> implications.
>>>>>>
>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
>>>>>> aspect.
>>>>>>
>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
>>>>>> XDP bit)? In the latter case, my concern is that this information may
>>>>>> already be significantly outdated by the time it is used.
>>>>>>
>>>>>> Based on that, my idea would be for the consumer to wake the producer by
>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
>>>>>> smp_call_function_single().
>>>>>> Is this a reasonable approach?
>>>>>
>>>>> I'm not sure but it would introduce costs like IPI.
>>>>>
>>>>>>
>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
>>>>>> considered a deal-breaker for the patch set?
>>>>>
>>>>> It depends on whether or not it has effects on the performance.
>>>>> Especially when vhost is pinned.
>>>>
>>>> I meant we can benchmark to see the impact. For example, pin vhost to
>>>> a specific CPU and then try to see the impact of the TX_SOFTIRQ.
>>>>
>>>> Thanks
>>>>
>>>
>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
>>> for both the stock and patched versions. The benchmarks were run with
>>> the full patch series applied, since testing only patches 1-3 would not
>>> be meaningful - the queue is never stopped in that case, so no
>>> TX_SOFTIRQ is triggered.
>>>
>>> Compared to the non-pinned CPU benchmarks in the cover letter,
>>> performance is lower for pktgen with a single thread but higher with
>>> four threads.
>>> The results show no regression for the patched version, with even
>>> slight performance improvements observed:
>>>
>>> +-------------------------+-----------+----------------+
>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>> | 100M packets            |           |                |
>>> | vhost pinned to core 0  |           |                |
>>> +-----------+-------------+-----------+----------------+
>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
>>> | +         +-------------+-----------+----------------+
>>> | vhost-net | Lost        | 1154 Kpps | 0              |
>>> +-----------+-------------+-----------+----------------+
>>>
>>> +-------------------------+-----------+----------------+
>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>> | 100M packets            |           |                |
>>> | vhost pinned to core 0  |           |                |
>>> | *4 threads*             |           |                |
>>> +-----------+-------------+-----------+----------------+
>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
>>> | +         +-------------+-----------+----------------+
>>> | vhost-net | Lost        | 1527 Kpps | 0              |
>>> +-----------+-------------+-----------+----------------+
>
> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
> the guest or an xdp program that did XDP_DROP in the guest.

I forgot to mention that these PPS values are per thread. So overall we
have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 326 Kpps, respectively.
For packet loss, that comes out to 1154 Kpps * 4 = 4616 Kpps and 0,
respectively. Sorry about that! The pktgen benchmarks with a single
thread look fine, right? I'll still look into using an XDP program that
does XDP_DROP in the guest (see the sketch below). Thanks!
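
For reference, this is roughly the kind of program I would load on the
guest's interface - only a minimal sketch, and the file and function
names below are arbitrary:

// SPDX-License-Identifier: GPL-2.0
/* xdp_drop.c: minimal XDP program that drops every received packet. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop_all(struct xdp_md *ctx)
{
        return XDP_DROP;
}

char _license[] SEC("license") = "GPL";

Built with "clang -O2 -g -target bpf -c xdp_drop.c -o xdp_drop.o" and
attached in the guest with "ip link set dev <iface> xdp obj xdp_drop.o
sec xdp".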
>
>>>
>>> +------------------------+-------------+----------------+
>>> | iperf3 TCP benchmarks  | Stock       | Patched with   |
>>> | to Debian VM 120s      |             | fq_codel qdisc |
>>> | vhost pinned to core 0 |             |                |
>>> +------------------------+-------------+----------------+
>>> | TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
>>> | +                      |             |                |
>>> | vhost-net              |             |                |
>>> +------------------------+-------------+----------------+
>>>
>>> +---------------------------+-------------+----------------+
>>> | iperf3 TCP benchmarks     | Stock       | Patched with   |
>>> | to Debian VM 120s         |             | fq_codel qdisc |
>>> | vhost pinned to core 0    |             |                |
>>> | *4 iperf3 client threads* |             |                |
>>> +---------------------------+-------------+----------------+
>>> | TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
>>> | +                         |             |                |
>>> | vhost-net                 |             |                |
>>> +---------------------------+-------------+----------------+
>>
>> What are your thoughts on this?
>>
>> Thanks!
>>
>>
>
> Thanks
>
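
Coming back to the earlier question in this thread about waking the
queue on the CPU that produced the packets: in case it is useful for
comparison, this is roughly what I had in mind. It is only a sketch and
none of it is part of the series - tun_wake_queue(), the xmit_cpu field
and recording smp_processor_id() in tun_net_xmit() are all hypothetical:

/* Hypothetical: tun_net_xmit() would record smp_processor_id() in a
 * new tfile->xmit_cpu field so the consumer knows where the producer
 * last ran. Runs in IPI context on that CPU.
 */
static void tun_wake_queue(void *data)
{
        struct tun_file *tfile = data;
        struct tun_struct *tun;

        rcu_read_lock();
        tun = rcu_dereference(tfile->tun);
        if (tun)
                netif_wake_subqueue(tun->dev, tfile->queue_index);
        rcu_read_unlock();
}

/* Consumer side, instead of calling netif_wake_subqueue() locally: */
static void tun_wake_producer(struct tun_file *tfile)
{
        smp_call_function_single(READ_ONCE(tfile->xmit_cpu),
                                 tun_wake_queue, tfile, false);
}

As noted earlier in the thread, the downside of this variant is the
cost of the IPI itself.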

