On 1/23/26 04:05, Jason Wang wrote:
> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <[email protected]> wrote:
>>
>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
>> <[email protected]> wrote:
>>>
>>> On 1/9/26 07:02, Jason Wang wrote:
>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
>>>> <[email protected]> wrote:
>>>>>
>>>>> On 1/8/26 04:38, Jason Wang wrote:
>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>>>>> space in the underlying ptr_ring.
>>>>>>>
>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>>>>> in an upcoming commit.
>>>>>>>
>>>>>>> Co-developed-by: Tim Gebauer <[email protected]>
>>>>>>> Signed-off-by: Tim Gebauer <[email protected]>
>>>>>>> Signed-off-by: Simon Schippers <[email protected]>
>>>>>>> ---
>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>>> index 1197f245e873..2442cf7ac385 100644
>>>>>>> --- a/drivers/net/tap.c
>>>>>>> +++ b/drivers/net/tap.c
>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>>  	return ret ? ret : total;
>>>>>>>  }
>>>>>>>
>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>>>>> +{
>>>>>>> +	struct ptr_ring *ring = &q->ring;
>>>>>>> +	struct net_device *dev;
>>>>>>> +	void *ptr;
>>>>>>> +
>>>>>>> +	spin_lock(&ring->consumer_lock);
>>>>>>> +
>>>>>>> +	ptr = __ptr_ring_consume(ring);
>>>>>>> +	if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>> +		rcu_read_lock();
>>>>>>> +		dev = rcu_dereference(q->tap)->dev;
>>>>>>> +		netif_wake_subqueue(dev, q->queue_index);
>>>>>>> +		rcu_read_unlock();
>>>>>>> +	}
>>>>>>> +
>>>>>>> +	spin_unlock(&ring->consumer_lock);
>>>>>>> +
>>>>>>> +	return ptr;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>  			   struct iov_iter *to,
>>>>>>>  			   int noblock, struct sk_buff *skb)
>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>  				TASK_INTERRUPTIBLE);
>>>>>>>
>>>>>>>  		/* Read frames from the queue */
>>>>>>> -		skb = ptr_ring_consume(&q->ring);
>>>>>>> +		skb = tap_ring_consume(q);
>>>>>>>  		if (skb)
>>>>>>>  			break;
>>>>>>>  		if (noblock) {
>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>> index 8192740357a0..7148f9a844a4 100644
>>>>>>> --- a/drivers/net/tun.c
>>>>>>> +++ b/drivers/net/tun.c
>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>>  	return total;
>>>>>>>  }
>>>>>>>
>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>>>>> +{
>>>>>>> +	struct ptr_ring *ring = &tfile->tx_ring;
>>>>>>> +	struct net_device *dev;
>>>>>>> +	void *ptr;
>>>>>>> +
>>>>>>> +	spin_lock(&ring->consumer_lock);
>>>>>>> +
>>>>>>> +	ptr = __ptr_ring_consume(ring);
>>>>>>> +	if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>
>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>>>>> another call to tweak the current API.
>>>>>>
>>>>>>> +		rcu_read_lock();
>>>>>>> +		dev = rcu_dereference(tfile->tun)->dev;
>>>>>>> +		netif_wake_subqueue(dev, tfile->queue_index);
>>>>>>
>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>>>>> I'm not sure is what we want.
>>>>>
>>>>> What else would you suggest calling to wake the queue?
>>>>
>>>> I don't have a good method in my mind, just want to point out its
>>>> implications.
>>>
>>> I have to admit I'm a bit stuck at this point, particularly with this
>>> aspect.
>>>
>>> What is the correct way to pass the producer CPU ID to the consumer?
>>> Would it make sense to store smp_processor_id() in the tfile inside
>>> tun_net_xmit(), or should it instead be stored in the skb (similar to
>>> the XDP bit)? In the latter case, my concern is that this information
>>> may already be significantly outdated by the time it is used.
>>>
>>> Based on that, my idea would be for the consumer to wake the producer
>>> by invoking a new function (e.g., tun_wake_queue()) on the producer
>>> CPU via smp_call_function_single().
>>> Is this a reasonable approach?
>>
>> I'm not sure but it would introduce costs like IPI.
>>
>>>
>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
>>> considered a deal-breaker for the patch set?
>>
>> It depends on whether or not it has effects on the performance.
>> Especially when vhost is pinned.
>
> I meant we can benchmark to see the impact. For example, pin vhost to
> a specific CPU and then try to see the impact of the TX_SOFTIRQ.
>
> Thanks
>
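To make the smp_call_function_single() idea above a bit more concrete,
here is a rough sketch of what I had in mind. To be clear, this is
illustration only and not part of this series: the producer_cpu field,
tun_wake_queue() and tun_wake_producer() are made-up names, and the IPI
cost you mentioned would still apply.

/*
 * Hypothetical sketch: the producer records its CPU in tun_net_xmit(),
 * and the consumer wakes the subqueue on that CPU via an IPI instead
 * of waking it locally.
 */

/* Producer side, in tun_net_xmit(): remember the last producer CPU.
 * (tfile->producer_cpu is a made-up field for this sketch.)
 */
WRITE_ONCE(tfile->producer_cpu, smp_processor_id());

/* IPI callback: runs on the producer CPU, so the TX_SOFTIRQ raised by
 * netif_wake_subqueue() is raised there, not on the consumer CPU.
 */
static void tun_wake_queue(void *data)
{
	struct tun_file *tfile = data;
	struct tun_struct *tun;

	rcu_read_lock();
	tun = rcu_dereference(tfile->tun);
	if (tun)
		netif_wake_subqueue(tun->dev, tfile->queue_index);
	rcu_read_unlock();
}

/* Consumer side, called instead of netif_wake_subqueue() directly. */
static void tun_wake_producer(struct tun_file *tfile)
{
	int cpu = READ_ONCE(tfile->producer_cpu);

	/* wait=0 makes this a fire-and-forget IPI. */
	smp_call_function_single(cpu, tun_wake_queue, tfile, 0);
}

Whether avoiding the cross-CPU TX_SOFTIRQ would ever be worth an IPI
per wake-up is exactly what the benchmarks below should help answer.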
I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
for both the stock and the patched version. The benchmarks were run
with the full patch series applied, since testing only patches 1-3
would not be meaningful: the queue is never stopped in that case, so
no TX_SOFTIRQ is triggered.

Compared to the non-pinned benchmarks in the cover letter, performance
is lower for pktgen with a single thread but higher with four threads.
The results show no regression for the patched version; if anything,
there are slight improvements:

+-------------------------+-----------+----------------+
| pktgen benchmarks to    | Stock     | Patched with   |
| Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
| 100M packets            |           |                |
| vhost pinned to core 0  |           |                |
+-----------+-------------+-----------+----------------+
| TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
| +         +-------------+-----------+----------------+
| vhost-net | Lost        | 1154 Kpps | 0              |
+-----------+-------------+-----------+----------------+

+-------------------------+-----------+----------------+
| pktgen benchmarks to    | Stock     | Patched with   |
| Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
| 100M packets            |           |                |
| vhost pinned to core 0  |           |                |
| *4 threads*             |           |                |
+-----------+-------------+-----------+----------------+
| TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
| +         +-------------+-----------+----------------+
| vhost-net | Lost        | 1527 Kpps | 0              |
+-----------+-------------+-----------+----------------+

+------------------------+-------------+----------------+
| iperf3 TCP benchmarks  | Stock       | Patched with   |
| to Debian VM 120s      |             | fq_codel qdisc |
| vhost pinned to core 0 |             |                |
+------------------------+-------------+----------------+
| TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
| +                      |             |                |
| vhost-net              |             |                |
+------------------------+-------------+----------------+

+---------------------------+-------------+----------------+
| iperf3 TCP benchmarks     | Stock       | Patched with   |
| to Debian VM 120s         |             | fq_codel qdisc |
| vhost pinned to core 0    |             |                |
| *4 iperf3 client threads* |             |                |
+---------------------------+-------------+----------------+
| TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
| +                         |             |                |
| vhost-net                 |             |                |
+---------------------------+-------------+----------------+

