On 3/27/26 02:13, Jason Wang wrote:
> On Thu, Mar 26, 2026 at 11:31 PM Simon Schippers
> <[email protected]> wrote:
>>
>> On 3/26/26 03:41, Jason Wang wrote:
>>> On Wed, Mar 25, 2026 at 10:48 PM Simon Schippers
>>> <[email protected]> wrote:
>>>>
>>>> On 3/24/26 11:14, Simon Schippers wrote:
>>>>> On 3/24/26 02:47, Jason Wang wrote:
>>>>>> On Thu, Mar 12, 2026 at 9:07 PM Simon Schippers
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>>>>>>> becomes full. Once an entry is successfully produced and the ptr_ring
>>>>>>> reaches capacity, the netdev queue is stopped instead of dropping
>>>>>>> subsequent packets.
>>>>>>>
>>>>>>> If producing an entry fails anyway due to a race, tun_net_xmit returns
>>>>>>> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
>>>>>>> LLTX is enabled and the transmit path operates without the usual
>>>>>>> locking.
>>>>>>>
>>>>>>> The existing __tun_wake_queue() function wakes the netdev queue. Races
>>>>>>> between this wakeup and the queue-stop logic could leave the queue
>>>>>>> stopped indefinitely. To prevent this, a memory barrier is enforced
>>>>>>> (as discussed in a similar implementation in [1]), followed by a recheck
>>>>>>> that wakes the queue if space is already available.
>>>>>>>
>>>>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>>>>>
>>>>>> I wonder if we need a dedicated TUN flag to enable this. With this new
>>>>>> flag, we can even prevent TUN from using noqueue (not sure if it's
>>>>>> possible or not).
>>>>>>
>>>>>
>>>>> Apart from the slight regressions caused by this patchset, I do not
>>>>> see a reason for such a flag.
>>>>>
>>>>> I have never seen a driver prevent noqueue. For example, you can
>>>>> set noqueue on your Ethernet interface, and under load you soon get
>>>>>
>>>>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
>>>>> dev->name);
>>>>>
>>>>> followed by -ENETDOWN. And this is not prevented even though it is
>>>>> clearly not something a user wants.
>>>>>
>>>>>>>
>>>>>>> Benchmarks:
>>>>>>> The benchmarks show a slight regression in raw transmission performance,
>>>>>>> though no packets are lost anymore.
>>>>>>>
>>>>>>> The previously introduced threshold to only wake after the queue stopped
>>>>>>> and half of the ring was consumed proved to be a decent choice:
>>>>>>> Waking the queue whenever a consume made space in the ring strongly
>>>>>>> degrades performance for tap, while waking only when the ring is empty
>>>>>>> is too late and also hurts throughput for tap & tap+vhost-net.
>>>>>>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
>>>>>>> 1/2 was chosen for the sake of simplicity for both tun/tap and
>>>>>>> tun/tap+vhost-net.
>>>>>>>
>>>>>>> Test setup:
>>>>>>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
>>>>>>> Average over 20 runs @ 100,000,000 packets. SRSO and spectre v2
>>>>>>> mitigations disabled.
>>>>>>>
>>>>>>> Note for tap+vhost-net:
>>>>>>> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
>>>>>>> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
>>>>>>>
>>>>>>> +--------------------------+--------------+----------------+----------+
>>>>>>> | 1 thread | Stock | Patched with | diff |
>>>>>>> | sending | | fq_codel qdisc | |
>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>> | TAP | Transmitted | 1.151 Mpps | 1.139 Mpps | -1.1% |
>>>>>>> | +-------------+--------------+----------------+----------+
>>>>>>> | | Lost/s | 3.606 Mpps | 0 pps | |
>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>> | TAP | Transmitted | 3.948 Mpps | 3.738 Mpps | -5.3% |
>>>>>>> | +-------------+--------------+----------------+----------+
>>>>>>> | +vhost-net | Lost/s | 496.5 Kpps | 0 pps | |
>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>>
>>>>>>> +--------------------------+--------------+----------------+----------+
>>>>>>> | 2 threads | Stock | Patched with | diff |
>>>>>>> | sending | | fq_codel qdisc | |
>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>> | TAP | Transmitted | 1.133 Mpps | 1.109 Mpps | -2.1% |
>>>>>>> | +-------------+--------------+----------------+----------+
>>>>>>> | | Lost/s | 8.269 Mpps | 0 pps | |
>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>> | TAP | Transmitted | 3.820 Mpps | 3.513 Mpps | -8.0% |
>>>>>>> | +-------------+--------------+----------------+----------+
>>>>>>> | +vhost-net | Lost/s | 4.961 Mpps | 0 pps | |
>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>>
>>>>>>> [1] Link:
>>>>>>> https://lore.kernel.org/all/[email protected]/
>>>>>>>
>>>>>>> Co-developed-by: Tim Gebauer <[email protected]>
>>>>>>> Signed-off-by: Tim Gebauer <[email protected]>
>>>>>>> Signed-off-by: Simon Schippers <[email protected]>
>>>>>>> ---
>>>>>>> drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
>>>>>>> 1 file changed, 28 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>> index b86582cc6cb6..9b7daec69acd 100644
>>>>>>> --- a/drivers/net/tun.c
>>>>>>> +++ b/drivers/net/tun.c
>>>>>>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>> struct netdev_queue *queue;
>>>>>>> struct tun_file *tfile;
>>>>>>> int len = skb->len;
>>>>>>> + bool qdisc_present;
>>>>>>> + int ret;
>>>>>>>
>>>>>>> rcu_read_lock();
>>>>>>> tfile = rcu_dereference(tun->tfiles[txq]);
>>>>>>> @@ -1063,13 +1065,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>>
>>>>>>> nf_reset_ct(skb);
>>>>>>>
>>>>>>> - if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>>>>>>> + queue = netdev_get_tx_queue(dev, txq);
>>>>>>> + qdisc_present = !qdisc_txq_has_no_queue(queue);
>>>>>>> +
>>>>>>> + spin_lock(&tfile->tx_ring.producer_lock);
>>>>>>> + ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>>>>>>> + if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>>>>>>
>>>>>> So, it's possible that the administrator is switching between noqueue
>>>>>> and another qdisc. So ptr_ring_produce() can fail here, do we need to
>>>>>> check that or not?
>>>>>>
>>>>>
>>>>> Do you mean the following? My thoughts:
>>>>>
>>>>> Switching from noqueue to some qdisc can cause a
>>>>>
>>>>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
>>>>> dev->name);
>>>>>
>>>>> followed by a return of -ENETDOWN in __dev_queue_xmit().
>>>>> This is because tun_net_xmit detects some qdisc with
>>>>>
>>>>> qdisc_present = !qdisc_txq_has_no_queue(queue);
>>>>>
>>>>> and returns NETDEV_TX_BUSY even though __dev_queue_xmit() still
>>>>> detected noqueue.
>>>>>
>>>>> I am not sure how to solve this, or whether it has to be solved.
>>>>> I do not see a proper way to avoid parallel execution of ndo_start_xmit
>>>>> and a qdisc change (dev_graft_qdisc only takes qdisc_skb_head lock).
>>>>>
>>>>> And from my understanding the veth implementation faces the same issue.
>>>>
>>>> How about rechecking if a qdisc is connected?
>>>> This would avoid -ENETDOWN.
>>>>
>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>> index f48dc299e4b2..2731a1a70732 100644
>>>> --- a/net/core/dev.c
>>>> +++ b/net/core/dev.c
>>>> @@ -4845,10 +4845,17 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>>>> if (is_list)
>>>> rc = NETDEV_TX_OK;
>>>> }
>>>> + bool qdisc_present = !qdisc_txq_has_no_queue(txq);
>>>> HARD_TX_UNLOCK(dev, txq);
>>>> if (!skb) /* xmit completed */
>>>> goto out;
>>>>
>>>> + /* Maybe a qdisc was connected in the meantime */
>>>> + if (qdisc_present) {
>>>> + kfree_skb(skb);
>>>> + goto out;
>>>> + }
>>>> +
>>>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
>>>> dev->name);
>>>> /* NETDEV_TX_BUSY or queue was stopped */
>>>>
>>>
>>> Probably not, and we likely won't hit this warning because the qdisc
>>> cannot be changed during ndo_start_xmit().
>>
>> Okay.
>>
>>>
>>> I meant something like this:
>>>
>>> 1) set noqueue to tuntap
>>> 2) produce packets so tuntap is full
>>> 3) set e.g fq_codel to tuntap
>>> 4) then we can hit the failure of __ptr_ring_produce()
>>>
>>> Rethinking the code, it looks just fine.
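For reference, the sequence above can be reproduced roughly like this (a sketch only; tap0 is a placeholder device name, and steps 2 and 4 are comments since they depend on a running traffic generator):

```shell
# 1) start with noqueue on the tap device
tc qdisc replace dev tap0 root noqueue
# 2) flood the device until the ptr_ring is full
#    (e.g. with a packet generator bound to tap0)
# 3) switch to fq_codel while the ring is still full
tc qdisc replace dev tap0 root fq_codel
# 4) the next tun_net_xmit() now sees a qdisc and a full ring, so
#    __ptr_ring_produce() fails and NETDEV_TX_BUSY is returned
#    instead of dropping the packet
```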
>>
>> Yes, in this case it just returns NETDEV_TX_BUSY which is fine with a
>> qdisc attached.
>>
>>>
>>>>
>>>>>
>>>>>
>>>>> Switching from some qdisc to noqueue is no problem I think.
>>>>>
>>>>>>> + netif_tx_stop_queue(queue);
>>>>>>> + /* Avoid races with queue wake-ups in __tun_wake_queue by
>>>>>>> + * waking if space is available in a re-check.
>>>>>>> + * The barrier makes sure that the stop is visible before
>>>>>>> + * we re-check.
>>>>>>> + */
>>>>>>> + smp_mb__after_atomic();
>>>>>>
>>>>>> Let's document which barrier is paired with this.
>>>>>>
>>>>>
>>>>> I am basically copying the (old) logic of veth [1] proposed by
>>>>> Jakub Kicinski. I must admit I am not 100% sure what it pairs with.
>>>>>
>>>>> [1] Link: https://lore.kernel.org/all/[email protected]/
>>>
>>> So it looks like it implicitly tries to pair with tun_ring_consume():
>>>
>>> 1) spinlock(consumer_lock)
>>> 2) store NULL to ptr_ring // STORE
>>> 3) spinunlock(consumer_lock) // RELEASE
>>> 4) spinlock(consumer_lock) // ACQUIRE
>>> 5) check empty
>>> 6) spinunlock(consumer_lock)
>>> 7) netif_wakeup_queue() // test_and_set() which is an RMW
>>>
>>> RELEASE + ACQUIRE implies a full barrier
>>
>> Thanks.
>>
>>>
>>> I see several problems
>>>
>>> 1) Due to batch consumption, we may get spurious wakeups under heavy
>>> load (we can try disabling batch consuming to see if it helps).
>>
>> I assume you mean that the wake in the producer's recheck happens too
>> often. But this would just cost slightly more producer CPU, since the
>> SOFTIRQ runs on the producer CPU, and not slow down the consumer?
>>
>> Why would disabling batch consume help here?
>
> We could end up with:
>
> 1) consumer wakes up the producer but the slot is not cleaned
Ah okay. I guess MST will not like this:
From now on I assume that by "batch consumption" you mean that
__ptr_ring_zero_tail() is only called after consuming ring.batch
elements. And you want to disable it so __ptr_ring_zero_tail() is
called after each consumption.
(And I assume that you do not mean ptr_ring_consume_batched() in
vhost_net_buf_produce() as that does not make sense to me.)
I think my current implementation works fine with that:
The consumer only wakes after consuming >= ring.size/2 entries,
and this holds for both TAP and TAP+vhost-net.
And ring.size/2 is always at least one ring.batch (see
__ptr_ring_set_size()), which ensures that __ptr_ring_zero_tail()
is called and the slot is cleaned before waking.
Waking on ptr_ring_empty() also ensures that __ptr_ring_zero_tail()
is called and the slot is cleaned before waking.
> 2) producer is woken up but see the ring is full, so it need to drop the
> packet
>
> This probably defeat the goal of zero packet loss.
No, in that case NETDEV_TX_BUSY is returned (noticeable as a qdisc
requeue), and this happened very rarely in my benchmarks.
Here is the average over 20 runs @ 100,000,000 packets each:
TAP single queue: 23.6 requeues
TAP 2 queues: 35.5 requeues
TAP+vhost-net single queue: 1.9 requeues
TAP+vhost-net 2 queues: 12.1 requeues
>
>> Wouldn't it just decrease the consumer speed?
>
> Not sure, we probably need a benchmark.
>
>>
>> Apart from that I do not see another way to do this recheck.
>> The ring producer is only safely able to do a !produce_peek (i.e. a
>> check for !full).
>>
>> The normal waking (after consuming half of the ring) should be fine IMO.
>>
>>> 2) So the barriers don't help but would slow down the consumer
>>> 3) Two spinlocks were used instead of one, this is another reason we
>>> will see a performance regression
>>
>> You are right, I can change it to a single spin_lock. Apart from that
>> I do not see how the barriers/locking could be reduced further.
>>
>>> 4) Tricky code that needs to be understood or at least requires a comment
>>> tweak.
>>>
>>> Note that due to ~IFF_TX_SKB_SHARING, pktgen can't clone skbs, so we
>>> may not notice the real degradation.
>>
>> So run pktgen with pg_set SHARED?
>
> Probably (as a workaround).
>
Okay, for future benchmarks I will add this command to my benchmark
scripts.
>> I am pretty sure that the vhost thread was always at 100% CPU, so
>> pktgen was not the bottleneck. And when I had perf enabled, I always
>> saw that in my patched version it was not the creation of skbs that
>> took most CPU in pktgen, but a different unnamed function (I assume
>> a waiting function).
>
> Let's try and see.
>
> Thanks
>
>>
>>
>> Thank you!
>>
>>>
>>>>>
>>>>>>> + if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>>>>>>> + netif_tx_wake_queue(queue);
>>>>>>> + }
>>>>>>> + spin_unlock(&tfile->tx_ring.producer_lock);
>>>>>>> +
>>>>>>> + if (ret) {
>>>>>>> + /* If a qdisc is attached to our virtual device,
>>>>>>> + * returning NETDEV_TX_BUSY is allowed.
>>>>>>> + */
>>>>>>> + if (qdisc_present) {
>>>>>>> + rcu_read_unlock();
>>>>>>> + return NETDEV_TX_BUSY;
>>>>>>> + }
>>>>>>> drop_reason = SKB_DROP_REASON_FULL_RING;
>>>>>>> goto drop;
>>>>>>> }
>>>>>>>
>>>>>>> /* dev->lltx requires to do our own update of trans_start */
>>>>>>> - queue = netdev_get_tx_queue(dev, txq);
>>>>>>> txq_trans_cond_update(queue);
>>>>>>>
>>>>>>> /* Notify and wake up reader process */
>>>>>>> --
>>>>>>> 2.43.0
>>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>
>>>
>>> Thanks
>>>
>>
>