On 3/27/26 02:13, Jason Wang wrote:
> On Thu, Mar 26, 2026 at 11:31 PM Simon Schippers
> <[email protected]> wrote:
>>
>> On 3/26/26 03:41, Jason Wang wrote:
>>> On Wed, Mar 25, 2026 at 10:48 PM Simon Schippers
>>> <[email protected]> wrote:
>>>>
>>>> On 3/24/26 11:14, Simon Schippers wrote:
>>>>> On 3/24/26 02:47, Jason Wang wrote:
>>>>>> On Thu, Mar 12, 2026 at 9:07 PM Simon Schippers
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>>>>>>> becomes full. Once an entry is successfully produced and the ptr_ring
>>>>>>> reaches capacity, the netdev queue is stopped instead of dropping
>>>>>>> subsequent packets.
>>>>>>>
>>>>>>> If producing an entry fails anyway due to a race, tun_net_xmit returns
>>>>>>> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
>>>>>>> LLTX is enabled and the transmit path operates without the usual
>>>>>>> locking.
>>>>>>>
>>>>>>> The existing __tun_wake_queue() function wakes the netdev queue. Races
>>>>>>> between this wakeup and the queue-stop logic could leave the queue
>>>>>>> stopped indefinitely. To prevent this, a memory barrier is enforced
>>>>>>> (as discussed in a similar implementation in [1]), followed by a recheck
>>>>>>> that wakes the queue if space is already available.
>>>>>>>
>>>>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>>>>>
>>>>>> I wonder if we need a dedicated TUN flag to enable this. With this new
>>>>>> flag, we can even prevent TUN from using noqueue (not sure if it's
>>>>>> possible or not).
>>>>>>
>>>>>
>>>>> Apart from the slight regressions caused by this patchset, I do not
>>>>> see a reason for such a flag.
>>>>>
>>>>> I have never seen a driver prevent noqueue. For example, you can
>>>>> set noqueue on your Ethernet interface, and under load you soon get
>>>>>
>>>>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
>>>>> dev->name);
>>>>>
>>>>> followed by -ENETDOWN. And this is not prevented even though it is
>>>>> clearly not something a user wants.
>>>>>
>>>>>>>
>>>>>>> Benchmarks:
>>>>>>> The benchmarks show a slight regression in raw transmission performance,
>>>>>>> though no packets are lost anymore.
>>>>>>>
>>>>>>> The previously introduced threshold to only wake after the queue stopped
>>>>>>> and half of the ring was consumed proved to be a decent choice:
>>>>>>> Waking the queue whenever a consume made space in the ring strongly
>>>>>>> degrades performance for tap, while waking only when the ring is empty
>>>>>>> is too late and also hurts throughput for tap & tap+vhost-net.
>>>>>>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
>>>>>>> 1/2 was chosen for the sake of simplicity for both tun/tap and
>>>>>>> tun/tap+vhost-net.
>>>>>>>
>>>>>>> Test setup:
>>>>>>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
>>>>>>> Average over 20 runs @ 100,000,000 packets. SRSO and spectre v2
>>>>>>> mitigations disabled.
>>>>>>>
>>>>>>> Note for tap+vhost-net:
>>>>>>> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
>>>>>>> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
>>>>>>>
>>>>>>> +--------------------------+--------------+----------------+----------+
>>>>>>> | 1 thread | Stock | Patched with | diff |
>>>>>>> | sending | | fq_codel qdisc | |
>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>> | TAP | Transmitted | 1.151 Mpps | 1.139 Mpps | -1.1% |
>>>>>>> | +-------------+--------------+----------------+----------+
>>>>>>> | | Lost/s | 3.606 Mpps | 0 pps | |
>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>> | TAP | Transmitted | 3.948 Mpps | 3.738 Mpps | -5.3% |
>>>>>>> | +-------------+--------------+----------------+----------+
>>>>>>> | +vhost-net | Lost/s | 496.5 Kpps | 0 pps | |
>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>>
>>>>>>> +--------------------------+--------------+----------------+----------+
>>>>>>> | 2 threads | Stock | Patched with | diff |
>>>>>>> | sending | | fq_codel qdisc | |
>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>> | TAP | Transmitted | 1.133 Mpps | 1.109 Mpps | -2.1% |
>>>>>>> | +-------------+--------------+----------------+----------+
>>>>>>> | | Lost/s | 8.269 Mpps | 0 pps | |
>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>> | TAP | Transmitted | 3.820 Mpps | 3.513 Mpps | -8.0% |
>>>>>>> | +-------------+--------------+----------------+----------+
>>>>>>> | +vhost-net | Lost/s | 4.961 Mpps | 0 pps | |
>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>>
>>>>>>> [1] Link:
>>>>>>> https://lore.kernel.org/all/[email protected]/
>>>>>>>
>>>>>>> Co-developed-by: Tim Gebauer <[email protected]>
>>>>>>> Signed-off-by: Tim Gebauer <[email protected]>
>>>>>>> Signed-off-by: Simon Schippers <[email protected]>
>>>>>>> ---
>>>>>>> drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
>>>>>>> 1 file changed, 28 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>> index b86582cc6cb6..9b7daec69acd 100644
>>>>>>> --- a/drivers/net/tun.c
>>>>>>> +++ b/drivers/net/tun.c
>>>>>>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>> struct netdev_queue *queue;
>>>>>>> struct tun_file *tfile;
>>>>>>> int len = skb->len;
>>>>>>> + bool qdisc_present;
>>>>>>> + int ret;
>>>>>>>
>>>>>>> rcu_read_lock();
>>>>>>> tfile = rcu_dereference(tun->tfiles[txq]);
>>>>>>> @@ -1063,13 +1065,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>>
>>>>>>> nf_reset_ct(skb);
>>>>>>>
>>>>>>> - if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>>>>>>> + queue = netdev_get_tx_queue(dev, txq);
>>>>>>> + qdisc_present = !qdisc_txq_has_no_queue(queue);
>>>>>>> +
>>>>>>> + spin_lock(&tfile->tx_ring.producer_lock);
>>>>>>> + ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>>>>>>> + if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>>>>>>
>>>>>> So, it's possible that the administrator is switching between noqueue
>>>>>> and another qdisc. So ptr_ring_produce() can fail here, do we need to
>>>>>> check that or not?
>>>>>>
>>>>>
>>>>> Do you mean the following? My thoughts:
>>>>>
>>>>> Switching from noqueue to some qdisc can cause a
>>>>>
>>>>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
>>>>> dev->name);
>>>>>
>>>>> followed by a return of -ENETDOWN in __dev_queue_xmit().
>>>>> This is because tun_net_xmit detects some qdisc with
>>>>>
>>>>> qdisc_present = !qdisc_txq_has_no_queue(queue);
>>>>>
>>>>> and returns NETDEV_TX_BUSY even though __dev_queue_xmit() still
>>>>> detected noqueue.
>>>>>
>>>>> I am not sure how to solve this, or whether it has to be solved.
>>>>> I do not see a proper way to avoid parallel execution of ndo_start_xmit
>>>>> and a qdisc change (dev_graft_qdisc only takes qdisc_skb_head lock).
>>>>>
>>>>> And from my understanding the veth implementation faces the same issue.
>>>>
>>>> How about rechecking if a qdisc is connected?
>>>> This would avoid -ENETDOWN.
>>>>
>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>> index f48dc299e4b2..2731a1a70732 100644
>>>> --- a/net/core/dev.c
>>>> +++ b/net/core/dev.c
>>>> @@ -4845,10 +4845,17 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>>>> if (is_list)
>>>> rc = NETDEV_TX_OK;
>>>> }
>>>> + bool qdisc_present = !qdisc_txq_has_no_queue(txq);
>>>> HARD_TX_UNLOCK(dev, txq);
>>>> if (!skb) /* xmit completed */
>>>> goto out;
>>>>
>>>> + /* Maybe a qdisc was connected in the meantime */
>>>> + if (qdisc_present) {
>>>> + kfree_skb(skb);
>>>> + goto out;
>>>> + }
>>>> +
>>>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
>>>> dev->name);
>>>> /* NETDEV_TX_BUSY or queue was stopped */
>>>>
>>>
>>> Probably not, and we likely won't hit this warning because the qdisc
>>> cannot be changed during ndo_start_xmit().
>>
>> Okay.
>>
>>>
>>> I meant something like this:
>>>
>>> 1) set noqueue to tuntap
>>> 2) produce packets so tuntap is full
>>> 3) set e.g fq_codel to tuntap
>>> 4) then we can hit the failure of __ptr_ring_produce()
>>>
>>> Rethinking the code, it looks just fine.
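For reference, the sequence above can be reproduced roughly like this (a sketch only; tap0 is a placeholder device name, and steps 2 and 4 are comments since they depend on a running traffic generator):

```shell
# 1) start with noqueue on the tap device
tc qdisc replace dev tap0 root noqueue
# 2) flood the device until the ptr_ring is full
#    (e.g. with a packet generator bound to tap0)
# 3) switch to fq_codel while the ring is still full
tc qdisc replace dev tap0 root fq_codel
# 4) the next tun_net_xmit() now sees a qdisc and a full ring, so
#    __ptr_ring_produce() fails and NETDEV_TX_BUSY is returned
#    instead of dropping the packet
```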
>>
>> Yes, in this case it just returns NETDEV_TX_BUSY which is fine with a
>> qdisc attached.
>>
>>>
>>>>
>>>>>
>>>>>
>>>>> Switching from some qdisc to noqueue is no problem I think.
>>>>>
>>>>>>> + netif_tx_stop_queue(queue);
>>>>>>> + /* Avoid races with queue wake-ups in __tun_wake_queue by
>>>>>>> + * waking if space is available in a re-check.
>>>>>>> + * The barrier makes sure that the stop is visible before
>>>>>>> + * we re-check.
>>>>>>> + */
>>>>>>> + smp_mb__after_atomic();
>>>>>>
>>>>>> Let's document which barrier is paired with this.
>>>>>>
>>>>>
>>>>> I am basically copying the (old) logic of veth [1] proposed by
>>>>> Jakub Kicinski. I must admit I am not 100% sure what it pairs with.
>>>>>
>>>>> [1] Link: https://lore.kernel.org/all/[email protected]/
>>>
>>> So it looks like it implicitly tries to pair with tun_ring_consume():
>>>
>>> 1) spinlock(consumer_lock)
>>> 2) store NULL to ptr_ring // STORE
>>> 3) spinunlock(consumer_lock) // RELEASE
>>> 4) spinlock(consumer_lock) // ACQUIRE
>>> 5) check empty
>>> 6) spinunlock(consumer_lock)
>>> 7) netif_wakeup_queue() // test_and_set() which is an RMW
>>>
>>> RELEASE + ACQUIRE implies a full barrier
>>
>> Thanks.
>>
>>>
>>> I see several problems
>>>
>>> 1) Due to batch consumption, we may get spurious wakeups under heavy
>>> load (we can try disabling batch consuming to see if it helps).
>>
>> I assume you mean that the wake in the producer's recheck happens too
>> often. But this would just cost slightly more producer CPU, since the
>> SOFTIRQ runs on the producer CPU, and not slow down the consumer?
>>
>> Why would disabling batch consume help here?
>
> We could end up with:
>
> 1) consumer wakes up the producer but the slot is not cleaned
Ah okay. I guess MST will not like this:
From now on I assume that by "batch consumption" you mean that
__ptr_ring_zero_tail() is only called after consuming ring.batch
elements. And you want to disable it so __ptr_ring_zero_tail() is
called after each consumption.
(And I assume that you do not mean ptr_ring_consume_batched() in
vhost_net_buf_produce() as that does not make sense to me.)
I think my current implementation works fine with that:
The consumer only wakes after consuming >= ring.size/2 entries,
and this holds for both TAP and TAP+vhost-net.
And ring.size/2 is always at least one ring.batch (see
__ptr_ring_set_size()), which ensures that __ptr_ring_zero_tail()
is called and the slot is cleaned before waking.
Waking on ptr_ring_empty() also ensures that __ptr_ring_zero_tail()
is called and the slot is cleaned before waking.
> 2) producer is woken up but see the ring is full, so it need to drop the
> packet
>
> This probably defeat the goal of zero packet loss.
No, in that case NETDEV_TX_BUSY is returned (noticeable as a qdisc
requeue), and this happened very rarely in my benchmarks.
Here is the average over 20 runs @ 100,000,000 packets each:
TAP single queue: 23.6 requeues
TAP 2 queues: 35.5 requeues
TAP+vhost-net single queue: 1.9 requeues
TAP+vhost-net 2 queues: 12.1 requeues
>
>> Wouldn't it just decrease the consumer speed?
>
> Not sure, we probably need a benchmark.
>
>>
>> Apart from that I do not see another way to do this recheck.
>> The ring producer is only safely able to do a !produce_peek (i.e. a
>> check for !full).
>>
>> The normal waking (after consuming half of the ring) should be fine IMO.
>>
>>> 2) So the barriers don't help but would slow down the consumer
>>> 3) Two spinlocks were used instead of one, this is another reason we
>>> will see a performance regression
>>
>> You are right, I can change it to a single spin_lock. Apart from that
>> I do not see how the barriers/locking could be reduced further.
>>
>>> 4) Tricky code that needs to be understood or at least requires a comment
>>> tweak.
>>>
>>> Note that due to ~IFF_TX_SKB_SHARING, pktgen can't clone skbs, so we
>>> may not notice the real degradation.
>>
>> So run pktgen with pg_set SHARED?
>
> Probably (as a workaround).
>
Okay, for future benchmarks I will add this command to my benchmark
scripts.
>> I am pretty sure that the vhost thread was always at 100% CPU, so
>> pktgen was not the bottleneck. And when I had perf enabled, I always
>> saw that in my patched version it was not the creation of skbs that
>> took most CPU in pktgen, but a different unnamed function (I assume
>> a waiting function).
>
> Let's try and see.
>
> Thanks
>
>>
>>
>> Thank you!
>>
>>>
>>>>>
>>>>>>> + if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>>>>>>> + netif_tx_wake_queue(queue);
>>>>>>> + }
>>>>>>> + spin_unlock(&tfile->tx_ring.producer_lock);
>>>>>>> +
>>>>>>> + if (ret) {
>>>>>>> + /* If a qdisc is attached to our virtual device,
>>>>>>> + * returning NETDEV_TX_BUSY is allowed.
>>>>>>> + */
>>>>>>> + if (qdisc_present) {
>>>>>>> + rcu_read_unlock();
>>>>>>> + return NETDEV_TX_BUSY;
>>>>>>> + }
>>>>>>> drop_reason = SKB_DROP_REASON_FULL_RING;
>>>>>>> goto drop;
>>>>>>> }
>>>>>>>
>>>>>>> /* dev->lltx requires to do our own update of trans_start */
>>>>>>> - queue = netdev_get_tx_queue(dev, txq);
>>>>>>> txq_trans_cond_update(queue);
>>>>>>>
>>>>>>> /* Notify and wake up reader process */
>>>>>>> --
>>>>>>> 2.43.0
>>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>
>>>
>>> Thanks
>>>
>>
>