On Thu, Mar 26, 2026 at 11:31 PM Simon Schippers
<[email protected]> wrote:
>
> On 3/26/26 03:41, Jason Wang wrote:
> > On Wed, Mar 25, 2026 at 10:48 PM Simon Schippers
> > <[email protected]> wrote:
> >>
> >> On 3/24/26 11:14, Simon Schippers wrote:
> >>> On 3/24/26 02:47, Jason Wang wrote:
> >>>> On Thu, Mar 12, 2026 at 9:07 PM Simon Schippers
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >>>>> becomes full. Once an entry is successfully produced and the ptr_ring
> >>>>> reaches capacity, the netdev queue is stopped instead of dropping
> >>>>> subsequent packets.
> >>>>>
> >>>>> If producing an entry fails anyway due to a race, tun_net_xmit returns
> >>>>> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
> >>>>> LLTX is enabled and the transmit path operates without the usual
> >>>>> locking.
> >>>>>
> >>>>> The existing __tun_wake_queue() function wakes the netdev queue. Races
> >>>>> between this wakeup and the queue-stop logic could leave the queue
> >>>>> stopped indefinitely. To prevent this, a memory barrier is enforced
> >>>>> (as discussed in a similar implementation in [1]), followed by a recheck
> >>>>> that wakes the queue if space is already available.
> >>>>>
> >>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>>>
> >>>> I wonder if we need a dedicated TUN flag to enable this. With this new
> >>>> flag, we can even prevent TUN from using noqueue (not sure if it's
> >>>> possible or not).
> >>>>
> >>>
> >>> Apart from the slight regressions caused by this patchset, I do not
> >>> see a reason for such a flag.
> >>>
> >>> I have never seen a driver prevent noqueue. For example, you can
> >>> set noqueue on your ethernet interface, and under load you soon get
> >>>
> >>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> >>> dev->name);
> >>>
> >>> followed by -ENETDOWN. This is not prevented, even though it is
> >>> clearly not something a user wants.
> >>>
> >>>>>
> >>>>> Benchmarks:
> >>>>> The benchmarks show a slight regression in raw transmission performance,
> >>>>> though no packets are lost anymore.
> >>>>>
> >>>>> The previously introduced threshold to only wake after the queue stopped
> >>>>> and half of the ring was consumed proved to be a decent choice:
> >>>>> Waking the queue whenever a consume made space in the ring strongly
> >>>>> degrades performance for tap, while waking only when the ring is empty
> >>>>> is too late and also hurts throughput for tap & tap+vhost-net.
> >>>>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
> >>>>> 1/2 was chosen for the sake of simplicity for both tun/tap and
> >>>>> tun/tap+vhost-net.
> >>>>>
> >>>>> Test setup:
> >>>>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
> >>>>> Average over 20 runs @ 100,000,000 packets. SRSO and spectre v2
> >>>>> mitigations disabled.
> >>>>>
> >>>>> Note for tap+vhost-net:
> >>>>> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
> >>>>> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
> >>>>>
> >>>>> +--------------------------+--------------+----------------+----------+
> >>>>> | 1 thread | Stock | Patched with | diff |
> >>>>> | sending | | fq_codel qdisc | |
> >>>>> +------------+-------------+--------------+----------------+----------+
> >>>>> | TAP | Transmitted | 1.151 Mpps | 1.139 Mpps | -1.1% |
> >>>>> | +-------------+--------------+----------------+----------+
> >>>>> | | Lost/s | 3.606 Mpps | 0 pps | |
> >>>>> +------------+-------------+--------------+----------------+----------+
> >>>>> | TAP | Transmitted | 3.948 Mpps | 3.738 Mpps | -5.3% |
> >>>>> | +-------------+--------------+----------------+----------+
> >>>>> | +vhost-net | Lost/s | 496.5 Kpps | 0 pps | |
> >>>>> +------------+-------------+--------------+----------------+----------+
> >>>>>
> >>>>> +--------------------------+--------------+----------------+----------+
> >>>>> | 2 threads | Stock | Patched with | diff |
> >>>>> | sending | | fq_codel qdisc | |
> >>>>> +------------+-------------+--------------+----------------+----------+
> >>>>> | TAP | Transmitted | 1.133 Mpps | 1.109 Mpps | -2.1% |
> >>>>> | +-------------+--------------+----------------+----------+
> >>>>> | | Lost/s | 8.269 Mpps | 0 pps | |
> >>>>> +------------+-------------+--------------+----------------+----------+
> >>>>> | TAP | Transmitted | 3.820 Mpps | 3.513 Mpps | -8.0% |
> >>>>> | +-------------+--------------+----------------+----------+
> >>>>> | +vhost-net | Lost/s | 4.961 Mpps | 0 pps | |
> >>>>> +------------+-------------+--------------+----------------+----------+
> >>>>>
> >>>>> [1] Link:
> >>>>> https://lore.kernel.org/all/[email protected]/
> >>>>>
> >>>>> Co-developed-by: Tim Gebauer <[email protected]>
> >>>>> Signed-off-by: Tim Gebauer <[email protected]>
> >>>>> Signed-off-by: Simon Schippers <[email protected]>
> >>>>> ---
> >>>>> drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
> >>>>> 1 file changed, 28 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>> index b86582cc6cb6..9b7daec69acd 100644
> >>>>> --- a/drivers/net/tun.c
> >>>>> +++ b/drivers/net/tun.c
> >>>>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>> struct netdev_queue *queue;
> >>>>> struct tun_file *tfile;
> >>>>> int len = skb->len;
> >>>>> + bool qdisc_present;
> >>>>> + int ret;
> >>>>>
> >>>>> rcu_read_lock();
> >>>>> tfile = rcu_dereference(tun->tfiles[txq]);
> >>>>> @@ -1063,13 +1065,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>
> >>>>> nf_reset_ct(skb);
> >>>>>
> >>>>> - if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >>>>> + queue = netdev_get_tx_queue(dev, txq);
> >>>>> + qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>>>> +
> >>>>> + spin_lock(&tfile->tx_ring.producer_lock);
> >>>>> + ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >>>>> + if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >>>>
> >>>> So, it's possible that the administrator is switching between noqueue
> >>>> and another qdisc. So ptr_ring_produce() can fail here, do we need to
> >>>> check that or not?
> >>>>
> >>>
> >>> Is that what you mean? My thoughts:
> >>>
> >>> Switching from noqueue to some qdisc can cause a
> >>>
> >>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> >>> dev->name);
> >>>
> >>> followed by a return of -ENETDOWN in __dev_queue_xmit().
> >>> This is because tun_net_xmit detects some qdisc with
> >>>
> >>> qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>>
> >>> and returns NETDEV_TX_BUSY even though __dev_queue_xmit() still
> >>> detects noqueue.
> >>>
> >>> I am not sure how to solve this, or whether it has to be solved.
> >>> I do not see a proper way to avoid parallel execution of ndo_start_xmit
> >>> and a qdisc change (dev_graft_qdisc only takes qdisc_skb_head lock).
> >>>
> >>> And from my understanding the veth implementation faces the same issue.
> >>
> >> How about rechecking if a qdisc is connected?
> >> This would avoid -ENETDOWN.
> >>
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index f48dc299e4b2..2731a1a70732 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -4845,10 +4845,17 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >> if (is_list)
> >> rc = NETDEV_TX_OK;
> >> }
> >> + bool qdisc_present = !qdisc_txq_has_no_queue(txq);
> >> HARD_TX_UNLOCK(dev, txq);
> >> if (!skb) /* xmit completed */
> >> goto out;
> >>
> >> + /* Maybe a qdisc was connected in the meantime */
> >> + if (qdisc_present) {
> >> + kfree_skb(skb);
> >> + goto out;
> >> + }
> >> +
> >>                 net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> >>                                      dev->name);
> >> /* NETDEV_TX_BUSY or queue was stopped */
> >>
> >
> > Probably not, and we likely won't hit this warning because the qdisc
> > cannot be changed during ndo_start_xmit().
>
> Okay.
>
> >
> > I meant something like this:
> >
> > 1) set noqueue to tuntap
> > 2) produce packets so tuntap is full
> > 3) set e.g fq_codel to tuntap
> > 4) then we can hit the failure of __ptr_ring_produce()
> >
> > Rethinking this, the code looks just fine.
>
> Yes, in this case it just returns NETDEV_TX_BUSY which is fine with a
> qdisc attached.
>
> >
> >>
> >>>
> >>>
> >>> Switching from some qdisc to noqueue is no problem I think.
> >>>
> >>>>> + netif_tx_stop_queue(queue);
> >>>>> +               /* Avoid races with queue wake-ups in
> >>>>> +                * __tun_wake_queue by waking if space is available
> >>>>> +                * in a re-check. The barrier makes sure that the
> >>>>> +                * stop is visible before we re-check.
> >>>>> +                */
> >>>>> + smp_mb__after_atomic();
> >>>>
> >>>> Let's document which barrier is paired with this.
> >>>>
> >>>
> >>> I am basically copying the (old) logic of veth [1] proposed by
> >>> Jakub Kicinski. I must admit I am not 100% sure what it pairs with.
> >>>
> >>> [1] Link: https://lore.kernel.org/all/[email protected]/
> >
> > So it looks like it implicitly tries to pair with tun_ring_consume():
> >
> > 1) spinlock(consumer_lock)
> > 2) store NULL to ptr_ring // STORE
> > 3) spinunlock(consumer_lock) // RELEASE
> > 4) spinlock(consumer_lock) // ACQURE
> > 5) check empty
> > 6) spinunlock(consumer_lock)
> > 7) netif_wakeup_queue() // test_and_set() which is an RMW
> >
> > RELEASE + ACQUIRE implies a full barrier
>
> Thanks.
>
> >
> > I see several problems
> >
> > 1) Due to batch consumption, we may get spurious wakeups under heavy
> > load (we can try disabling batch consuming to see if it helps).
>
> I assume you mean that the wake in the producer's re-check happens too
> often. But that would only cost slightly more producer CPU, since the
> SOFTIRQ runs on the producer CPU, and it would not slow down the
> consumer?
>
> Why would disabling batch consuming help here?
We could end up with:
1) the consumer wakes up the producer, but the slot has not been cleaned yet
2) the producer is woken up but sees the ring as full, so it needs to drop the packet
This probably defeats the goal of zero packet loss.
> Wouldn't it just decrease the consumer speed?
Not sure, we probably need a benchmark.
>
> Apart from that, I do not see a different way to do this re-check.
> The ring producer can only safely do a !produce_peek (i.e. a check
> for !full).
>
> The normal waking (after consuming half of the ring) should be fine IMO.
>
> > 2) So the barriers don't help but would slow down the consuming
> > 3) Two spinlock acquisitions were used instead of one; this is
> > another reason we will see a performance regression
>
> You are right, I can change it to a single spin_lock. Apart from that
> I do not see how the barriers/locking could be reduced further.
>
> > 4) Tricky code that needs to be understood or at least requires a comment
> > tweak.
> >
> > Note that due to ~IFF_TX_SKB_SHARING, pktgen can't clone skbs, so we
> > may not notice the real degradation.
>
> So run pktgen with pg_set SHARED?
Probably (as a workaround).
> I am pretty sure that the vhost
> thread was always at 100% CPU, so pktgen was not the bottleneck. And
> when I had perf enabled, I always saw that in my patched version it was
> not the creation of skbs that took most of the pktgen CPU but a
> different unnamed function (I assume a waiting function).
Let's try and see.
Thanks
>
>
> Thank you!
>
> >
> >>>
> >>>>> + if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >>>>> + netif_tx_wake_queue(queue);
> >>>>> + }
> >>>>> + spin_unlock(&tfile->tx_ring.producer_lock);
> >>>>> +
> >>>>> + if (ret) {
> >>>>> + /* If a qdisc is attached to our virtual device,
> >>>>> + * returning NETDEV_TX_BUSY is allowed.
> >>>>> + */
> >>>>> + if (qdisc_present) {
> >>>>> + rcu_read_unlock();
> >>>>> + return NETDEV_TX_BUSY;
> >>>>> + }
> >>>>> drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>>> goto drop;
> >>>>> }
> >>>>>
> >>>>> /* dev->lltx requires to do our own update of trans_start */
> >>>>> - queue = netdev_get_tx_queue(dev, txq);
> >>>>> txq_trans_cond_update(queue);
> >>>>>
> >>>>> /* Notify and wake up reader process */
> >>>>> --
> >>>>> 2.43.0
> >>>>>
> >>>>
> >>>> Thanks
> >>>>
> >>
> >
> > Thanks
> >
>