On Fri, Mar 27, 2026 at 09:13:44AM +0800, Jason Wang wrote:
> On Thu, Mar 26, 2026 at 11:31 PM Simon Schippers
> <[email protected]> wrote:
> >
> > On 3/26/26 03:41, Jason Wang wrote:
> > > On Wed, Mar 25, 2026 at 10:48 PM Simon Schippers
> > > <[email protected]> wrote:
> > >>
> > >> On 3/24/26 11:14, Simon Schippers wrote:
> > >>> On 3/24/26 02:47, Jason Wang wrote:
> > >>>> On Thu, Mar 12, 2026 at 9:07 PM Simon Schippers
> > >>>> <[email protected]> wrote:
> > >>>>>
> > >>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> > >>>>> becomes full. Once an entry is successfully produced and the ptr_ring
> > >>>>> reaches capacity, the netdev queue is stopped instead of dropping
> > >>>>> subsequent packets.
> > >>>>>
> > >>>>> If producing an entry fails anyway due to a race, tun_net_xmit returns
> > >>>>> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
> > >>>>> LLTX is enabled and the transmit path operates without the usual
> > >>>>> locking.
> > >>>>>
> > >>>>> The existing __tun_wake_queue() function wakes the netdev queue. Races
> > >>>>> between this wakeup and the queue-stop logic could leave the queue
> > >>>>> stopped indefinitely. To prevent this, a memory barrier is enforced
> > >>>>> (as discussed in a similar implementation in [1]), followed by a recheck
> > >>>>> that wakes the queue if space is already available.
> > >>>>>
> > >>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
> > >>>>
> > >>>> I wonder if we need a dedicated TUN flag to enable this. With this new
> > >>>> flag, we can even prevent TUN from using noqueue (not sure if it's
> > >>>> possible or not).
> > >>>>
> > >>>
> > >>> Except for the slight regressions caused by this patchset, I do not
> > >>> see a reason for such a flag.
> > >>>
> > >>> I have never seen a driver prevent noqueue. For example, you can set
> > >>> noqueue on your ethernet interface, and under load you soon get
> > >>>
> > >>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n", dev->name);
> > >>>
> > >>> followed by -ENETDOWN. And this is not prevented even though it is
> > >>> clearly not something a user wants.
> > >>>
> > >>>>>
> > >>>>> Benchmarks:
> > >>>>> The benchmarks show a slight regression in raw transmission performance,
> > >>>>> though no packets are lost anymore.
> > >>>>>
> > >>>>> The previously introduced threshold to only wake after the queue stopped
> > >>>>> and half of the ring was consumed proved to be a decent choice:
> > >>>>> Waking the queue whenever a consume made space in the ring strongly
> > >>>>> degrades performance for tap, while waking only when the ring is empty
> > >>>>> is too late and also hurts throughput for tap & tap+vhost-net.
> > >>>>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
> > >>>>> 1/2 was chosen for the sake of simplicity for both tun/tap and
> > >>>>> tun/tap+vhost-net.
> > >>>>>
> > >>>>> Test setup:
> > >>>>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
> > >>>>> Average over 20 runs @ 100,000,000 packets. SRSO and spectre v2
> > >>>>> mitigations disabled.
> > >>>>>
> > >>>>> Note for tap+vhost-net:
> > >>>>> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
> > >>>>> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
> > >>>>>
> > >>>>> +--------------------------+--------------+----------------+----------+
> > >>>>> | 1 thread                 | Stock        | Patched with   | diff     |
> > >>>>> | sending                  |              | fq_codel qdisc |          |
> > >>>>> +------------+-------------+--------------+----------------+----------+
> > >>>>> | TAP        | Transmitted | 1.151 Mpps   | 1.139 Mpps     | -1.1%    |
> > >>>>> |            +-------------+--------------+----------------+----------+
> > >>>>> |            | Lost/s      | 3.606 Mpps   | 0 pps          |          |
> > >>>>> +------------+-------------+--------------+----------------+----------+
> > >>>>> | TAP        | Transmitted | 3.948 Mpps   | 3.738 Mpps     | -5.3%    |
> > >>>>> | +vhost-net +-------------+--------------+----------------+----------+
> > >>>>> |            | Lost/s      | 496.5 Kpps   | 0 pps          |          |
> > >>>>> +------------+-------------+--------------+----------------+----------+
> > >>>>>
> > >>>>> +--------------------------+--------------+----------------+----------+
> > >>>>> | 2 threads                | Stock        | Patched with   | diff     |
> > >>>>> | sending                  |              | fq_codel qdisc |          |
> > >>>>> +------------+-------------+--------------+----------------+----------+
> > >>>>> | TAP        | Transmitted | 1.133 Mpps   | 1.109 Mpps     | -2.1%    |
> > >>>>> |            +-------------+--------------+----------------+----------+
> > >>>>> |            | Lost/s      | 8.269 Mpps   | 0 pps          |          |
> > >>>>> +------------+-------------+--------------+----------------+----------+
> > >>>>> | TAP        | Transmitted | 3.820 Mpps   | 3.513 Mpps     | -8.0%    |
> > >>>>> | +vhost-net +-------------+--------------+----------------+----------+
> > >>>>> |            | Lost/s      | 4.961 Mpps   | 0 pps          |          |
> > >>>>> +------------+-------------+--------------+----------------+----------+
> > >>>>>
> > >>>>> [1] Link: https://lore.kernel.org/all/[email protected]/
> > >>>>>
> > >>>>> Co-developed-by: Tim Gebauer <[email protected]>
> > >>>>> Signed-off-by: Tim Gebauer <[email protected]>
> > >>>>> Signed-off-by: Simon Schippers <[email protected]>
> > >>>>> ---
> > >>>>> drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
> > >>>>> 1 file changed, 28 insertions(+), 2 deletions(-)
> > >>>>>
> > >>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > >>>>> index b86582cc6cb6..9b7daec69acd 100644
> > >>>>> --- a/drivers/net/tun.c
> > >>>>> +++ b/drivers/net/tun.c
> > >>>>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> > >>>>> struct netdev_queue *queue;
> > >>>>> struct tun_file *tfile;
> > >>>>> int len = skb->len;
> > >>>>> + bool qdisc_present;
> > >>>>> + int ret;
> > >>>>>
> > >>>>> rcu_read_lock();
> > >>>>> tfile = rcu_dereference(tun->tfiles[txq]);
> > >>>>> @@ -1063,13 +1065,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> > >>>>>
> > >>>>> nf_reset_ct(skb);
> > >>>>>
> > >>>>> - if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> > >>>>> + queue = netdev_get_tx_queue(dev, txq);
> > >>>>> + qdisc_present = !qdisc_txq_has_no_queue(queue);
> > >>>>> +
> > >>>>> + spin_lock(&tfile->tx_ring.producer_lock);
> > >>>>> + ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> > >>>>> + if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> > >>>>
> > >>>> So, it's possible that the administrator is switching between noqueue
> > >>>> and another qdisc. So ptr_ring_produce() can fail here, do we need to
> > >>>> check that or not?
> > >>>>
> > >>>
> > >>> Do you mean that? My thoughts:
> > >>>
> > >>> Switching from noqueue to some qdisc can cause a
> > >>>
> > >>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n", dev->name);
> > >>>
> > >>> followed by a return of -ENETDOWN in __dev_queue_xmit().
> > >>> This is because tun_net_xmit detects some qdisc with
> > >>>
> > >>> qdisc_present = !qdisc_txq_has_no_queue(queue);
> > >>>
> > >>> and returns NETDEV_TX_BUSY even though __dev_queue_xmit() still
> > >>> detected noqueue.
> > >>>
> > >>> I am not sure how to solve this/if this has to be solved.
> > >>> I do not see a proper way to avoid parallel execution of ndo_start_xmit
> > >>> and a qdisc change (dev_graft_qdisc only takes qdisc_skb_head lock).
> > >>>
> > >>> And from my understanding the veth implementation faces the same issue.
> > >>
> > >> How about rechecking if a qdisc is connected?
> > >> This would avoid -ENETDOWN.
> > >>
> > >> diff --git a/net/core/dev.c b/net/core/dev.c
> > >> index f48dc299e4b2..2731a1a70732 100644
> > >> --- a/net/core/dev.c
> > >> +++ b/net/core/dev.c
> > >> @@ -4845,10 +4845,17 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > >> if (is_list)
> > >> rc = NETDEV_TX_OK;
> > >> }
> > >> + bool qdisc_present = !qdisc_txq_has_no_queue(txq);
> > >> HARD_TX_UNLOCK(dev, txq);
> > >> if (!skb) /* xmit completed */
> > >> goto out;
> > >>
> > >> + /* Maybe a qdisc was connected in the meantime */
> > >> + if (qdisc_present) {
> > >> + kfree_skb(skb);
> > >> + goto out;
> > >> + }
> > >> +
> > >> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> > >> dev->name);
> > >> /* NETDEV_TX_BUSY or queue was stopped */
> > >>
> > >
> > > Probably not, and we likely won't hit this warning because the qdisc
> > > cannot be changed during ndo_start_xmit().
> >
> > Okay.
> >
> > >
> > > I meant something like this:
> > >
> > > 1) set noqueue to tuntap
> > > 2) produce packets so tuntap is full
> > > 3) set e.g fq_codel to tuntap
> > > 4) then we can hit the failure of __ptr_ring_produce()
> > >
> > > Rethinking the code, it looks just fine.
> >
> > Yes, in this case it just returns NETDEV_TX_BUSY which is fine with a
> > qdisc attached.
> >
> > >
> > >>
> > >>>
> > >>>
> > >>> Switching from some qdisc to noqueue is no problem I think.
> > >>>
> > >>>>> + netif_tx_stop_queue(queue);
> > >>>>> + /* Avoid races with queue wake-ups in __tun_wake_queue by
> > >>>>> + * waking if space is available in a re-check.
> > >>>>> + * The barrier makes sure that the stop is visible before
> > >>>>> + * we re-check.
> > >>>>> + */
> > >>>>> + smp_mb__after_atomic();
> > >>>>
> > >>>> Let's document which barrier is paired with this.
> > >>>>
> > >>>
> > >>> I am basically copying the (old) logic of veth [1] proposed by
> > >>> Jakub Kicinski. I must admit I am not 100% sure what it pairs with.
> > >>>
> > >>> [1] Link:
> > >>> https://lore.kernel.org/all/[email protected]/
> > >
> > > So it looks like it implicitly tries to pair with tun_ring_consume():
> > >
> > > 1) spinlock(consumer_lock)
> > > 2) store NULL to ptr_ring // STORE
> > > 3) spinunlock(consumer_lock) // RELEASE
> > > 4) spinlock(consumer_lock) // ACQUIRE
> > > 5) check empty
> > > 6) spinunlock(consumer_lock)
> > > 7) netif_tx_wake_queue() // test_and_clear() which is an RMW
> > >
> > > RELEASE + ACQUIRE implies a full barrier
> >
> > Thanks.
> >
> > >
> > > I see several problems
> > >
> > > 1) Due to batch consumption, we may get spurious wakeups under heavy
> > > load (we can try disabling batch consuming to see if it helps).
> >
> > I assume you mean that the wake in the producer's recheck happens too
> > often. But this would just cost slightly more producer CPU, since the
> > SOFTIRQ runs on the producer CPU, and not slow down the consumer?
> >
> > Why would disabling batch consume help here?
>
> We could end up with:
>
> 1) consumer wakes up the producer but the slot is not cleaned
> 2) producer is woken up but sees the ring is full, so it needs to drop
> the packet
>
> This probably defeats the goal of zero packet loss.
If this is rare enough, it might not matter.
> > Wouldn't it just decrease the consumer speed?
>
> Not sure, we probably need a benchmark.
>
> >
> > Apart from that, I do not see a different method to do this recheck.
> > The ring producer is only safely able to do a !produce_peek (i.e. a
> > check for !full).
> >
> > The normal waking (after consuming half of the ring) should be fine IMO.
> >
> > > 2) So the barriers don't help but would slow down the consuming
> > > 3) Two spinlocks were used instead of one; this is another reason we
> > > will see a performance regression
> >
> > You are right, I can change it to a single spin_lock. Apart from that
> > I do not see how the barriers/locking could be reduced further.
> >
> > > 4) Tricky code that needs to be understood or at least requires a comment
> > > tweak.
> > >
> > > Note that due to ~IFF_TX_SKB_SHARING, pktgen can't clone skbs, so we
> > > may not notice the real degradation.
> >
> > So run pktgen with pg_set SHARED?
>
> Probably (as a workaround).
>
> > I am pretty sure that the vhost thread was always at 100% CPU, so
> > pktgen was not the bottleneck. And when I had perf enabled, I always
> > saw that in my patched version most of pktgen's CPU was taken not by
> > SKB creation but by a different unnamed function (I assume it is a
> > waiting function).
>
> Let's try and see.
>
> Thanks
>
> >
> >
> > Thank you!
> >
> > >
> > >>>
> > >>>>> + if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> > >>>>> + netif_tx_wake_queue(queue);
> > >>>>> + }
> > >>>>> + spin_unlock(&tfile->tx_ring.producer_lock);
> > >>>>> +
> > >>>>> + if (ret) {
> > >>>>> + /* If a qdisc is attached to our virtual device,
> > >>>>> + * returning NETDEV_TX_BUSY is allowed.
> > >>>>> + */
> > >>>>> + if (qdisc_present) {
> > >>>>> + rcu_read_unlock();
> > >>>>> + return NETDEV_TX_BUSY;
> > >>>>> + }
> > >>>>> drop_reason = SKB_DROP_REASON_FULL_RING;
> > >>>>> goto drop;
> > >>>>> }
> > >>>>>
> > >>>>> /* dev->lltx requires to do our own update of trans_start */
> > >>>>> - queue = netdev_get_tx_queue(dev, txq);
> > >>>>> txq_trans_cond_update(queue);
> > >>>>>
> > >>>>> /* Notify and wake up reader process */
> > >>>>> --
> > >>>>> 2.43.0
> > >>>>>
> > >>>>
> > >>>> Thanks
> > >>>>
> > >>
> > >
> > > Thanks
> > >
> >