Prevent tail-drop when a qdisc is present and the ptr_ring becomes full: once an entry is successfully produced and the ptr_ring reaches capacity, the netdev queue is stopped instead of dropping subsequent packets.
If producing an entry fails anyway due to a race, tun_net_xmit() returns NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because LLTX is enabled and the transmit path operates without the usual locking.

The existing __tun_wake_queue() function wakes the netdev queue. Races between this wakeup and the queue-stop logic could leave the queue stopped indefinitely. To prevent this, a memory barrier is enforced (as discussed for a similar implementation in [1]), followed by a re-check that wakes the queue if space is already available.

If no qdisc is present, the previous tail-drop behavior is preserved.

Benchmarks:
The benchmarks show a slight regression in raw transmission performance, though no packets are lost anymore. The previously introduced threshold of waking only after the queue was stopped and half of the ring was consumed proved to be a decent choice: waking the queue whenever a consume made space in the ring strongly degrades performance for tap, while waking only when the ring is empty is too late and also hurts throughput for tap and tap+vhost-net. Other ratios (3/4, 7/8) showed similar results (not shown here), so 1/2 was chosen for the sake of simplicity for both tun/tap and tun/tap+vhost-net.

Test setup: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads; average over 20 runs @ 100,000,000 packets. SRSO and Spectre v2 mitigations disabled.
Note for tap+vhost-net: XDP drop program active in VM -> ~2.5x faster; slower for tap due to more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf).

+--------------------------+--------------+----------------+----------+
| 1 thread                 | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 1.151 Mpps   | 1.139 Mpps     | -1.1%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 3.606 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 3.948 Mpps   | 3.738 Mpps     | -5.3%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 496.5 Kpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+

+--------------------------+--------------+----------------+----------+
| 2 threads                | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 1.133 Mpps   | 1.109 Mpps     | -2.1%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 8.269 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 3.820 Mpps   | 3.513 Mpps     | -8.0%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 4.961 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+

[1] Link: https://lore.kernel.org/all/[email protected]/

Co-developed-by: Tim Gebauer <[email protected]>
Signed-off-by: Tim Gebauer <[email protected]>
Signed-off-by: Simon Schippers <[email protected]>
---
 drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b86582cc6cb6..9b7daec69acd 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct netdev_queue *queue;
 	struct tun_file *tfile;
 	int len = skb->len;
+	bool qdisc_present;
+	int ret;
 
 	rcu_read_lock();
 	tfile = rcu_dereference(tun->tfiles[txq]);
@@ -1063,13 +1065,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	nf_reset_ct(skb);
 
-	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
+	queue = netdev_get_tx_queue(dev, txq);
+	qdisc_present = !qdisc_txq_has_no_queue(queue);
+
+	spin_lock(&tfile->tx_ring.producer_lock);
+	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
+	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
+		netif_tx_stop_queue(queue);
+		/* Avoid races with queue wake-ups in __tun_wake_queue by
+		 * waking if space is available in a re-check.
+		 * The barrier makes sure that the stop is visible before
+		 * we re-check.
+		 */
+		smp_mb__after_atomic();
+		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
+			netif_tx_wake_queue(queue);
+	}
+	spin_unlock(&tfile->tx_ring.producer_lock);
+
+	if (ret) {
+		/* If a qdisc is attached to our virtual device,
+		 * returning NETDEV_TX_BUSY is allowed.
+		 */
+		if (qdisc_present) {
+			rcu_read_unlock();
+			return NETDEV_TX_BUSY;
+		}
 		drop_reason = SKB_DROP_REASON_FULL_RING;
 		goto drop;
 	}
 
 	/* dev->lltx requires to do our own update of trans_start */
-	queue = netdev_get_tx_queue(dev, txq);
 	txq_trans_cond_update(queue);
 
 	/* Notify and wake up reader process */
-- 
2.43.0

