On Wed, 2016-11-30 at 12:38 +0100, Jesper Dangaard Brouer wrote: > I've played with a somewhat similar patch (from Achiad Shochat) for > mlx5 (attached). While it gives huge improvements, the problem I ran > into was that; TX performance became a function of the TX completion > time/interrupt and could easily be throttled if configured too > high/slow. > > Can your patch be affected by this too?
Like all TX business, you should know this Jesper. No need to constantly remind us something very well known. > > Adjustable via: > ethtool -C mlx5p2 tx-usecs 16 tx-frames 32 > Generally speaking these settings impact TX, throughput/latencies. Since NIC IRQ then trigger softirqs, we already know that IRQ handling is critical, and some tuning can help, depending on particular load. Now the trick is when you want a generic kernel, being deployed on hosts shared by millions of jobs. > > On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.duma...@gmail.com> > wrote: > > > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is > > not used. > > > > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt > > lpaa23:~# sar -n DEV 1 10|grep eth0 > [...] > > Average: eth0 9.50 5707925.60 0.99 585285.69 0.00 > > 0.00 0.50 > > lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt > > lpaa23:~# sar -n DEV 1 10|grep eth0 > [...] > > Average: eth0 2.40 9985214.90 0.31 1023874.60 0.00 > > 0.00 0.50 > > These +75% number is pktgen without "burst", and definitely show that > your patch activate xmit_more. > What is the pps performance number when using pktgen "burst" option? About the same really. About all packets now get the xmit_more effect. > > > > And about 11 % improvement on an mono-flow UDP_STREAM test. > > > > skb_set_owner_w() is now the most consuming function. > > > > > > lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 & > > [1] 13696 > > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt > > lpaa23:~# sar -n DEV 1 10|grep eth0 > [...] > > Average: eth0 9.00 1307380.50 1.00 308356.18 0.00 > > 0.00 0.50 > > lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt > > lpaa23:~# sar -n DEV 1 10|grep eth0 > [...] > > Average: eth0 3.10 1459558.20 0.44 344267.57 0.00 > > 0.00 0.50 > > The +11% number seems consistent with my perf observations that approx > 12% was "fakely" spend on the xmit spin_lock. > > > [...] > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c > > b/drivers/net/ethernet/mellanox/mlx4/en_tx.c > > index 4b597dca5c52..affebb435679 100644 > > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c > > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c > [...] > > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring) > > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring > > *ring) > > { > > - return ring->prod - ring->cons > ring->full_size; > > + return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size; > > } > [...] > > > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct > > net_device *dev) > > } > > send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue); > > > > + /* Doorbell avoidance : We can omit doorbell if we know a TX completion > > + * will happen shortly. > > + */ > > + if (send_doorbell && > > + dev->doorbell_opt && > > + (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0) > > It would be nice with a function call with an appropriate name, instead > of an open-coded queue size check. I'm also confused by the "ncons" name. > > > + send_doorbell = false; > > + > [...] > > > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h > > b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h > > index 574bcbb1b38f..c3fd0deda198 100644 > > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h > > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h > > @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring { > > */ > > u32 last_nr_txbb; > > u32 cons; > > + u32 ncons; > > Maybe we can find a better name than "ncons" ? Thats because 'cons' in this driver really means 'old cons' and new cons = old cons + last_nr_txbb; > > > unsigned long wake_queue; > > struct netdev_queue *tx_queue; > > u32 (*free_tx_desc)(struct mlx4_en_priv *priv, > > @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring { > > > > /* cache line used and dirtied in mlx4_en_xmit() */ > > u32 prod ____cacheline_aligned_in_smp; > > + u32 prod_bell; > > Good descriptive variable name. > > > unsigned int tx_dropped; > > unsigned long bytes; > > unsigned long packets; > >