On Wed, 2016-11-30 at 12:38 +0100, Jesper Dangaard Brouer wrote:
> I've played with a somewhat similar patch (from Achiad Shochat) for
> mlx5 (attached).  While it gives huge improvements, the problem I ran
> into was that; TX performance became a function of the TX completion
> time/interrupt and could easily be throttled if configured too
> high/slow.
> 
> Can your patch be affected by this too?

Like all TX business, you should know this Jesper.

No need to constantly remind us something very well known.

> 
> Adjustable via:
>  ethtool -C mlx5p2 tx-usecs 16 tx-frames 32
>  

Generally speaking these settings impact TX, throughput/latencies.

Since NIC IRQ then trigger softirqs, we already know that IRQ handling
is critical, and some tuning can help, depending on particular load.

Now the trick is when you want a generic kernel, being deployed on hosts
shared by millions of jobs.


> 
> On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.duma...@gmail.com> 
> wrote:
> 
> > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> > not used.
> > 
> > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt 
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      9.50 5707925.60      0.99 585285.69      0.00    
> >   0.00      0.50
> > lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt 
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      2.40 9985214.90      0.31 1023874.60      0.00   
> >    0.00      0.50
> 
> These +75% number is pktgen without "burst", and definitely show that
> your patch activate xmit_more.
> What is the pps performance number when using pktgen "burst" option?

About the same really. About all packets now get the xmit_more effect.

> 
> 
> > And about 11 % improvement on an mono-flow UDP_STREAM test.
> > 
> > skb_set_owner_w() is now the most consuming function.
> > 
> > 
> > lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> > [1] 13696
> > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      9.00 1307380.50      1.00 308356.18      0.00    
> >   0.00      0.50
> > lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      3.10 1459558.20      0.44 344267.57      0.00    
> >   0.00      0.50
> 
> The +11% number seems consistent with my perf observations that approx
> 12% was "fakely" spend on the xmit spin_lock.
> 
> 
> [...]
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c 
> > b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 4b597dca5c52..affebb435679 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> [...]
> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring 
> > *ring)
> >  {
> > -   return ring->prod - ring->cons > ring->full_size;
> > +   return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> >  }
> [...]
> 
> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct 
> > net_device *dev)
> >     }
> >     send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> >  
> > +   /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > +    * will happen shortly.
> > +    */
> > +   if (send_doorbell &&
> > +       dev->doorbell_opt &&
> > +       (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> 
> It would be nice with a function call with an appropriate name, instead
> of an open-coded queue size check.  I'm also confused by the "ncons" name.
> 
> > +           send_doorbell = false;
> > +
> [...]
> 
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h 
> > b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > index 574bcbb1b38f..c3fd0deda198 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> >      */
> >     u32                     last_nr_txbb;
> >     u32                     cons;
> > +   u32                     ncons;
> 
> Maybe we can find a better name than "ncons" ?

Thats because 'cons' in this driver really means 'old cons' 

and new cons = old cons + last_nr_txbb;


> 
> >     unsigned long           wake_queue;
> >     struct netdev_queue     *tx_queue;
> >     u32                     (*free_tx_desc)(struct mlx4_en_priv *priv,
> > @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
> >  
> >     /* cache line used and dirtied in mlx4_en_xmit() */
> >     u32                     prod ____cacheline_aligned_in_smp;
> > +   u32                     prod_bell;
> 
> Good descriptive variable name.
> 
> >     unsigned int            tx_dropped;
> >     unsigned long           bytes;
> >     unsigned long           packets;
> 
> 


Reply via email to