On Wed, 2016-11-30 at 15:50 +0200, Saeed Mahameed wrote:
> On Tue, Nov 29, 2016 at 8:58 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> > On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
> >
> >> Not sure if this has been tried before, but the doorbell avoidance could
> >> be done by the driver itself, because it knows a TX completion will come
> >> shortly (well... if softirqs are not delayed too much!)
> >>
> >> Doorbell would be forced only if:
> >>
> >> ( "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
> >> OR
> >> ( too many [1] packets were put in TX ring buffer, no point deferring more )
> >>
> >> Start the pump, but once it is started, let the doorbells be done by
> >> TX completion.
> >>
> >> ndo_start_xmit and the TX completion handler would have to maintain a
> >> shared state describing whether packets were ready but the doorbell
> >> was deferred.
> >>
> >> Note that TX completion means "at least one packet was drained";
> >> otherwise busy polling, constantly calling napi->poll(), would force a
> >> doorbell too soon for devices sharing a NAPI for both RX and TX.
> >>
> >> But then, maybe busy poll would like to force a doorbell...
> >>
> >> I could try these ideas on mlx4 shortly.
> >>
> >> [1] the limit could be derived from the active "ethtool -c" params,
> >> e.g. tx-frames
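To make the shared-state idea concrete: in pseudo-C, the decision would be
something like the sketch below. mlx4_en_should_ring_doorbell and
MLX4_MAX_DEFERRED are made-up names for illustration only; prod_bell would
be the producer index the last doorbell was rung for, ncons the consumer
index advanced by TX completion.

#define MLX4_MAX_DEFERRED 16	/* illustrative; could map to tx-frames */

/* Sketch only, not driver code. */
static bool mlx4_en_should_ring_doorbell(const struct mlx4_en_tx_ring *ring,
					 const struct sk_buff *skb)
{
	/* The stack announced more packets: let it batch, defer. */
	if (skb->xmit_more && !netif_xmit_stopped(ring->tx_queue))
		return false;

	/* "TX engine not started yet": nothing is in flight, so no
	 * TX completion is guaranteed to come and ring for us.
	 */
	if ((s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) <= 0)
		return true;

	/* Too many packets already deferred: no point deferring more. */
	if (READ_ONCE(ring->prod) - READ_ONCE(ring->prod_bell) >
	    MLX4_MAX_DEFERRED)
		return true;

	/* A completion is pending; it will ring the doorbell for us. */
	return false;
}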
> > I have a WIP that increases the pktgen rate by 75% on mlx4 when bulking
> > is not used.
>
> Hi Eric, nice idea indeed, and we need something like this; today we
> almost don't exploit TX bulking at all.
>
> But please see below; I am not sure different contexts should share
> the doorbell ringing, it is really risky.
>
> >  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    2
> >  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |   90 +++++++++++------
> >  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4
> >  include/linux/netdevice.h                    |    1
> >  net/core/net-sysfs.c                         |   18 +++
> >  5 files changed, 83 insertions(+), 32 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > index 6562f78b07f4..fbea83218fc0 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >
> >  	if (polled) {
> >  		if (doorbell_pending)
> > -			mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> > +			mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
> >
> >  		mlx4_cq_set_ci(&cq->mcq);
> >  		wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 4b597dca5c52..affebb435679 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
> >  	ring->size = size;
> >  	ring->size_mask = size - 1;
> >  	ring->sp_stride = stride;
> > -	ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
> > +	ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
> >
> >  	tmp = size * sizeof(struct mlx4_en_tx_info);
> >  	ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
> > @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
> >  	ring->sp_cqn = cq;
> >  	ring->prod = 0;
> >  	ring->cons = 0xffffffff;
> > +	ring->ncons = 0;
> >  	ring->last_nr_txbb = 1;
> >  	memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
> >  	memset(ring->buf, 0, ring->buf_size);
> > @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
> >  			MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
> >  }
> >
> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> >  {
> > -	return ring->prod - ring->cons > ring->full_size;
> > +	return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> >  }
> >
> >  static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
> > @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> >
> >  	/* Skip last polled descriptor */
> >  	ring->cons += ring->last_nr_txbb;
> > +	ring->ncons += ring->last_nr_txbb;
> >  	en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
> >  	       ring->cons, ring->prod);
> >
> > @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> >  					       !!(ring->cons & ring->size), 0,
> >  					       0 /* Non-NAPI caller */);
> >  		ring->cons += ring->last_nr_txbb;
> > +		ring->ncons += ring->last_nr_txbb;
> >  		cnt++;
> >  	}
> >
> > @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> >  	return cnt;
> >  }
> >
> > +void mlx4_en_xmit_doorbell(const struct net_device *dev,
> > +			   struct mlx4_en_tx_ring *ring)
> > +{
> > +
> > +	if (dev->doorbell_opt & 1) {
> > +		u32 oval = READ_ONCE(ring->prod_bell);
> > +		u32 nval = READ_ONCE(ring->prod);
> > +
> > +		if (oval == nval)
> > +			return;
> > +
> > +		/* I can not tell yet if a cmpxchg() is needed or not */
> > +		if (dev->doorbell_opt & 2)
> > +			WRITE_ONCE(ring->prod_bell, nval);
> > +		else
> > +			if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
> > +				return;
> > +	}
> > +	/* Since there is no iowrite*_native() that writes the
> > +	 * value as is, without byteswapping - using the one
> > +	 * the doesn't do byteswapping in the relevant arch
> > +	 * endianness.
> > +	 */
> > +#if defined(__LITTLE_ENDIAN)
> > +	iowrite32(
> > +#else
> > +	iowrite32be(
> > +#endif
> > +		  ring->doorbell_qpn,
> > +		  ring->bf.uar->map + MLX4_SEND_DOORBELL);
> > +}
> > +
> >  static bool mlx4_en_process_tx_cq(struct net_device *dev,
> >  				  struct mlx4_en_cq *cq, int napi_budget)
> >  {
> > @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
> >  	wmb();
> >
> >  	/* we want to dirty this cache line once */
> > -	ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> > -	ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> > +	WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> > +	ring_cons += txbbs_skipped;
> > +	WRITE_ONCE(ring->cons, ring_cons);
> > +	WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> > +
> > +	if (dev->doorbell_opt)
> > +		mlx4_en_xmit_doorbell(dev, ring);
> >
> >  	if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
> >  		return done < budget;
> > @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
> >  	__iowrite64_copy(dst, src, bytecnt / 8);
> >  }
> >
> > -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
> > -{
> > -	wmb();
>
> You missed/removed this wmb() in the new function. Why? Where did
> you compensate for it?

I removed it because I had a cmpxchg() there if the barrier was needed.

My patch is a WIP: you can set bit 2 to ask to replace the cmpxchg()
with a simple write, only for performance testing/comparisons.
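To be explicit, a conservative version of the new helper would simply keep
the wmb() the old one had in front of the MMIO write; whether it is
redundant next to the cmpxchg() is exactly what I want to measure. A
sketch (same fields as in the patch above, cmpxchg-only mode, not the
actual patch):

/* Sketch: deferred doorbell with the old wmb() retained. */
void mlx4_en_xmit_doorbell(const struct net_device *dev,
			   struct mlx4_en_tx_ring *ring)
{
	u32 oval = READ_ONCE(ring->prod_bell);
	u32 nval = READ_ONCE(ring->prod);

	if (oval == nval)
		return;		/* nothing new since the last doorbell */

	/* Claim the update; if another context won the race, it rings. */
	if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
		return;

	/* Make sure all descriptor writes are visible to the device
	 * before the doorbell, as the old helper guaranteed.
	 */
	wmb();
#if defined(__LITTLE_ENDIAN)
	iowrite32(ring->doorbell_qpn,
		  ring->bf.uar->map + MLX4_SEND_DOORBELL);
#else
	iowrite32be(ring->doorbell_qpn,
		    ring->bf.uar->map + MLX4_SEND_DOORBELL);
#endif
}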
> > -	/* Since there is no iowrite*_native() that writes the
> > -	 * value as is, without byteswapping - using the one
> > -	 * the doesn't do byteswapping in the relevant arch
> > -	 * endianness.
> > -	 */
> > -#if defined(__LITTLE_ENDIAN)
> > -	iowrite32(
> > -#else
> > -	iowrite32be(
> > -#endif
> > -		  ring->doorbell_qpn,
> > -		  ring->bf.uar->map + MLX4_SEND_DOORBELL);
> > -}
> >
> >  static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> >  				  struct mlx4_en_tx_desc *tx_desc,
> >  				  union mlx4_wqe_qpn_vlan qpn_vlan,
> >  				  int desc_size, int bf_index,
> >  				  __be32 op_own, bool bf_ok,
> > -				  bool send_doorbell)
> > +				  bool send_doorbell,
> > +				  const struct net_device *dev, int nr_txbb)
> >  {
> >  	tx_desc->ctrl.qpn_vlan = qpn_vlan;
> >
> > @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> >
> >  		wmb();
> >
> > +		ring->prod += nr_txbb;
> >  		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
> >  			     desc_size);
> >
> > @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> >  		 */
> >  		dma_wmb();
> >  		tx_desc->ctrl.owner_opcode = op_own;
> > +		ring->prod += nr_txbb;
> >  		if (send_doorbell)
> > -			mlx4_en_xmit_doorbell(ring);
> > +			mlx4_en_xmit_doorbell(dev, ring);
> >  		else
> >  			ring->xmit_more++;
> >  }
> > @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >  		op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
> >  	}
> >
> > -	ring->prod += nr_txbb;
> > -
> >  	/* If we used a bounce buffer then copy descriptor back into place */
> >  	if (unlikely(bounce))
> >  		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >  	}
> >  	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> >
> > +	/* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > +	 * will happen shortly.
> > +	 */
> > +	if (send_doorbell &&
> > +	    dev->doorbell_opt &&
> > +	    (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
>
> Alexei already expressed his worries about synchronization, and I think
> the problem sits here, in this exact line. What if exactly at this
> point the TX completion handler has just finished and rung the last
> doorbell? You have not written the new TX descriptor yet
> (mlx4_en_tx_write_desc), so the last doorbell from the CQ handler
> missed it.
>
> Even if you wrote the TX descriptor before the doorbell decision here,
> you would need a memory barrier to make sure the HW sees the new
> packet; that barrier was typically done before ringing the doorbell.

My patch is a WIP, meaning it is not completed ;)

Surely we can find a non-racy way to handle this.
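For instance (untested sketch, illustrative helper name): publish the
descriptor and ring->prod first, then re-check whether a completion is
still outstanding before deferring. This narrows the window Saeed
describes, though the cmpxchg() in mlx4_en_xmit_doorbell() is probably
still needed to make it fully safe:

/* Sketch: called from mlx4_en_xmit() after the descriptor and
 * ring->prod updates are done. Illustrative only.
 */
static void mlx4_en_ring_or_defer(struct net_device *dev,
				  struct mlx4_en_tx_ring *ring)
{
	/* Order our prod/descriptor writes against the reads below. */
	smp_mb();

	/* If prod_bell is still ahead of ncons, the completion handler
	 * has outstanding work and will call mlx4_en_xmit_doorbell()
	 * after updating ncons, picking up our new prod.
	 */
	if ((s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
		return;

	/* Otherwise the completion path may already have rung its last
	 * doorbell without seeing our packet: ring it ourselves.
	 */
	mlx4_en_xmit_doorbell(dev, ring);
}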
> All in all, this is risky business :). The right way to go is to force
> the upper layer to use xmit_more and delay doorbells/use bulking, but
> from the same context (the xmit routine). For example, see Achiad's
> suggestion (attached in Jesper's response): he used queue stop to force
> the stack to queue up packets (TX bulking), which would set xmit_more,
> and used the next completion to release the "stopped" ring TXQ rather
> than hit the doorbell on its behalf.

Well, you depend on having a higher-level queue like a qdisc, and some
users do not use a qdisc. If you stop the queue, they can no longer send
anything -> drops.
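To illustrate: without a qdisc there is nowhere to park the skb, so the
qdisc-less path of __dev_queue_xmit() turns a stopped queue into an
immediate drop. Roughly (paraphrased from memory, simplified, not the
exact kernel code):

/* Sketch of the qdisc-less transmit path: a stopped TXQ means the
 * packet is freed, i.e. dropped, since no queue sits above the driver.
 */
static int no_qdisc_xmit_sketch(struct sk_buff *skb,
				struct net_device *dev,
				struct netdev_queue *txq)
{
	int rc = NETDEV_TX_BUSY;

	if (!netif_xmit_stopped(txq))
		skb = dev_hard_start_xmit(skb, dev, txq, &rc);

	if (!dev_xmit_complete(rc)) {
		kfree_skb(skb);	/* no queue above the driver: dropped */
		rc = NET_XMIT_DROP;
	}
	return rc;
}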