Hi Matan,

On Mon, Oct 30, 2017 at 07:47:20PM +0000, Matan Azrad wrote:
> Hi Adrien
> 
> > -----Original Message-----
> > From: Adrien Mazarguil [mailto:adrien.mazarg...@6wind.com]
> > Sent: Monday, October 30, 2017 4:24 PM
> > To: Matan Azrad <ma...@mellanox.com>
> > Cc: dev@dpdk.org; Ophir Munk <ophi...@mellanox.com>
> > Subject: Re: [PATCH v3 6/7] net/mlx4: mitigate Tx path memory barriers
> > 
> > On Mon, Oct 30, 2017 at 10:07:28AM +0000, Matan Azrad wrote:
> > > Replace most of the memory barriers with compiler barriers since they
> > > all target DRAM; this improves code efficiency for systems which
> > > enforce store ordering between different addresses.
> > >
> > > Only the doorbell record store should be protected by a memory barrier
> > > since it targets the PCI memory domain.
> > >
> > > Limit the compiler barrier before the byte count store to systems with
> > > a cache line size smaller than 64B (the TXBB size).
> > >
> > > Signed-off-by: Matan Azrad <ma...@mellanox.com>
> > 
> > This sounds like an interesting performance improvement; can you share the
> > typical or expected amount (percentage/hard numbers) for a given use case
> > as part of the commit log?
> > 
> 
> Yes, it improves performance; I will share numbers.

First, I must admit I thought rte_io_[rw]mb() was really only a renamed
compiler barrier; I understand its purpose better now, thanks.
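
For the archives, here is how I now picture the intended usage, as a rough
sketch with made-up names (fake_sq and friends are not actual mlx4 code):

#include <stdint.h>

#include <rte_atomic.h>
#include <rte_byteorder.h>

/* Hypothetical, minimal stand-in for the real Tx queue objects. */
struct fake_sq {
        uint32_t doorbell_qpn; /* QP number to write to the doorbell. */
        volatile uint32_t *db; /* Doorbell register, PCI memory domain. */
};

static void
post_and_ring(struct fake_sq *sq, volatile uint32_t *wqe,
              uint32_t data, uint32_t owner)
{
        wqe[1] = rte_cpu_to_be_32(data); /* WQE contents, DRAM. */
        /*
         * Orders DRAM stores against each other: a plain compiler
         * barrier on x86 (which already preserves store order to normal
         * memory), a real fence on arm/ppc.
         */
        rte_io_wmb();
        wqe[0] = rte_cpu_to_be_32(owner); /* Ownership bit, still DRAM. */
        /*
         * The doorbell lives in the PCI memory domain: only a full
         * rte_wmb() guarantees the device cannot observe the ring
         * before the WQE contents, whatever the architecture.
         */
        rte_wmb();
        *sq->db = rte_cpu_to_be_32(sq->doorbell_qpn);
}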

(more below.)

> > More comments below.
> > 
> > > ---
> > >  drivers/net/mlx4/mlx4_rxtx.c | 11 ++++++-----
> > >  1 file changed, 6 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> > > index 8ea8851..482c399 100644
> > > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > > +++ b/drivers/net/mlx4/mlx4_rxtx.c
> > > @@ -168,7 +168,7 @@ struct pv {
> > >           /*
> > >            * Make sure we read the CQE after we read the ownership bit.
> > >            */
> > > -         rte_rmb();
> > > +         rte_io_rmb();
> > 
> > OK for this one since the rest of the code should not be run due to the
> > condition (I'm not even sure a compiler barrier is necessary at all
> > here).
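
To make the above concrete for the archives, the pattern under discussion
is roughly the following (hypothetical types and mask, simplified from
memory):

#include <stdint.h>

#include <rte_atomic.h>

/* Hypothetical stand-ins; the real mlx4 CQE layout differs. */
#define FAKE_CQE_OWNER_MASK 0x80

struct fake_cqe {
        uint8_t owner_sr_opcode;
        /* Remaining CQE fields omitted. */
};

static int
cqe_is_ours(volatile struct fake_cqe *cqe, uint32_t cons_index,
            uint32_t cqe_cnt)
{
        /* First load: the ownership bit says whether this CQE is valid. */
        if (!!(cqe->owner_sr_opcode & FAKE_CQE_OWNER_MASK) ^
            !!(cons_index & cqe_cnt))
                return 0; /* Not ours, nothing below executes. */
        /*
         * Only once ownership is confirmed may the rest of the CQE be
         * read; rte_io_rmb() keeps the loads in that order.
         */
        rte_io_rmb();
        return 1;
}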
> > 
> > >  #ifndef NDEBUG
> > >           if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
> > >                        MLX4_CQE_OPCODE_ERROR)) {
> > > @@ -203,7 +203,7 @@ struct pv {
> > >    */
> > >   cq->cons_index = cons_index;
> > >   *cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
> > > - rte_wmb();
> > > + rte_io_wmb();
> > 
> > This one could be removed entirely as well, which is more or less what the
> > move to a compiler barrier does. Nothing in subsequent code depends on
> > this doorbell being written, so it can piggyback on any subsequent
> > rte_wmb().
> 
> Yes, you're right; this code was probably taken from a multi-thread
> implementation.
> > 
> > On the other hand, in my opinion a barrier (compiler or otherwise) might be
> > needed before the doorbell write, to make it clear the write cannot somehow
> > be performed earlier in case something attempts to optimize it.
> > 
> I think we can remove it entirely (the compiler can't optimize the ci_db
> store away since it depends on previous code (cons_index)).

Right; however, you may still run into issues if the compiler determines the
final cons_index value by looking at the loop and decides to store it before
entering/leaving it. That's the kind of problematic optimization I was
thinking of.

The barrier in that sense is just there to assert the order of seemingly
unrelated loads/stores, as in the sketch below.
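
To illustrate the kind of transformation I have in mind (hypothetical code,
not the actual driver):

#include <stdint.h>

#include <rte_atomic.h>
#include <rte_byteorder.h>

#define FAKE_CQ_DB_CI_MASK 0xffffffu /* Hypothetical stand-in. */

static void
update_ci(uint32_t *set_ci_db, uint32_t *cons_index_p, uint32_t nr_done)
{
        uint32_t cons_index = *cons_index_p;
        uint32_t i;

        for (i = 0; i != nr_done; ++i)
                cons_index++; /* Stand-in for per-CQE processing. */
        /*
         * The compiler can see the final value here (cons_index +
         * nr_done) and, since set_ci_db is not volatile, could emit the
         * doorbell store before the "processing" completes. The
         * compiler barrier pins the store after everything above it.
         */
        rte_compiler_barrier();
        *set_ci_db = rte_cpu_to_be_32(cons_index & FAKE_CQ_DB_CI_MASK);
        *cons_index_p = cons_index;
}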

> > >   sq->tail = sq->tail + nr_txbbs;
> > >   /* Update the list of packets posted for transmission. */
> > >   elts_comp -= pkts;
> > > @@ -321,6 +321,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> > >            * control segment.
> > >            */
> > >           if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> > > +#if RTE_CACHE_LINE_SIZE < 64
> > >                   /*
> > >                    * Need a barrier here before writing the byte_count
> > >                    * fields to make sure that all the data is visible
> > > @@ -331,6 +332,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> > >                    * data, and end up sending the wrong data.
> > >                    */
> > >                   rte_io_wmb();
> > > +#endif /* RTE_CACHE_LINE_SIZE */
> > 
> > Interesting one.
> > 
> > >                   dseg->byte_count = byte_count;
> > >           } else {
> > >                   /*
> > > @@ -469,8 +471,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> > >                           break;
> > >                   }
> > >  #endif /* NDEBUG */
> > > -                 /* Need a barrier here before byte count store. */
> > > -                 rte_io_wmb();
> > > +                 /* Never be TXBB aligned, no need compiler barrier. */
> > 
> > The reason there was a barrier here at all was unclear, so if it's really 
> > useless,
> > you don't even need to describe why.
> 
> It is because there is a barrier at the similar stage in the multi-segment
> path. I think it can help future review.

OK.

> > 
> > >                   dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
> > >
> > >                   /* Fill the control parameters for this packet. */
> > > @@ -533,7 +534,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> > >            * setting ownership bit (because HW can start
> > >            * executing as soon as we do).
> > >            */
> > > -         rte_wmb();
> > > +         rte_io_wmb();
> > 
> > This one looks dangerous. A compiler barrier is not strong enough to
> > guarantee the order in which the CPU will execute instructions; it only
> > makes sure what follows the barrier doesn't appear before it in the
> > generated code.
> > 
> As I investigated, I understood that for CPUs which don't preserve store
> order between different addresses (arm, ppc), rte_io_wmb is converted to
> rte_wmb. So for those which preserve it (x86), we just need the right
> order in the compiled code, because all the relevant stores target the
> same memory domain (DRAM) and therefore the actual store order is
> guaranteed too. This is unlike the doorbell store, which is directed to a
> different memory domain (PCI). So the only place which needs rte_wmb() is
> before the doorbell write.

Fair enough, although after re-reading the code I think there's another
issue present since the beginning: neither the ctrl nor the dseg pointer is
volatile, which means there is no guarantee intermediate writes will occur
in the expected order, or even at all, even in the presence of a barrier.

The volatile attribute should be inherited from both struct mlx4_cq and
struct mlx4_sq (buf, db and most if not all other pointers). I think a
separate fixes commit should add it for safety, something like the sketch
below.
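
Field names below are from memory and may not match the actual headers; it
is only meant to show the shape of the fix:

/* Hypothetical excerpt: qualifying the HW-visible pointers as volatile
 * so the compiler cannot elide or reorder accesses made through them. */
struct mlx4_sq {
        volatile uint8_t *buf; /* SQ buffer, read by the device. */
        volatile uint32_t *db; /* Doorbell register, PCI memory domain. */
        uint32_t head; /* Producer index, CPU-private. */
        uint32_t tail; /* Consumer index, CPU-private. */
        /* ... */
};

struct mlx4_cq {
        volatile uint8_t *buf; /* CQE area, written by the device. */
        volatile uint32_t *set_ci_db; /* Consumer index doorbell. */
        uint32_t cons_index; /* CPU-private consumer index. */
        /* ... */
};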

> > Unless the comment above this barrier is wrong, this change may cause
> > hard-to-debug issues down the road; you should drop it.
> > 
> > >           ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
> > >                                         ((sq->head & sq->txbb_cnt) ?
> > >                                                  MLX4_BIT_WQE_OWN : 0));
> > > --
> > > 1.8.3.1
> > >
> > 
> > --
> > Adrien Mazarguil
> > 6WIND
> 
> Thanks!

-- 
Adrien Mazarguil
6WIND
