Hi Matan, On Mon, Oct 30, 2017 at 07:47:20PM +0000, Matan Azrad wrote: > Hi Adrien > > > -----Original Message----- > > From: Adrien Mazarguil [mailto:adrien.mazarg...@6wind.com] > > Sent: Monday, October 30, 2017 4:24 PM > > To: Matan Azrad <ma...@mellanox.com> > > Cc: dev@dpdk.org; Ophir Munk <ophi...@mellanox.com> > > Subject: Re: [PATCH v3 6/7] net/mlx4: mitigate Tx path memory barriers > > > > On Mon, Oct 30, 2017 at 10:07:28AM +0000, Matan Azrad wrote: > > > Replace most of the memory barriers by compiler barriers since they > > > are all targeted to the DRAM; This improves code efficiency for > > > systems which force store order between different addresses. > > > > > > Only the doorbell record store should be protected by memory barrier > > > since it is targeted to the PCI memory domain. > > > > > > Limit pre byte count store compiler barrier for systems with cache > > > line size smaller than 64B (TXBB size). > > > > > > Signed-off-by: Matan Azrad <ma...@mellanox.com> > > > > This sounds like an interesting performance improvement, can you share the > > typical or expected amount (percentage/hard numbers) for a given use case > > as part of the commit log? > > > > Yes, it improves performance, I will share numbers.
First I must add I thought rte_io_[rw]mb() was really only a renamed compiler barrier, I better understand its purpose now, thanks. (more below.) > > More comments below. > > > > > --- > > > drivers/net/mlx4/mlx4_rxtx.c | 11 ++++++----- > > > 1 file changed, 6 insertions(+), 5 deletions(-) > > > > > > diff --git a/drivers/net/mlx4/mlx4_rxtx.c > > > b/drivers/net/mlx4/mlx4_rxtx.c index 8ea8851..482c399 100644 > > > --- a/drivers/net/mlx4/mlx4_rxtx.c > > > +++ b/drivers/net/mlx4/mlx4_rxtx.c > > > @@ -168,7 +168,7 @@ struct pv { > > > /* > > > * Make sure we read the CQE after we read the ownership > > bit. > > > */ > > > - rte_rmb(); > > > + rte_io_rmb(); > > > > OK for this one since the rest of the code should not be run due to the > > condition (I'm not even sure even a compiler barrier is necessary at all > > here). > > > > > #ifndef NDEBUG > > > if (unlikely((cqe->owner_sr_opcode & > > MLX4_CQE_OPCODE_MASK) == > > > MLX4_CQE_OPCODE_ERROR)) { > > > @@ -203,7 +203,7 @@ struct pv { > > > */ > > > cq->cons_index = cons_index; > > > *cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & > > MLX4_CQ_DB_CI_MASK); > > > - rte_wmb(); > > > + rte_io_wmb(); > > > > This one could be removed entirely as well, which is more or less what the > > move to a compiler barrier does. Nothing in subsequent code depends on > > this doorbell being written, so this can piggy back on any subsequent > > rte_wmb(). > > Yes, you right, probably this code was taken from multi thread implementation. > > > > On the other hand in my opinion a barrier (compiler or otherwise) might be > > needed before the doorbell write, to make clear it cannot somehow be done > > earlier in case something attempts to optimize it away. > > > I think we can remove it entirely (compiler can't optimize the ci_db store > since in depends in previous code (cons_index). Right, however you may still run into issues if the compiler determines the final cons_index value by looking at the loop and decides to store it before entering/leaving it. That's the kind of problematic optimization I was thinking of. The barrier in that sense is just to assert the order of seemingly unrelated load/stores. > > > sq->tail = sq->tail + nr_txbbs; > > > /* Update the list of packets posted for transmission. */ > > > elts_comp -= pkts; > > > @@ -321,6 +321,7 @@ static int handle_multi_segs(struct rte_mbuf *buf, > > > * control segment. > > > */ > > > if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) { > > > +#if RTE_CACHE_LINE_SIZE < 64 > > > /* > > > * Need a barrier here before writing the byte_count > > > * fields to make sure that all the data is visible @@ - > > 331,6 > > > +332,7 @@ static int handle_multi_segs(struct rte_mbuf *buf, > > > * data, and end up sending the wrong data. > > > */ > > > rte_io_wmb(); > > > +#endif /* RTE_CACHE_LINE_SIZE */ > > > > Interesting one. > > > > > dseg->byte_count = byte_count; > > > } else { > > > /* > > > @@ -469,8 +471,7 @@ static int handle_multi_segs(struct rte_mbuf *buf, > > > break; > > > } > > > #endif /* NDEBUG */ > > > - /* Need a barrier here before byte count store. */ > > > - rte_io_wmb(); > > > + /* Never be TXBB aligned, no need compiler barrier. > > */ > > > > The reason there was a barrier here at all was unclear, so if it's really > > useless, > > you don't even need to describe why. > > It is because there is a barrier in multi segment similar stage. > I think it can help for future review. OK. > > > > > dseg->byte_count = rte_cpu_to_be_32(buf- > > >data_len); > > > > > > /* Fill the control parameters for this packet. */ @@ - > > 533,7 > > > +534,7 @@ static int handle_multi_segs(struct rte_mbuf *buf, > > > * setting ownership bit (because HW can start > > > * executing as soon as we do). > > > */ > > > - rte_wmb(); > > > + rte_io_wmb(); > > > > This one looks dangerous. A compiler barrier is not strong enough to > > guarantee the order in which CPU will execute instructions, it only makes > > sure what follows the barrier doesn't appear before it in the generated > > code. > > > As I investigated, I understood that for CPUs which don't save store order > between different addresses(arm,ppc), the rte_io_wmb is converted to rte_wmb. > So for thus who save it(x86) we just need the right order in compiler code > because all the relevant stores are targeted to same memory domain(DRAM) and > therefore also the actual store is guaranteed. > Unlike doorbell store which directed to different memory domain (PCI). > So the only place which need rte_wmb() is before doorbell write. Fair enough, although after re-reading the code I think there's another issue present since the beginning: both ctrl and dseg pointers are not volatile, this fact doesn't guarantee intermediate writes will occur in the expected order or even at all, even in the presence of a barrier. The volatile attribute should be inherited from both struct mlx4_cq and struct mlx4_sq (buf, db and most if not all other pointers). I think a separate fixes commit should add it for safety. > > Unless the comment above this barrier is wrong, this change may cause hard- > > to-debug issues down the road, you should drop it. > > > > > ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode | > > > ((sq->head & sq->txbb_cnt) ? > > > MLX4_BIT_WQE_OWN : > > 0)); > > > -- > > > 1.8.3.1 > > > > > > > -- > > Adrien Mazarguil > > 6WIND > > Thanks! -- Adrien Mazarguil 6WIND