On Mon, 2019-01-28 at 18:54 +0000, Eads, Gage wrote:
> 
> > 
> > -----Original Message-----
> > From: Ola Liljedahl [mailto:ola.liljed...@arm.com]
> > Sent: Monday, January 28, 2019 4:36 AM
> > To: jer...@marvell.com; mcze...@marvell.com; Eads, Gage
> > <gage.e...@intel.com>; dev@dpdk.org
> > Cc: olivier.m...@6wind.com; step...@networkplumber.org; nd
> > <n...@arm.com>; Richardson, Bruce <bruce.richard...@intel.com>;
> > arybche...@solarflare.com; Ananyev, Konstantin
> > <konstantin.anan...@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking
> > implementation
> > 
> > On Fri, 2019-01-25 at 17:21 +0000, Eads, Gage wrote:
> > > 
> > > 
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Ola Liljedahl [mailto:ola.liljed...@arm.com]
> > > > Sent: Wednesday, January 23, 2019 4:16 AM
> > > > To: Eads, Gage <gage.e...@intel.com>; dev@dpdk.org
> > > > Cc: olivier.m...@6wind.com; step...@networkplumber.org; nd
> > > > <n...@arm.com>; Richardson, Bruce <bruce.richard...@intel.com>;
> > > > arybche...@solarflare.com; Ananyev, Konstantin
> > > > <konstantin.anan...@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking
> > > > implementation
> > > > 
> > > > On Tue, 2019-01-22 at 21:31 +0000, Eads, Gage wrote:
> > > > > 
> > > > > 
> > > > > Hi Ola,
> > > > > 
> > > > > <snip>
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
> > > > > > >  #endif
> > > > > > >  #include "rte_ring_generic_64.h"
> > > > > > > 
> > > > > > > +/* @internal 128-bit structure used by the non-blocking ring */
> > > > > > > +struct nb_ring_entry {
> > > > > > > +	void *ptr; /**< Data pointer */
> > > > > > > +	uint64_t cnt; /**< Modification counter */
> > > > > > Why not make 'cnt' uintptr_t? This way 32-bit architectures will
> > > > > > also be supported. I think there are some claims that DPDK still
> > > > > > supports e.g.
> > > > > > ARMv7a
> > > > > > and possibly also 32-bit x86?
> > > > > I chose a 64-bit modification counter because (practically
> > > > > speaking) the ABA problem will not occur with such a large counter
> > > > > -- definitely not within my lifetime. See the "Discussion" section
> > > > > of the commit message for more information.
> > > > > 
> > > > > With a 32-bit counter, there is a very (very) low likelihood of
> > > > > it, but it is possible. Personally, I don't feel comfortable
> > > > > providing such code, because a) I doubt all users would understand
> > > > > the implementation well enough to do the risk/reward analysis, and
> > > > > b) such a bug would be near impossible to reproduce and root-cause
> > > > > if it did occur.
> > > > With a 64-bit counter (and 32-bit pointer), 32-bit architectures (e.g.
> > > > ARMv7a and
> > > > probably x86 as well) won't be able to support this as they at best
> > > > support 64-bit CAS (ARMv7a has LDREXD/STREXD). So you are
> > > > essentially putting a 64-bit (and 128-bit CAS) requirement on the
> > > > implementation.
> > > > 
> > > Yes, I am. I tried to make that clear in the cover letter.
> > > 
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > +};
> > > > > > > +
> > > > > > > +/* The non-blocking ring algorithm is based on the original rte
> > > > > > > + * ring (derived from FreeBSD's bufring.h) and inspired by
> > > > > > > + * Michael and Scott's non-blocking concurrent queue.
> > > > > > > + */
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * @internal
> > > > > > > + *   Enqueue several objects on the non-blocking ring (single-producer only)
> > > > > > > + *
> > > > > > > + * @param r
> > > > > > > + *   A pointer to the ring structure.
> > > > > > > + * @param obj_table
> > > > > > > + *   A pointer to a table of void * pointers (objects).
> > > > > > > + * @param n
> > > > > > > + *   The number of objects to add in the ring from the obj_table.
> > > > > > > + * @param behavior
> > > > > > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
> > > > > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
> > > > > > > + * @param free_space
> > > > > > > + *   returns the amount of space after the enqueue operation has finished
> > > > > > > + * @return
> > > > > > > + *   Actual number of objects enqueued.
> > > > > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > > > > + */
> > > > > > > +static __rte_always_inline unsigned int
> > > > > > > +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
> > > > > > > +			    unsigned int n,
> > > > > > > +			    enum rte_ring_queue_behavior behavior,
> > > > > > > +			    unsigned int *free_space)
> > > > > > > +{
> > > > > > > +	uint32_t free_entries;
> > > > > > > +	size_t head, next;
> > > > > > > +
> > > > > > > +	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> > > > > > > +					 &head, &next, &free_entries);
> > > > > > > +	if (n == 0)
> > > > > > > +		goto end;
> > > > > > > +
> > > > > > > +	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > > > > > > +
> > > > > > > +	r->prod_64.tail += n;
> > > > > > Don't we need release ordering (or an smp_wmb) between writing the
> > > > > > ring pointers and the update of tail? By updating the tail pointer,
> > > > > > we are synchronising with a consumer.
> > > > > > 
> > > > > > I prefer using __atomic operations even for load and store. You
> > > > > > can see which parts of the code that synchronise with each other,
> > > > > > e.g.
> > > > > > store-release to some location synchronises with load-acquire
> > > > > > from the same location. If you don't know how different threads
> > > > > > synchronise with each other, you are very likely to make mistakes.
> > > > > > 
> > > > > You can tell this code was written when I thought x86-64 was the
> > > > > only viable target :). Yes, you are correct.
> > > > > 
> > > > > With regards to using __atomic intrinsics, I'm planning on taking
> > > > > a similar approach to the functions duplicated in
> > > > > rte_ring_generic.h and
> > > > > rte_ring_c11_mem.h: one version that uses rte_atomic functions
> > > > > (and thus stricter memory ordering) and one that uses __atomic
> > > > > intrinsics (and thus can benefit from more relaxed memory ordering).
> > From a code point of view, I strongly prefer the atomic operations to be
> > visible
> > in the top level code, not hidden in subroutines. For correctness, it is
> > vital that
> > memory accesses are performed with the required ordering and that acquire
> > and
> > release matches up. Hiding e.g. load-acquire and store-release in
> > subroutines (in a different file!) makes this difficult. There have
> > already been such bugs found in rte_ring.
> > 
> After working on the acq/rel ordering this weekend, I agree. This'll be
> easier/cleaner if we end up only using the C11 version.
Fabulous!

As I wrote in a response to Jerin, with a small cheat (a LoadStore fence +
relaxed store instead of store-release in the dequeue function, where we only
read shared data in the critical section), C11 should provide the same
ordering and thus the same performance as the explicit barrier version.
Benchmarking will show.

> 
> > 
> > > 
> > > > 
> > > > What's the advantage of having two different implementations? What
> > > > is the disadvantage?
> > > > 
> > > > The existing ring buffer code originally had only the "legacy"
> > > > implementation
> > > > which was kept when the __atomic implementation was added. The
> > > > reason claimed was that some older compilers for x86 do not support
> > > > GCC __atomic builtins. But I thought there was consensus that new
> > > > functionality could have only __atomic implementations.
> > > > 
> > > When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was left
> > > disabled for thunderx[1] for performance reasons. Assuming that hasn't
> > > changed, the advantage to having two versions is to best support all of
> > > DPDK's platforms.
> > > 
> > > The disadvantage is of course duplicated code and the additional
> > > maintenance burden.
> > The only way I see that a C11 memory model implementation can be slower
> > than using smp_wmb/rmb is if you need to order loads before a synchronizing
> > store and there are also outstanding stores which do not require ordering.
> > smp_rmb() handles this, while store-release will also (unnecessarily) order
> > those outstanding stores. This situation occurs e.g. in ring buffer dequeue
> > operations where ring slots are read (and possibly written to thread-private
> > memory) before the ring slots are released (e.g. using CAS-release or
> > store-release).
> > 
> > I imagine that the LSU/cache subsystem on ThunderX/OCTEON-TX also has
> > something to do with this problem. If there is a large number of stores
> > pending in the load/store unit, store-release might have to wait for a long
> > time before the synchronizing store can complete.
> > 
> > > 
> > > 
> > > That said, if the thunderx maintainers are ok with it, I'm certainly
> > > open to only doing the __atomic version. Note that even in the
> > > __atomic version, based on Honnapa's findings[2], using a DPDK-defined
> > > rte_atomic128_cmpset() (with additional arguments to support machines
> > > with weak consistency) appears to be a better option than
> > > __atomic_compare_exchange_16.
> > __atomic_compare_exchange_16() is not guaranteed to be lock-free. It is not
> > lock-free on ARM/AArch64, and the support in GCC is formally broken (can't
> > use cmpxchg16b to implement __atomic_load_16).
> > 
> > So yes, I think DPDK will have to define and implement the 128-bit atomic
> > compare and exchange operation (whatever it will be called). For
> > compatibility
> > with ARMv8.0, we can't require the "old" value returned by a failed
> > compare-exchange operation to be read atomically (LDXP does not guarantee
> > atomicity by itself). But this is seldom a problem: many designs read the
> > memory location using two separate 64-bit loads (so not atomically) anyway;
> > it is a successful atomic compare-exchange operation that provides
> > atomicity.
> > 
> Ok. I agree, I don't expect that to be a problem. The 128-bit CAS patch I just
> submitted[1] (which was developed before reading this) will have to be
> changed.
> 
> [1] http://mails.dpdk.org/archives/dev/2019-January/124159.html
I will take a look and comment on this.

> 
> Thanks,
> Gage
> 
> </snip>
-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl
