-----Original Message-----
> Date: Tue, 17 Jul 2018 11:54:18 +0900
> From: Takeshi Yoshimura <t.yoshimura8...@gmail.com>
> To: Jerin Jacob <jerin.ja...@caviumnetworks.com>
> Cc: dev@dpdk.org, sta...@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
>
Cc: olivier.m...@6wind.com
Cc: chao...@linux.vnet.ibm.com
Cc: konstantin.anan...@intel.com

>
> > Adding rte_smp_rmb() causes a performance regression on non-x86 platforms.
> > Having said that, a load-load barrier can be expressed very well with the
> > C11 memory model. I guess ppc64 supports the C11 memory model. If so,
> > could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> > the original issue?
>
> Yes, the performance regression happens on non-x86 with single
> producer/consumer.
> The average latency of an enqueue increased from 21 nsec to 24 nsec in my
> simple experiment. But I think it is worth it.

That varies from machine to machine. What is the burst size, etc.?

>
>
> I also tested the C11 rte_ring; however, it caused the same race condition
> on ppc64.
> I tried to fix the C11 problem as well, but I also found that the C11
> rte_ring had other potentially incorrect choices of memory orders, which
> caused another race condition on ppc64.

Does it happen on all ppc64 machines, or on a specific machine?
Are the following tests passing on your system without the patch?
test/test/test_ring_perf.c
test/test/test_ring.c

>
> For example,
> __ATOMIC_ACQUIRE is passed to __atomic_compare_exchange_n(), but
> I am not sure why the load-acquire is used for the compare exchange.

It is correct as per C11 acquire and release semantics.

> Also in update_tail, the pause can be called before the data copy because
> of the ht->tail load without atomic_load_n.
>
> Memory ordering is simply difficult, so it might take a bit longer
> to check whether the code is correct. I think I can fix the C11 rte_ring
> in another patch.
>
> >>
> >> SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
> >> It uses a single consumer and multiple producers for a rte_ring.
> >> The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().
> >
> > Adding rte_smp_rmb() causes a performance regression on non-x86 platforms.
> > Having said that, a load-load barrier can be expressed very well with the
> > C11 memory model. I guess ppc64 supports the C11 memory model. If so,
> > could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> > the original issue?
> >
> >>
> >> The reordered loads happened on r->prod.tail in

There is rte_smp_rmb() just before reading r->prod.tail in
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
__rte_ring_move_cons_head(). Would that not satisfy the requirement?

Can you try adding a compiler barrier and see whether the compiler is
reordering the loads?

DPDK's ring implementation is based on FreeBSD's ring implementation, and I
don't see the need for such a barrier there:

https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h

If it is something specific to ppc64 or to a specific ppc64 machine, we
could add a compile option, as it is arch specific, to avoid the performance
impact on other architectures.

> >> __rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
> >> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
> >> dependency, but the code does not satisfy it. Note that they are
> >> not reordered if __rte_ring_move_cons_head() with is_sc != 1 because
> >> cmpset invokes a read barrier.
> >>
> >> The paired stores on these loads are in ENQUEUE_PTRS() and
> >> update_tail(). Simplified code around the reorder is the following.
> >>
> >> Consumer             Producer
> >> load idx[ring]
> >>                      store idx[ring]
> >>                      store r->prod.tail
> >> load r->prod.tail
> >>
> >> In this case, the consumer loads old idx[ring] and confirms the load
> >> is valid with the new r->prod.tail.
> >>
> >> I added a read barrier in the case where __IS_SC is passed to
> >> __rte_ring_move_cons_head(). I also fixed __rte_ring_move_prod_head()
> >> to avoid similar problems with a single producer.
> >>
> >> Cc: sta...@dpdk.org
> >>
> >> Signed-off-by: Takeshi Yoshimura <t...@jp.ibm.com>
> >> ---
> >>  lib/librte_ring/rte_ring_generic.h | 10 ++++++----
> >>  1 file changed, 6 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
> >> index ea7dbe5b9..477326180 100644
> >> --- a/lib/librte_ring/rte_ring_generic.h
> >> +++ b/lib/librte_ring/rte_ring_generic.h
> >> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> >>                         return 0;
> >>
> >>                 *new_head = *old_head + n;
> >> -               if (is_sp)
> >> +               if (is_sp) {
> >> +                       rte_smp_rmb();
> >>                         r->prod.head = *new_head, success = 1;
> >> -               else
> >> +               } else
> >>                         success = rte_atomic32_cmpset(&r->prod.head,
> >>                                         *old_head, *new_head);
> >>         } while (unlikely(success == 0));
> >> @@ -158,9 +159,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
> >>                         return 0;
> >>
> >>                 *new_head = *old_head + n;
> >> -               if (is_sc)
> >> +               if (is_sc) {
> >> +                       rte_smp_rmb();
> >>                         r->cons.head = *new_head, success = 1;
> >> -               else
> >> +               } else
> >>                         success = rte_atomic32_cmpset(&r->cons.head, *old_head,
> >>                                         *new_head);
> >>         } while (unlikely(success == 0));
> >> --
> >> 2.17.1
> >>
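
For reference, the ordering under discussion can be sketched with C11
atomics. This is a minimal single-producer/single-consumer illustration,
not DPDK's rte_ring code: struct sketch_ring and the sketch_* functions are
hypothetical names, and the layout is an assumption made only to keep the
example short. The acquire load of the producer tail in sketch_dequeue()
plays the role of the load-load ordering the patch requests with
rte_smp_rmb(), and it pairs with the release store in sketch_enqueue().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RING_SIZE 8u	/* assumed size; must be a power of two */
#define SKETCH_RING_MASK (SKETCH_RING_SIZE - 1u)

/* Hypothetical stand-in for struct rte_ring; not the DPDK layout. */
struct sketch_ring {
	_Atomic uint32_t prod_tail;	/* published by the producer */
	_Atomic uint32_t cons_tail;	/* published by the consumer */
	void *slots[SKETCH_RING_SIZE];
};

static void
sketch_ring_init(struct sketch_ring *r)
{
	assert((SKETCH_RING_SIZE & SKETCH_RING_MASK) == 0);
	atomic_init(&r->prod_tail, 0);
	atomic_init(&r->cons_tail, 0);
}

/* Single producer: returns 1 on success, 0 if the ring is full. */
static int
sketch_enqueue(struct sketch_ring *r, void *obj)
{
	uint32_t tail = atomic_load_explicit(&r->prod_tail,
					     memory_order_relaxed);
	/* Acquire: do not overwrite a slot before the consumer has left it. */
	uint32_t cons = atomic_load_explicit(&r->cons_tail,
					     memory_order_acquire);

	if (tail - cons == SKETCH_RING_SIZE)
		return 0;
	r->slots[tail & SKETCH_RING_MASK] = obj;	/* "store idx[ring]" */
	/* Release: the slot store becomes visible before the new tail. */
	atomic_store_explicit(&r->prod_tail, tail + 1, memory_order_release);
	return 1;
}

/* Single consumer: returns 1 on success, 0 if the ring is empty. */
static int
sketch_dequeue(struct sketch_ring *r, void **obj)
{
	uint32_t head = atomic_load_explicit(&r->cons_tail,
					     memory_order_relaxed);
	/* Acquire: the slot load below cannot be hoisted above this load. */
	uint32_t prod = atomic_load_explicit(&r->prod_tail,
					     memory_order_acquire);

	if (prod == head)
		return 0;
	*obj = r->slots[head & SKETCH_RING_MASK];	/* "load idx[ring]" */
	atomic_store_explicit(&r->cons_tail, head + 1, memory_order_release);
	return 1;
}
```

With a relaxed load of prod_tail in sketch_dequeue(), a weakly ordered CPU
would be allowed to perform the slot load first, which is the interleaving
shown in the Consumer/Producer diagram above. Whether acquire/release or an
explicit barrier is cheaper on a given architecture is a separate question
that this sketch does not answer.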