> > > > > > > > To avoid information loss I combined reply to two Wathsala > > > > replies into one. > > > > > > > > > > > > > > > The function __rte_ring_headtail_move_head() assumes that > > > > > > > the > > > > > > > barrier > > > > > > (fence) between the load of the head and the load-acquire of > > > > > > the > > > > > > > opposing tail guarantees the following: if a first thread > > > > > > > reads > > > > > > > tail > > > > > > > and then writes head and a second thread reads the new > > > > > > > value of > > > > > > > head > > > > > > > and then reads tail, then it should observe the same (or a > > > > > > > later) > > > > > > > value of tail. > > > > > > > > > > > > > > This assumption is incorrect under the C11 memory model. If > > > > > > > the > > > > > > > barrier > > > > > > > (fence) is intended to establish a total ordering of ring > > > > > > > operations, > > > > > > > it fails to do so. Instead, the current implementation only > > > > > > > enforces a > > > > > > > partial ordering, which can lead to unsafe interleavings. > > > > > > > In > > > > > > > particular, > > > > > > > some partial orders can cause underflows in free slot or > > > > > > > available > > > > > > > element computations, potentially resulting in data > > > > > > > corruption. > > > > > > > > > > > > Hmm... sounds exactly like the problem from the patch we > > > > > > discussed > > > > > > earlier that year: > > > > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936- > 4 > > > > > > - > > > [email protected] <mailto:20250521111432.207936-4- > > > [email protected]>/ > > > > > > In two words: > > > > > > "... thread can see 'latest' 'cons.head' value, with > > > > > > 'previous' value > > > > > > for 'prod.tail' or visa-versa. > > > > > > In other words: 'cons.head' value depends on 'prod.tail', so > > > > > > before > > > > > > making latest 'cons.head' > > > > > > value visible to other threads, we need to ensure that latest > > > > > > 'prod.tail' is also visible." > > > > > > Is that the one? > > > > > > > > > > > > > Yes, the behavior occurs under RCpc (LDAPR) but not under RCsc > > > > > (LDAR), > > > > > which is why we didn’t catch it earlier. A fuller explanation, > > > > > with > > > > > Herd7 simulations, is in the blog post linked in the cover > > > > > letter. > > > > > > > > > > https://community.arm.com/arm-community-blogs/b/architectures-and > > > > > - > > > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls- > > > of-partial-order > > > <https://community.arm.com/arm-community-blogs/b/architectures-and- > > > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls- > > > of-partial-order> > > > > > > > > > > > > I see, so now it is reproducible with core rte_ring on real HW. > > > > > > > > > > > > > > > > > > > > > The issue manifests when a CPU first acts as a producer and > > > > > > > later > > > > > > > as a > > > > > > > consumer. In this scenario, the barrier assumption may fail > > > > > > > when > > > > > > > another > > > > > > > core takes the consumer role. A Herd7 litmus test in C11 > > > > > > > can > > > > > > > demonstrate > > > > > > > this violation. The problem has not been widely observed so > > > > > > > far > > > > > > > because: > > > > > > > (a) on strong memory models (e.g., x86-64) the assumption > > > > > > > holds, > > > > > > > and > > > > > > > (b) on relaxed models with RCsc semantics the ordering is > > > > > > > still > > > > > > > strong > > > > > > > enough to prevent hazards. > > > > > > > The problem becomes visible only on weaker models, when > > > > > > > load- > > > > > > > acquire is > > > > > > > implemented with RCpc semantics (e.g. some AArch64 CPUs > > > > > > > which > > > > > > > support > > > > > > > the LDAPR and LDAPUR instructions). > > > > > > > > > > > > > > Three possible solutions exist: > > > > > > > 1. Strengthen ordering by upgrading release/acquire > > > > > > > semantics to > > > > > > > sequential consistency. This requires using seq-cst for > > > > > > > stores, > > > > > > > loads, and CAS operations. However, this approach > > > > > > > introduces a > > > > > > > significant performance penalty on relaxed-memory > > > > > > > architectures. > > > > > > > > > > > > > > 2. Establish a safe partial order by enforcing a pair-wise > > > > > > > happens-before relationship between thread of same role by > > > > > > > changing > > > > > > > the CAS and the preceding load of the head by converting > > > > > > > them > > > > > > > to > > > > > > > release and acquire respectively. This approach makes the > > > > > > > original > > > > > > > barrier assumption unnecessary and allows its removal. > > > > > > > > > > > > For the sake of clarity, can you outline what would be exact > > > > > > code > > > > > > changes for > > > > > > approach #2? Same as in that patch: > > > > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936- > 4 > > > > > > - > > > < > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936- > > > 4-> > > > > > [email protected] <mailto: > > > > > [email protected]>/ > > > > > > Or something different? > > > > > > > > > > Sorry, I missed the later half you your comment before. > > > > > Yes, you have proposed the same solution there. > > > > > > > > > > > > Ok, thanks for confirmation. > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. Retain partial ordering but ensure only safe partial > > > > > > > orders > > > > > > > are > > > > > > > committed. This can be done by detecting underflow > > > > > > > conditions > > > > > > > (producer < consumer) and quashing the update in such > > > > > > > cases. > > > > > > > This approach makes the original barrier assumption > > > > > > > unnecessary > > > > > > > and allows its removal. > > > > > > > > > > > > > This patch implements solution (3) for performance reasons. > > > > > > > > > > > > > > Signed-off-by: Wathsala Vithanage > > > > > > > <[email protected] > > > <mailto:[email protected]>> > > > > > > > Signed-off-by: Ola Liljedahl <[email protected] > > > <mailto:[email protected]>> > > > > > > > Reviewed-by: Honnappa Nagarahalli > > > > > > > <[email protected] > > > <mailto:[email protected]>> > > > > > > > Reviewed-by: Dhruv Tripathi <[email protected] > > > <mailto:[email protected]>> > > > > > > > --- > > > > > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++--- > > > > > > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > > > > > > > > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h > > > > > > > b/lib/ring/rte_ring_c11_pvt.h > > > > > > > index b9388af0da..e5ac1f6b9e 100644 > > > > > > > --- a/lib/ring/rte_ring_c11_pvt.h > > > > > > > +++ b/lib/ring/rte_ring_c11_pvt.h > > > > > > > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct > > > > > > > rte_ring_headtail > > > > > > > *d, > > > > > > > /* Reset n to the initial burst count */ > > > > > > > n = max; > > > > > > > > > > > > > > - /* Ensure the head is read before tail */ > > > > > > > - rte_atomic_thread_fence(rte_memory_order_acquire); > > > > > > > - > > > > > > > /* load-acquire synchronize with store-release of > > > > > > > ht->tail > > > > > > > * in update_tail. > > > > > > > */ > > > > > > > > > > > > But then cons.head can be read a before prod.tail (and visa- > > > > > > versa), > > > > > > right? > > > > > > > > > > Right, we let it happen but eliminate any resulting states that > > > > > are > > > > > semantically incorrect at the end. > > > > > > > > > > > > Two comments here: > > > > 1) I think it is probably safer to do the check like that: > > > > If (*entries > ring->capacity) ... > > > Yes, this might be another way of handling underflow situations. We > > > could study > > > this. > > > > > > I have used the check for negative without problems in my ring > > > buffer > > > implementations > > > https://github.com/ARM-software/progress64/blob/master/src/p64_ringbuf.c > > > but can't say that has been battle-tested. > > > > My thought was about the case (probably hypothetical) when the > > difference > > between stale tail and head will be bigger then 2^31 + 1. > > > > > > 2) My concern that without forcing a proper read ordering > > > > (cons.head first then prod.tail) we re-introduce a window for all > > > > sorts of > > > > ABA-like problems. > > The ABA like problem you are referring here is index wrapping around > then reaching the point of not resulting a negative value I assume. > If so, the distance between a stale (tail) and a wrapped around head > has to be at least 0x80000000.
Yes, I think so. > > > Head and tail indexes are monotonically increasing so I don't see a > > > risk for ABA-like > > > problems. > > > > I understand that, but with current CPU speeds it can take rte_ring > > just few seconds to > > wrap around head/tail values. If user doing something really fancy - > > like using rte_ring ZC API > > (i.e. just moving head/tail without reading actual objects) that can > > probably happen even > > faster (less than a second?). > > Are we sure that the stale tail value will never persist that long? > > Let say user calling move_head() in a loop till it succeeds? > > > > Systems with fast CPUs may also have shorter window of inconsistency. > > > > Indeed, adding a monotonically increasing tag to pointers is the > > > common way of > > > avoiding ABA > > > problems in lock-free designs. > > > > Yep, using 64-bit values for head/tail counters will help to avoid > > these concerns. > > But it will probably break HTS/RTS modes, plus it is an ABI change > > for sure. > > > > Actually after another thought, I have one more concern here: > > > > + /* > > + * Ensure the entries calculation was not based on a > > stale > > + * and unsafe stail observation that causes > > underflow. > > + */ > > + if ((int)*entries < 0) > > + *entries = 0; > > + > > > > With that change, it might return not-valid information back to the > > user > > about number of free/occupied entries in the ring. > > Plus rte_ring_enqueue() now might fail even when there are enough > > free entries > > in the ring (same for dequeue). > > That looks like a change in our public API behavior that might break > > many things. > > There are quite few places when caller expects enqueue/dequeue > > operation to always succeed (let say there always should be enough > > free space in the ring). > > For example: rte_mempool works like that. > > I am pretty sure there are quite few other places like that inside > > DPDK, > > not to mention third-party code. > > > > Considering all of the above, I am actually more in favor > > to combine approaches #2 and #3 for the final patch: > > establish a safe partial order (#2) and keep the check from #3 > > (should it become an assert()/verify()?) > > > > #2 is OK too, difference in performance is only meaningful when at > single core Figure 12 in blog post. Yes, same thought here. > But why combine #3? Litmus tests prove that this state won't be reached > in #2 (Figure 10). My intention was to convert it to RTE_ASSERT(). To help people catch situations when things went completely wrong. Though if you believe it is completely unnecessary - I am ok with pure #2 approach. > > Another thing to note: whatever final approach we choose - > > we need to make sure that the problem is addressed across all other > > rte_ring flavors/modes too (generic implementation, rts/hts mode, > > soring). > > Agree. Ok, ping me if you need a hand with v2. Thanks Konstantin

