tail update

Konstantin Ananyev Mon, 22 Sep 2025 00:30:14 -0700


> > > >
> > > > To avoid information loss I combined reply to two Wathsala
> > > > replies into one.
> > > >
> > > >
> > > > > > > The function __rte_ring_headtail_move_head() assumes that
> > > > > > > the
> > > > > > > barrier
> > > > > > (fence) between the load of the head and the load-acquire of
> > > > > > the
> > > > > > > opposing tail guarantees the following: if a first thread
> > > > > > > reads
> > > > > > > tail
> > > > > > > and then writes head and a second thread reads the new
> > > > > > > value of
> > > > > > > head
> > > > > > > and then reads tail, then it should observe the same (or a
> > > > > > > later)
> > > > > > > value of tail.
> > > > > > >
> > > > > > > This assumption is incorrect under the C11 memory model. If
> > > > > > > the
> > > > > > > barrier
> > > > > > > (fence) is intended to establish a total ordering of ring
> > > > > > > operations,
> > > > > > > it fails to do so. Instead, the current implementation only
> > > > > > > enforces a
> > > > > > > partial ordering, which can lead to unsafe interleavings.
> > > > > > > In
> > > > > > > particular,
> > > > > > > some partial orders can cause underflows in free slot or
> > > > > > > available
> > > > > > > element computations, potentially resulting in data
> > > > > > > corruption.
> > > > > >
> > > > > > Hmm... sounds exactly like the problem from the patch we
> > > > > > discussed
> > > > > > earlier that year:
> > > > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-
> 4
> > > > > > -
> > > [email protected] <mailto:20250521111432.207936-4-
> > > [email protected]>/
> > > > > > In two words:
> > > > > > "... thread can see 'latest' 'cons.head' value, with
> > > > > > 'previous' value
> > > > > > for 'prod.tail' or visa-versa.
> > > > > > In other words: 'cons.head' value depends on 'prod.tail', so
> > > > > > before
> > > > > > making latest 'cons.head'
> > > > > > value visible to other threads, we need to ensure that latest
> > > > > > 'prod.tail' is also visible."
> > > > > > Is that the one?
> > > >
> > > >
> > > > > Yes, the behavior occurs under RCpc (LDAPR) but not under RCsc
> > > > > (LDAR),
> > > > > which is why we didn’t catch it earlier. A fuller explanation,
> > > > > with
> > > > > Herd7 simulations, is in the blog post linked in the cover
> > > > > letter.
> > > > >
> > > > > https://community.arm.com/arm-community-blogs/b/architectures-and
> > > > > -
> > > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-
> > > of-partial-order
> > > <https://community.arm.com/arm-community-blogs/b/architectures-and-
> > > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-
> > > of-partial-order>
> > > >
> > > >
> > > > I see, so now it is reproducible with core rte_ring on real HW.
> > > >
> > > >
> > > > > >
> > > > > > > The issue manifests when a CPU first acts as a producer and
> > > > > > > later
> > > > > > > as a
> > > > > > > consumer. In this scenario, the barrier assumption may fail
> > > > > > > when
> > > > > > > another
> > > > > > > core takes the consumer role. A Herd7 litmus test in C11
> > > > > > > can
> > > > > > > demonstrate
> > > > > > > this violation. The problem has not been widely observed so
> > > > > > > far
> > > > > > > because:
> > > > > > > (a) on strong memory models (e.g., x86-64) the assumption
> > > > > > > holds,
> > > > > > > and
> > > > > > > (b) on relaxed models with RCsc semantics the ordering is
> > > > > > > still
> > > > > > > strong
> > > > > > > enough to prevent hazards.
> > > > > > > The problem becomes visible only on weaker models, when
> > > > > > > load-
> > > > > > > acquire is
> > > > > > > implemented with RCpc semantics (e.g. some AArch64 CPUs
> > > > > > > which
> > > > > > > support
> > > > > > > the LDAPR and LDAPUR instructions).
> > > > > > >
> > > > > > > Three possible solutions exist:
> > > > > > > 1. Strengthen ordering by upgrading release/acquire
> > > > > > > semantics to
> > > > > > > sequential consistency. This requires using seq-cst for
> > > > > > > stores,
> > > > > > > loads, and CAS operations. However, this approach
> > > > > > > introduces a
> > > > > > > significant performance penalty on relaxed-memory
> > > > > > > architectures.
> > > > > > >
> > > > > > > 2. Establish a safe partial order by enforcing a pair-wise
> > > > > > > happens-before relationship between thread of same role by
> > > > > > > changing
> > > > > > > the CAS and the preceding load of the head by converting
> > > > > > > them
> > > > > > > to
> > > > > > > release and acquire respectively. This approach makes the
> > > > > > > original
> > > > > > > barrier assumption unnecessary and allows its removal.
> > > > > >
> > > > > > For the sake of clarity, can you outline what would be exact
> > > > > > code
> > > > > > changes for
> > > > > > approach #2? Same as in that patch:
> > > > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-
> 4
> > > > > > -
> > > <
> > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-
> > > 4->
> > > > > [email protected] <mailto:
> > > > > [email protected]>/
> > > > > > Or something different?
> > > > >
> > > > > Sorry, I missed the later half you your comment before.
> > > > > Yes, you have proposed the same solution there.
> > > >
> > > >
> > > > Ok, thanks for confirmation.
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > > > 3. Retain partial ordering but ensure only safe partial
> > > > > > > orders
> > > > > > > are
> > > > > > > committed. This can be done by detecting underflow
> > > > > > > conditions
> > > > > > > (producer < consumer) and quashing the update in such
> > > > > > > cases.
> > > > > > > This approach makes the original barrier assumption
> > > > > > > unnecessary
> > > > > > > and allows its removal.
> > > > > >
> > > > > > > This patch implements solution (3) for performance reasons.
> > > > > > >
> > > > > > > Signed-off-by: Wathsala Vithanage
> > > > > > > <[email protected]
> > > <mailto:[email protected]>>
> > > > > > > Signed-off-by: Ola Liljedahl <[email protected]
> > > <mailto:[email protected]>>
> > > > > > > Reviewed-by: Honnappa Nagarahalli
> > > > > > > <[email protected]
> > > <mailto:[email protected]>>
> > > > > > > Reviewed-by: Dhruv Tripathi <[email protected]
> > > <mailto:[email protected]>>
> > > > > > > ---
> > > > > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > > > > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h
> > > > > > > b/lib/ring/rte_ring_c11_pvt.h
> > > > > > > index b9388af0da..e5ac1f6b9e 100644
> > > > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > > > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct
> > > > > > > rte_ring_headtail
> > > > > > > *d,
> > > > > > > /* Reset n to the initial burst count */
> > > > > > > n = max;
> > > > > > >
> > > > > > > - /* Ensure the head is read before tail */
> > > > > > > - rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > > > > -
> > > > > > > /* load-acquire synchronize with store-release of
> > > > > > > ht->tail
> > > > > > > * in update_tail.
> > > > > > > */
> > > > > >
> > > > > > But then cons.head can be read a before prod.tail (and visa-
> > > > > > versa),
> > > > > > right?
> > > > >
> > > > > Right, we let it happen but eliminate any resulting states that
> > > > > are
> > > > > semantically incorrect at the end.
> > > >
> > > >
> > > > Two comments here:
> > > > 1) I think it is probably safer to do the check like that:
> > > > If (*entries > ring->capacity) ...
> > > Yes, this might be another way of handling underflow situations. We
> > > could study
> > > this.
> > >
> > > I have used the check for negative without problems in my ring
> > > buffer
> > > implementations
> > > https://github.com/ARM-software/progress64/blob/master/src/p64_ringbuf.c
> > > but can't say that has been battle-tested.
> >
> > My thought was about the case (probably hypothetical) when the
> > difference
> > between stale tail and head will be bigger then 2^31 + 1.
> >
> > > > 2) My concern that without forcing a proper read ordering
> > > > (cons.head first then prod.tail) we re-introduce a window for all
> > > > sorts of
> > > > ABA-like problems.
> 
> The ABA like problem you are referring here is index wrapping around
> then reaching the point of not resulting a negative value I assume.
> If so, the distance between a stale (tail) and a wrapped around head
> has to be at least 0x80000000.


Yes, I think so. 

> > > Head and tail indexes are monotonically increasing so I don't see a
> > > risk for ABA-like
> > > problems.
> >
> > I understand that, but with current CPU speeds it can take rte_ring
> > just few seconds to
> > wrap around head/tail values. If user doing something really fancy -
> > like using rte_ring ZC API
> > (i.e. just moving head/tail without reading actual objects) that can
> > probably happen even
> > faster (less than a second?).
> > Are we sure that the stale tail value will never persist that long?
> > Let say user calling move_head() in a loop till it succeeds?
> >
> 
> Systems with fast CPUs may also have shorter window of inconsistency.
> 
> > > Indeed, adding a monotonically increasing tag to pointers is the
> > > common way of
> > > avoiding ABA
> > > problems in lock-free designs.
> >
> > Yep, using 64-bit values for head/tail counters will help to avoid
> > these concerns.
> > But it will probably break HTS/RTS modes, plus it is an ABI change
> > for sure.
> >
> > Actually after another thought, I have one more concern here:
> >
> > +               /*
> > +                * Ensure the entries calculation was not based on a
> > stale
> > +                * and unsafe stail observation that causes
> > underflow.
> > +                */
> > +               if ((int)*entries < 0)
> > +                       *entries = 0;
> > +
> >
> > With that change, it might return not-valid information back to the
> > user
> > about number of free/occupied entries in the ring.
> > Plus rte_ring_enqueue() now might fail even when there are enough
> > free entries
> > in the ring (same for dequeue).
> > That looks like a change in our public API behavior that might break
> > many things.
> > There are quite few places when caller expects enqueue/dequeue
> > operation to always succeed (let say there always should be enough
> > free space in the ring).
> > For example: rte_mempool works like that.
> > I am pretty sure there are quite few other places like that inside
> > DPDK,
> > not to mention third-party code.
> >
> > Considering all of the above, I am actually more in favor
> > to combine approaches #2 and #3 for the final patch:
> > establish a safe partial order (#2) and keep the check from #3
> > (should it become an assert()/verify()?)
> >
> 
> #2 is OK too, difference in performance is only meaningful when at
> single core Figure 12 in blog post.

Yes, same thought here.

> But why combine #3? Litmus tests prove that this state won't be reached
> in #2 (Figure 10).

My intention was to convert it to RTE_ASSERT().
To help people catch situations when things went completely wrong.
Though if you believe it is completely unnecessary - I am ok with pure #2 
approach. 

> > Another thing to note: whatever final approach we choose -
> > we need to make sure that the problem is addressed across all other
> > rte_ring flavors/modes too (generic implementation, rts/hts mode,
> > soring).
> 
> Agree.

Ok, ping me if you need a hand with v2.
Thanks
Konstantin

RE: [PATCH 1/1] ring: safe partial ordering for head/tail update

Reply via email to