On Wed, 2025-09-17 at 08:47 +0100, Bruce Richardson wrote:
> On Tue, Sep 16, 2025 at 06:19:41PM +0000, Ola Liljedahl wrote:
> > (I am sending this from Outlook, I hope I have been able to fake a
> > proper email client)
> >
> > > On 2025-09-16, 17:43, "Bruce Richardson" <bruce.richard...@intel.com> wrote:
> > >
> > >
> > > On Mon, Sep 15, 2025 at 06:54:50PM +0000, Wathsala Vithanage wrote:
> > > > The function __rte_ring_headtail_move_head() assumes that the
> > > > barrier (fence) between the load of the head and the load-acquire
> > > > of the opposing tail guarantees the following: if a first thread
> > > > reads tail and then writes head, and a second thread reads the new
> > > > value of head and then reads tail, then it should observe the same
> > > > (or a later) value of tail.
> > > >
> > > > This assumption is incorrect under the C11 memory model. If the
> > > > barrier (fence) is intended to establish a total ordering of ring
> > > > operations, it fails to do so. Instead, the current implementation
> > > > only enforces a partial ordering, which can lead to unsafe
> > > > interleavings. In particular, some partial orders can cause
> > > > underflows in free slot or available element computations,
> > > > potentially resulting in data corruption.
> > > >
> > > > The issue manifests when a CPU first acts as a producer and later
> > > > as a consumer. In this scenario, the barrier assumption may fail
> > > > when another core takes the consumer role. A Herd7 litmus test in
> > > > C11 can demonstrate this violation. The problem has not been
> > > > widely observed so far because:
> > > > (a) on strong memory models (e.g., x86-64) the assumption holds, and
> > > > (b) on relaxed models with RCsc semantics the ordering is still
> > > > strong enough to prevent hazards.
> > > > The problem becomes visible only on weaker models, when
> > > > load-acquire is implemented with RCpc semantics (e.g. some AArch64
> > > > CPUs which support the LDAPR and LDAPUR instructions).
> > > >
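For reference, here is a minimal herd7-style C11 litmus sketch of the
pattern described above (the names are illustrative, not taken from the
ring code). P0 models the first producer: it load-acquires the opposing
tail and then publishes a new head; the relaxed CAS of the real code is
simplified to a relaxed store. P1 models the next thread, which reads
the new head and then the tail, mirroring __rte_ring_headtail_move_head().
P2 models the consumer releasing a new tail. Under the C11 model herd7
allows the "exists" state, i.e. P1 can observe the new head together
with the stale tail:

  C ring_stale_tail

  {}

  P0(atomic_int *head, atomic_int *tail)
  {
          /* first producer: observes the new tail ... */
          int r0 = atomic_load_explicit(tail, memory_order_acquire);
          /* ... then publishes its new head (relaxed, like the CAS) */
          atomic_store_explicit(head, 1, memory_order_relaxed);
  }

  P1(atomic_int *head, atomic_int *tail)
  {
          /* next thread: observes P0's new head ... */
          int r0 = atomic_load_explicit(head, memory_order_relaxed);
          atomic_thread_fence(memory_order_acquire);
          /* ... yet may still read the old tail */
          int r1 = atomic_load_explicit(tail, memory_order_acquire);
  }

  P2(atomic_int *tail)
  {
          /* consumer: releases a new tail */
          atomic_store_explicit(tail, 1, memory_order_release);
  }

  exists (0:r0=1 /\ 1:r0=1 /\ 1:r1=0)
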
> > > > Three possible solutions exist:
> > > > 1. Strengthen ordering by upgrading release/acquire semantics to
> > > > sequential consistency. This requires using seq-cst for stores,
> > > > loads, and CAS operations. However, this approach introduces a
> > > > significant performance penalty on relaxed-memory architectures.
> > > >
> > > > 2. Establish a safe partial order by enforcing a pair-wise
> > > > happens-before relationship between threads of the same role, by
> > > > converting the CAS and the preceding load of the head to release
> > > > and acquire respectively. This approach makes the original barrier
> > > > assumption unnecessary and allows its removal.
> > > >
> > > > 3. Retain partial ordering but ensure only safe partial orders are
> > > > committed. This can be done by detecting underflow conditions
> > > > (producer < consumer) and quashing the update in such cases. This
> > > > approach makes the original barrier assumption unnecessary and
> > > > allows its removal.
> > > >
> > > > This patch implements solution (3) for performance reasons.
> > > >
> > > > Signed-off-by: Wathsala Vithanage <wathsala.vithan...@arm.com>
> > > > Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > > > Reviewed-by: Dhruv Tripathi <dhruv.tripa...@arm.com>
> > > > ---
> > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > > >
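As a rough sketch of what option (3) amounts to, consider the
producer-side free-slot computation in __rte_ring_headtail_move_head().
The function below is an illustration of the idea only, not the actual
patch hunk; the names (cons_tail, old_prod_head, ...) are simplified
stand-ins for the ring fields:

  #include <stdatomic.h>
  #include <stdint.h>

  /*
   * Compute free slots from a possibly stale consumer tail. If an
   * unsafe partial order was observed (the head we already read has
   * "passed" the stale tail), the unsigned subtraction underflows;
   * detect that and quash the update by reporting 0 free slots so the
   * caller retries instead of committing a corrupted move.
   */
  static inline uint32_t
  free_slots_quash_underflow(uint32_t capacity,
                             const _Atomic uint32_t *cons_tail,
                             uint32_t old_prod_head)
  {
          uint32_t stale_tail =
                  atomic_load_explicit(cons_tail, memory_order_acquire);
          uint32_t free_count = capacity + stale_tail - old_prod_head;

          /* the "negative check" discussed further down this thread */
          if ((int32_t)free_count < 0)
                  return 0;

          return free_count;
  }
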
> > > Thank you for the very comprehensive write-up in the article about
> > > this. It was very educational.
> > >
> > > On the patch, are we sure that option #3 is safe to take as an
> > > approach? It seems wrong to me to deliberately leave ordering issues
> > > in the code and just try and fix them up later. Would there be a
> > > noticeable performance difference for real-world apps if we took
> > > option #2, and actually used correct ordering semantics?
> > I am pretty sure that all necessary orderings for safely transferring
> > elements from producers to consumers (and empty slots from consumers
> > to producers) are present in the code (I still have some questions on
> > the use of memory order relaxed in __rte_ring_update_tail, we should
> > create a litmus test for this, to see what is required by the C memory
> > model). What other metrics for correctness do you suggest?
> >
>
> Not suggesting any other metrics for correctness. I'm instead just
> wanting to double-check the cost-benefit of taking the approach of
> putting in a fix-up, or workaround, in the code here, rather than
> actually correcting the memory ordering use. Also, given that the
> workaround uses subtraction to detect underflow, are we 100% sure that
> we have guaranteed correct behaviour on counter wraparound at uint32_t
> max?
>

I agree, that's a valid concern. If the consumer reads a stale
producer_tail of 0xffffffff and the index has since wrapped around to N,
the negative check detects the effect of the unsafe partial order for
N < 0x80000000, but not for N ≥ 0x80000000. In practice the latter is
unlikely. However, comparing against the ring capacity would be even
safer, since it covers a larger range of N.
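
To make the boundary concrete, here is a tiny standalone C program (the
capacity of 16384 is just an example value) showing the two checks for
the case above, a stale producer_tail of 0xffffffff against a consumer
head that has wrapped around to N:

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          const uint32_t stale_prod_tail = 0xffffffffu;
          const uint32_t capacity = 16384;
          const uint32_t heads[2] = { 0x7fffffffu, 0x80000000u }; /* N */

          for (int i = 0; i < 2; i++) {
                  /* available entries as seen by the consumer */
                  uint32_t entries = stale_prod_tail - heads[i];

                  printf("N=0x%08x entries=0x%08x negative-check=%s "
                         "capacity-check=%s\n",
                         heads[i], entries,
                         (int32_t)entries < 0 ? "caught" : "missed",
                         entries > capacity ? "caught" : "missed");
          }
          return 0;
  }

For N = 0x7fffffff both checks catch the underflow; for N = 0x80000000
the negative check misses it while the capacity comparison still catches
it, which is why the capacity check covers a larger range of N.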

--wathsala


