On Wed, 17 Sep 2025 08:47:53 +0100 Bruce Richardson <bruce.richard...@intel.com> wrote:
> On Tue, Sep 16, 2025 at 06:19:41PM +0000, Ola Liljedahl wrote: > > (I am sending this from Outlook, I hope I have been able to fake a proper > > email client) > > > > > On 2025-09-16, 17:43, "Bruce Richardson" <bruce.richard...@intel.com > > > > <mailto:bruce.richard...@intel.com>> wrote: > > > > > > > > > On Mon, Sep 15, 2025 at 06:54:50PM +0000, Wathsala Vithanage wrote: > > > > The function __rte_ring_headtail_move_head() assumes that the barrier > > > > (fence) between the load of the head and the load-acquire of the > > > > opposing tail guarantees the following: if a first thread reads tail > > > > and then writes head and a second thread reads the new value of head > > > > and then reads tail, then it should observe the same (or a later) > > > > value of tail. > > > > > > > > This assumption is incorrect under the C11 memory model. If the barrier > > > > (fence) is intended to establish a total ordering of ring operations, > > > > it fails to do so. Instead, the current implementation only enforces a > > > > partial ordering, which can lead to unsafe interleavings. In particular, > > > > some partial orders can cause underflows in free slot or available > > > > element computations, potentially resulting in data corruption. > > > > > > > > The issue manifests when a CPU first acts as a producer and later as a > > > > consumer. In this scenario, the barrier assumption may fail when another > > > > core takes the consumer role. A Herd7 litmus test in C11 can demonstrate > > > > this violation. The problem has not been widely observed so far because: > > > > (a) on strong memory models (e.g., x86-64) the assumption holds, and > > > > (b) on relaxed models with RCsc semantics the ordering is still strong > > > > enough to prevent hazards. > > > > The problem becomes visible only on weaker models, when load-acquire is > > > > implemented with RCpc semantics (e.g. some AArch64 CPUs which support > > > > the LDAPR and LDAPUR instructions). > > > > > > > > Three possible solutions exist: > > > > 1. Strengthen ordering by upgrading release/acquire semantics to > > > > sequential consistency. This requires using seq-cst for stores, > > > > loads, and CAS operations. However, this approach introduces a > > > > significant performance penalty on relaxed-memory architectures. > > > > > > > > 2. Establish a safe partial order by enforcing a pair-wise > > > > happens-before relationship between thread of same role by changing > > > > the CAS and the preceding load of the head by converting them to > > > > release and acquire respectively. This approach makes the original > > > > barrier assumption unnecessary and allows its removal. > > > > > > > > 3. Retain partial ordering but ensure only safe partial orders are > > > > committed. This can be done by detecting underflow conditions > > > > (producer < consumer) and quashing the update in such cases. > > > > This approach makes the original barrier assumption unnecessary > > > > and allows its removal. > > > > > > > > This patch implements solution (3) for performance reasons. > > > > > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithan...@arm.com > > > > <mailto:wathsala.vithan...@arm.com>> > > > > Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com > > > > <mailto:ola.liljed...@arm.com>> > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com > > > > <mailto:honnappa.nagaraha...@arm.com>> > > > > Reviewed-by: Dhruv Tripathi <dhruv.tripa...@arm.com > > > > <mailto:dhruv.tripa...@arm.com>> > > > > --- > > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++--- > > > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > > > > Thank you for the very comprehensive write-up in the article about this. > > > It was very educational. > > > > > > > > > On the patch, are we sure that option #3 is safe to take as an approach? > > > It > > > seems wrong to me to deliberately leave ordering issues in the code and > > > just try and fix them up later. Would there be a noticable performance > > > difference for real-world apps if we took option #2, and actually used > > > correct ordering semantics? > > I am pretty sure that all necessary orderings for safely transferring > > elements > > from producers to consumers (and empty slots from consumers to producers) > > are present in the code (I still have some questions on the use of memory > > order relaxed in __rte_ring_update_tail, we should create a litmus test for > > this, to see what is required by the C memory model). What other metrics > > for correctness do you suggest? > > > > Not suggesting any other metrics for correctness. I'm instead just wanting > to double-check the cost-benefit of taking the approach of putting in a > fix-up, or workaround, in the code here, rather than actually correcting > the memory ordering use. Also, given that the workaround uses subtraction > to detect underflow, are we 100% sure that we have guaranteed correct > behaviour on counter wraparound at uint32_t max? If you look at the code rabbit review demo, it flagged the same possible underflow issue.