> This patch replaces use of the deprecated rte_atomic32 code with
> GCC builtin atomic operations.
>
> Although it would be preferable to use C11 version on all architectures,
> there is a performance loss if we do it that way:
>
> Measured on i9-13900H, two physical cores MP/MC bulk n=128, 10 runs:
> with C11 builtin: 5.86 cycles/elem
> with __sync builtin: 5.36 cycles/elem (-9.4%)
>
> The C11 __atomic_compare_exchange_n builtin writes the actual value back
> to its expected pointer on failure. On x86 this forces GCC
> to emit extra instructions on the critical path between the CAS
> and the success-test.
>
> __sync_bool_compare_and_swap returns a plain bool with no pointer
> writeback, allowing GCC to emit tighter code.
>
> Signed-off-by: Stephen Hemminger <[email protected]>
> ---
> lib/ring/meson.build | 2 +-
> lib/ring/rte_ring_c11_pvt.h | 3 +-
> lib/ring/rte_ring_elem_pvt.h | 2 +-
> ..._ring_generic_pvt.h => rte_ring_gcc_pvt.h} | 37 +++++++++++--------
> 4 files changed, 24 insertions(+), 20 deletions(-)
> rename lib/ring/{rte_ring_generic_pvt.h => rte_ring_gcc_pvt.h} (87%)
>
> diff --git a/lib/ring/meson.build b/lib/ring/meson.build
> index 21f2c12989..2ba160b178 100644
> --- a/lib/ring/meson.build
> +++ b/lib/ring/meson.build
> @@ -9,7 +9,7 @@ indirect_headers += files (
> 'rte_ring_elem.h',
> 'rte_ring_elem_pvt.h',
> 'rte_ring_c11_pvt.h',
> - 'rte_ring_generic_pvt.h',
> + 'rte_ring_gcc_pvt.h',
> 'rte_ring_hts.h',
> 'rte_ring_hts_elem_pvt.h',
> 'rte_ring_peek.h',
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index 5afc14dec9..8358b0f21f 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -43,7 +43,6 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> uint32_t old_val,
> */
> rte_atomic_store_explicit(&ht->tail, new_val,
> rte_memory_order_release);
> }
> -
> /**
> * @internal This is a helper function that moves the producer/consumer head
> * optimized for single threaded case
> @@ -82,7 +81,7 @@ __rte_ring_headtail_move_head_st(struct rte_ring_headtail
> *d,
> /* Single producer: only this thread writes d->head,
> * so a relaxed load is sufficient.
> */
> - *old_head = rte_atomic_load_explicit(&d->head,
> rte_memory_order_relaxed);
> + *old_head = rte_atomic_load_explicit(&d->head,
> rte_memory_order_acquire);
Not sure, why it had changed to 'acquire' here?
Looks like just patch splitting mistake, no?
>
> /* Acquire pairs with the consumer's release-store of tail in
> __rte_ring_update_tail,
> * ensuring the consumer's ring-element reads are complete before