> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Tuesday, 21 May 2024 22.17
> 
> This header implements 64 bit counters using atomic
> operations with weak memory ordering, so that they
> are safe against load/store splits on 32 bit platforms.
> 
> Signed-off-by: Stephen Hemminger <step...@networkplumber.org>
> Acked-by: Morten Brørup <m...@smartsharesystems.com>
> ---
>  lib/eal/include/meson.build   |   1 +
>  lib/eal/include/rte_counter.h | 150 ++++++++++++++++++++++++++++++++++
>  2 files changed, 151 insertions(+)
>  create mode 100644 lib/eal/include/rte_counter.h
> 
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index e94b056d46..c070dd0079 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -12,6 +12,7 @@ headers += files(
>          'rte_class.h',
>          'rte_common.h',
>          'rte_compat.h',
> +        'rte_counter.h',
>          'rte_debug.h',
>          'rte_dev.h',
>          'rte_devargs.h',
> diff --git a/lib/eal/include/rte_counter.h b/lib/eal/include/rte_counter.h
> new file mode 100644
> index 0000000000..f0c2b71a6c
> --- /dev/null
> +++ b/lib/eal/include/rte_counter.h
> @@ -0,0 +1,150 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) Stephen Hemminger <step...@networkplumber.org>
> + */
> +
> +#ifndef _RTE_COUNTER_H_
> +#define _RTE_COUNTER_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <stdint.h>
> +#include <rte_compat.h>
> +#include <rte_common.h>
> +#include <rte_stdatomic.h>
> +
> +/**
> + * @file
> + * RTE Counter
> + *
> + * A counter is a 64 bit value that is safe from split reads and writes.
> + * It assumes that only one CPU at a time will update the counter,
> + * while any other CPU may read it.
> + *
> + * This is a weaker subset of full atomic variables.
> + *
> + * The counters are subject to the same restrictions as atomic variables
> + * in packed structures or at unaligned addresses.
> + */
> +
> +#ifdef RTE_ARCH_64
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * On native 64 bit platform, counter is implemented as basic
> + * 64 bit unsigned integer that only increases.
> + */
> +typedef struct {
> +     uint64_t current;
> +     uint64_t offset;
> +} rte_counter64_t;

As discussed in the other thread [1], I strongly prefer keeping "current" and 
"offset" separate, for performance reasons.
Keeping each offset adjacent to its counter requires more cache lines than 
necessary, because the offsets take up space in the hot part of a fast-path 
data structure. E.g. the size_bins[] counters could fit into one cache line 
instead of two.

[1]: 
https://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35e9f...@smartserver.smartshare.dk/

The disadvantage of a slightly larger API is insignificant compared to the 
performance cost of not separating them.
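
To illustrate the layout I have in mind (a hypothetical sketch; the struct and 
field names are mine, not from the patch), the hot counters stay packed 
together and the offsets live in a cold area:

struct pkt_stats_hot {
	uint64_t size_bins[8];	/* 8 counters fill one 64 byte cache line */
	/* ... other hot fields ... */
} __rte_cache_aligned;

struct pkt_stats_cold {
	uint64_t size_bins_offset[8];	/* only touched by read/reset */
};

With the combined struct, the same eight counters would span two cache lines, 
and half of each line would be offset data the fast path never touches.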

> +
> +/**
> + * @internal
> + * Macro to implement read once using stdatomic.
> + * This is a compiler barrier only.
> + */
> +#define __rte_read_once(var)                                         \
> +     rte_atomic_load_explicit((__rte_atomic typeof(&(var)))&(var),   \
> +             rte_memory_order_consume)
> +
> +/**
> + * @internal
> + * Macro to implement write once using stdatomic.
> + * This is a compiler barrier only.
> + */
> +#define __rte_write_once(var, val)                                       \
> +     rte_atomic_store_explicit((__rte_atomic typeof(&(var)))&(var), val, \
> +             rte_memory_order_release)

These macros certainly make the code using them shorter.
But IMHO, they don't improve its readability; quite the opposite. Reviewing 
code that uses atomics is hard, so I prefer having the memory order visible 
directly at the call site, not hidden behind a macro.
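
For comparison, a sketch of the read function with the macros expanded by hand 
(assuming the fields were declared RTE_ATOMIC(uint64_t), so the casts 
disappear); to my eye this is easier to review:

typedef struct {
	RTE_ATOMIC(uint64_t) current;
	RTE_ATOMIC(uint64_t) offset;
} rte_counter64_t;

static inline uint64_t
rte_counter64_read(const rte_counter64_t *counter)
{
	/* both memory orders are visible at the call site */
	return rte_atomic_load_explicit(&counter->current,
			rte_memory_order_consume) -
		rte_atomic_load_explicit(&counter->offset,
			rte_memory_order_consume);
}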

> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Add value to counter.
> + * Assumes this operation is only done by one thread on the object.
> + *
> + * @param counter
> + *    A pointer to the counter.
> + * @param val
> + *    The value to add to the counter.
> + */
> +static inline void
> +rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> +{
> +     counter->current += val;
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Read a counter.
> + * This operation can be done by any thread.
> + *
> + * @param counter
> + *    A pointer to the counter.
> + * @return
> + *  The current value of the counter.
> + */
> +__rte_experimental
> +static inline uint64_t
> +rte_counter64_read(const rte_counter64_t *counter)
> +{
> +     return __rte_read_once(counter->current) - __rte_read_once(counter->offset);

I'm not sure that "current" needs to be read using rte_memory_order_consume 
here; I think rte_memory_order_consume for "offset" and 
rte_memory_order_relaxed (or perhaps just volatile) for "current" suffices. But 
let's settle on the high-level design before we start micro-optimizing. :-)
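
Concretely, under the same sketch assumptions as above, the combination I have 
in mind would be:

static inline uint64_t
rte_counter64_read(const rte_counter64_t *counter)
{
	/* relaxed: a torn load is the only hazard for the hot field */
	return rte_atomic_load_explicit(&counter->current,
			rte_memory_order_relaxed) -
		/* consume: ordered against a concurrent reset */
		rte_atomic_load_explicit(&counter->offset,
			rte_memory_order_consume);
}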

> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Reset a counter to zero.
> + * This operation can be done by any thread.
> + *
> + * @param counter
> + *    A pointer to the counter.
> + */
> +__rte_experimental
> +static inline void
> +rte_counter64_reset(rte_counter64_t *counter)
> +{
> +     __rte_write_once(counter->offset, __rte_read_once(counter->current));
> +}
> +
> +#else
> +
> +/* On 32 bit platform, need to use atomic to avoid load/store tearing */
> +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;

As shown by Godbolt experiments discussed in a previous thread [2], non-tearing 
64 bit counters can be implemented without using atomic instructions on all 32 
bit architectures supported by DPDK. So we should use the counter/offset design 
pattern for RTE_ARCH_32 too.

[2]: 
https://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35e9f...@smartserver.smartshare.dk/
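
A sketch of what I mean, assuming (per the Godbolt experiments) that relaxed 
64 bit atomic load/store compiles to plain, non-tearing instructions on the 32 
bit targets:

/* Same counter/offset struct as on 64 bit platforms. */
typedef struct {
	RTE_ATOMIC(uint64_t) current;
	RTE_ATOMIC(uint64_t) offset;
} rte_counter64_t;

static inline void
rte_counter64_add(rte_counter64_t *counter, uint32_t val)
{
	/* Single writer: a plain load-add-store suffices; only the
	 * store needs to be non-tearing, not a locked read-modify-write.
	 */
	rte_atomic_store_explicit(&counter->current,
			rte_atomic_load_explicit(&counter->current,
				rte_memory_order_relaxed) + val,
			rte_memory_order_relaxed);
}

rte_counter64_read() and rte_counter64_reset() would then be identical to the 
64 bit variants, and the #ifdef could go away entirely.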

> +
> +__rte_experimental
> +static inline void
> +rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> +{
> +     rte_atomic_fetch_add_explicit(counter, val, rte_memory_order_relaxed);
> +}
> +
> +__rte_experimental
> +static inline uint64_t
> +rte_counter64_read(const rte_counter64_t *counter)
> +{
> +     return rte_atomic_load_explicit(counter, rte_memory_order_consume);
> +}
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_reset(rte_counter64_t *counter)
> +{
> +     rte_atomic_store_explicit(counter, 0, rte_memory_order_release);
> +}
> +
> +#endif /* RTE_ARCH_64 */
> +
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_COUNTER_H_ */
> --
> 2.43.0
