On 2024-05-22 21:01, Tyler Retzlaff wrote:
On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:
From: Stephen Hemminger [mailto:step...@networkplumber.org]
Sent: Wednesday, 22 May 2024 17.38

On Wed, 22 May 2024 10:31:39 +0200
Morten Brørup <m...@smartsharesystems.com> wrote:

+/* On 32 bit platform, need to use atomic to avoid load/store tearing */
+typedef RTE_ATOMIC(uint64_t) rte_counter64_t;

As shown by Godbolt experiments discussed in a previous thread [2],
non-tearing 64 bit counters can be implemented without using atomic
instructions on all 32 bit architectures supported by DPDK. So we should
use the counter/offset design pattern for RTE_ARCH_32 too.

[2]:
https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
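
As a concrete illustration of the claim (a hedged sketch, not from the patch; the per-architecture details are in the linked thread): a relaxed atomic load is not a locked instruction, so on a 32 bit target with a native 64 bit load it costs the same as a plain load while still being non-tearing.

#include <stdint.h>

/* Reader side only: on 32 bit x86 with SSE2, a relaxed 64 bit load
 * compiles to a single movq - one non-tearing plain load, no lock
 * prefix. Other 32 bit ISAs with a native 64 bit load behave similarly. */
static inline uint64_t
counter64_read(const uint64_t *counter)
{
        return __atomic_load_n(counter, __ATOMIC_RELAXED);
}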


This code, built with -O3 and -m32 on godbolt, shows the split (tearing) problem:

#include <stdint.h>

typedef uint64_t rte_counter64_t;

void
rte_counter64_add(rte_counter64_t *counter, uint32_t val)
{
        *counter += val;
}
uint64_t
rte_counter64_read(const rte_counter64_t *counter)
{
        return *counter;
}

void
rte_counter64_set(rte_counter64_t *counter, uint64_t val)
{
        *counter = val;
}

rte_counter64_add:
         push    ebx
         mov     eax, DWORD PTR [esp+8]
         xor     ebx, ebx
         mov     ecx, DWORD PTR [esp+12]
         add     DWORD PTR [eax], ecx        # low half updated first...
         adc     DWORD PTR [eax+4], ebx      # ...high half separately:
                                             # the 64 bit update can tear
         pop     ebx
         ret

rte_counter64_read:
         mov     eax, DWORD PTR [esp+4]
         mov     edx, DWORD PTR [eax+4]      # high half
         mov     eax, DWORD PTR [eax]        # low half: two loads, can tear
         ret

rte_counter64_set:
         movq    xmm0, QWORD PTR [esp+8]
         mov     eax, DWORD PTR [esp+4]
         movq    QWORD PTR [eax], xmm0       # single SSE2 64 bit store:
                                             # happens to be non-tearing
         ret

Sure, atomic might be required on some 32 bit architectures and/or with some 
compilers.

In theory, I think you should be able to use generic atomics and, depending on
the target, get codegen that works. It might be something more expensive on
32-bit and nothing on 64-bit, etc.

What's the damage if we just use generic atomics with relaxed ordering? Is
the codegen not optimal?
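
A minimal sketch of that approach, assuming C11 stdatomic (illustrative only, not the patch code):

#include <stdatomic.h>
#include <stdint.h>

typedef _Atomic uint64_t rte_counter64_t;

static inline void
rte_counter64_add(rte_counter64_t *counter, uint32_t val)
{
    /* Relaxed RMW: portable and correct, but on 32 bit x86 a 64 bit
     * fetch-add compiles to a lock cmpxchg8b retry loop - much more
     * expensive than the plain add/adc sequence shown above. */
    atomic_fetch_add_explicit(counter, val, memory_order_relaxed);
}

static inline uint64_t
rte_counter64_read(rte_counter64_t *counter)
{
    /* Relaxed load: cheap wherever the ISA has a native 64 bit load. */
    return atomic_load_explicit(counter, memory_order_relaxed);
}

So the damage is concentrated in the update path: the relaxed RMW is where 32 bit targets pay, while the loads and stores stay cheap.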

Below is what I originally proposed in the "make stats reset reliable" thread.

struct counter
{
    uint64_t count;
    uint64_t offset;
};

/../
    struct counter rx_pkts;
    struct counter rx_bytes;
/../

static uint64_t
counter_value(const struct counter *counter)
{
    uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
    uint64_t offset = __atomic_load_n(&counter->offset, __ATOMIC_RELAXED);

    return count - offset;
}

static void
counter_reset(struct counter *counter)
{
    uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);

    __atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
}

static void
counter_add(struct counter *counter, uint64_t operand)
{
    __atomic_store_n(&counter->count, counter->count + operand,
        __ATOMIC_RELAXED);
}

I think this solution generally compiles to something that's equivalent to just using non-atomic loads/stores and hoping for the best.

Using a non-atomic load in counter_add() will generate better code, but that doesn't work if you are using _Atomic (without casts).

Atomic loads/stores seem to have volatile semantics, so multiple updates to the same counter cannot be merged. That is a drawback.
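
To make the _Atomic point concrete, a hedged sketch (the counter_c11 names are made up): with C11 _Atomic members, the plain read in counter_add() would itself become an implicitly seq_cst atomic load, so the relaxed formulation has to spell the load out (or cast the qualifier away):

#include <stdatomic.h>
#include <stdint.h>

struct counter_c11 {
    _Atomic uint64_t count;
    _Atomic uint64_t offset;
};

static void
counter_c11_add(struct counter_c11 *counter, uint64_t operand)
{
    /* With _Atomic, writing "counter->count + operand" directly would
     * perform a seq_cst load; the explicit form keeps it relaxed. */
    uint64_t count = atomic_load_explicit(&counter->count,
                                          memory_order_relaxed);

    atomic_store_explicit(&counter->count, count + operand,
                          memory_order_relaxed);
}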

I envision a variety of 32 bit implementations, optimized for certain 
architectures/compilers.

Some of them can provide non-tearing 64 bit load/store, so we should also use 
the counter/offset design pattern for those.
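
As a rough sketch of how such per-architecture selection could look (the #ifdef structure and the 32 bit fallback are assumptions, not part of the patch):

#include <stdint.h>

typedef uint64_t rte_counter64_t;

#ifdef RTE_ARCH_64
/* 64 bit: plain loads/stores are already non-tearing. */
static inline void
rte_counter64_add(rte_counter64_t *counter, uint32_t val)
{
    *counter += val;
}
#else
/* 32 bit with a native 64 bit load/store: route the store through a
 * relaxed atomic so the compiler emits one non-tearing instruction;
 * architectures without such an instruction would need yet another
 * variant. The counter/offset reset pattern above applies unchanged. */
static inline void
rte_counter64_add(rte_counter64_t *counter, uint32_t val)
{
    __atomic_store_n(counter, *counter + val, __ATOMIC_RELAXED);
}
#endif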
