On Wed, 22 May 2024 12:01:12 -0700 Tyler Retzlaff <roret...@linux.microsoft.com> wrote:
> On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:
> > > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > > Sent: Wednesday, 22 May 2024 17.38
> > >
> > > On Wed, 22 May 2024 10:31:39 +0200
> > > Morten Brørup <m...@smartsharesystems.com> wrote:
> > >
> > > > > +/* On 32 bit platform, need to use atomic to avoid load/store tearing */
> > > > > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
> > > >
> > > > As shown by Godbolt experiments discussed in a previous thread [2],
> > > > non-tearing 64 bit counters can be implemented without using atomic
> > > > instructions on all 32 bit architectures supported by DPDK. So we should
> > > > use the counter/offset design pattern for RTE_ARCH_32 too.
> > > >
> > > > [2]: https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
> > >
> > >
> > > This code built with -O3 and -m32 on godbolt shows split problem.
> > >
> > > #include <stdint.h>
> > >
> > > typedef uint64_t rte_counter64_t;
> > >
> > > void
> > > rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> > > {
> > >     *counter += val;
> > > }
> > > …
> > >     *counter = val;
> > > }
> > >
> > > rte_counter64_add:
> > >         push    ebx
> > >         mov     eax, DWORD PTR [esp+8]
> > >         xor     ebx, ebx
> > >         mov     ecx, DWORD PTR [esp+12]
> > >         add     DWORD PTR [eax], ecx
> > >         adc     DWORD PTR [eax+4], ebx
> > >         pop     ebx
> > >         ret
> > >
> > > rte_counter64_read:
> > >         mov     eax, DWORD PTR [esp+4]
> > >         mov     edx, DWORD PTR [eax+4]
> > >         mov     eax, DWORD PTR [eax]
> > >         ret
> > > rte_counter64_set:
> > >         movq    xmm0, QWORD PTR [esp+8]
> > >         mov     eax, DWORD PTR [esp+4]
> > >         movq    QWORD PTR [eax], xmm0
> > >         ret
> >
> > Sure, atomic might be required on some 32 bit architectures and/or with
> > some compilers.
>
> in theory i think you should be able to use generic atomics and
> depending on the target you get codegen that works. it might be
> something more expensive on 32-bit and nothing on 64-bit etc..
>
> what's the damage if we just use atomic generic and relaxed ordering? is
> the codegen not optimal?

If we use atomics with relaxed memory order, the compiler for x86 still
generates a locked increment in the fast path. This costs about 100 extra
cycles due to cache and prefetch stalls. This whole endeavor is an attempt
to avoid that.

PS: the locked increment code for 32 bit involves a locked compare-exchange
and a potential retry. We probably don't care about performance on that
platform anymore.
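
To make the distinction concrete, here is a minimal sketch, assuming plain C11 <stdatomic.h> rather than DPDK's RTE_ATOMIC wrappers. The counter64_* names are hypothetical and exist only for this illustration. The first add is the atomic read-modify-write discussed above; the second splits the update into a relaxed load and a relaxed store, which on 64-bit x86 usually compiles to ordinary mov/add with no lock prefix, but is only correct if a single thread ever writes the counter.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical names, for illustration only; not the DPDK API. */
typedef _Atomic uint64_t counter64_t;

/*
 * Atomic read-modify-write. Safe with multiple writers, but on x86
 * compilers typically emit a lock-prefixed instruction for it, even
 * with relaxed ordering.
 */
static inline void
counter64_add_rmw(counter64_t *counter, uint32_t val)
{
	atomic_fetch_add_explicit(counter, val, memory_order_relaxed);
}

/*
 * Separate relaxed load and relaxed store. Each access is untorn,
 * but the update as a whole is not atomic, so this assumes exactly
 * one writer thread per counter.
 */
static inline void
counter64_add_single_writer(counter64_t *counter, uint32_t val)
{
	uint64_t cur = atomic_load_explicit(counter, memory_order_relaxed);

	atomic_store_explicit(counter, cur + val, memory_order_relaxed);
}

/* Readers on other threads use a relaxed load to get an untorn value. */
static inline uint64_t
counter64_read(counter64_t *counter)
{
	return atomic_load_explicit(counter, memory_order_relaxed);
}
```

Comparing the two add functions on godbolt with -O3, with and without -m32, is one way to check what a given compiler actually emits for each target.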