On Wed, 22 May 2024 12:01:12 -0700
Tyler Retzlaff <roret...@linux.microsoft.com> wrote:

> On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:
> > > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > > Sent: Wednesday, 22 May 2024 17.38
> > > 
> > > On Wed, 22 May 2024 10:31:39 +0200
> > > Morten Brørup <m...@smartsharesystems.com> wrote:
> > >   
> > > > > +/* On 32 bit platform, need to use atomic to avoid load/store  
> > > tearing */  
> > > > > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;  
> > > >
> > > > As shown by Godbolt experiments discussed in a previous thread [2],  
> > > non-tearing 64 bit counters can be implemented without using atomic
> > > instructions on all 32 bit architectures supported by DPDK. So we should
> > > use the counter/offset design pattern for RTE_ARCH_32 too.  
> > > >
> > > > [2]:  
> > > https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smarts
> > > erver.smartshare.dk/
> > > 
> > > 
> > > This code built with -O3 and -m32 on godbolt shows split problem.
> > > 
> > > #include <stdint.h>
> > > 
> > > typedef uint64_t rte_counter64_t;
> > > 
> > > void
> > > rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> > > {
> > >   *counter += val;
> > > }
> > > … *counter = val;
> > > }
> > > 
> > > rte_counter64_add:
> > >         push    ebx
> > >         mov     eax, DWORD PTR [esp+8]
> > >         xor     ebx, ebx
> > >         mov     ecx, DWORD PTR [esp+12]
> > >         add     DWORD PTR [eax], ecx
> > >         adc     DWORD PTR [eax+4], ebx
> > >         pop     ebx
> > >         ret
> > > 
> > > rte_counter64_read:
> > >         mov     eax, DWORD PTR [esp+4]
> > >         mov     edx, DWORD PTR [eax+4]
> > >         mov     eax, DWORD PTR [eax]
> > >         ret
> > > rte_counter64_set:
> > >         movq    xmm0, QWORD PTR [esp+8]
> > >         mov     eax, DWORD PTR [esp+4]
> > >         movq    QWORD PTR [eax], xmm0
> > >         ret  
> > 
> > Sure, atomic might be required on some 32 bit architectures and/or with 
> > some compilers.  
> 
> in theory i think you should be able to use generic atomics and
> depending on the target you get codegen that works. it might be
> something more expensive on 32-bit and nothing on 64-bit etc..
> 
> what's the damage if we just use atomic generic and relaxed ordering? is
> the codegen not optimal?

If we use atomic with relaxed memory order, then compiler for x86 still 
generates
a locked increment in the fast path. This costs about 100 extra cycles due
to cache and prefetch stall. This whole endeavor is an attempt to avoid that.

PS: looking at the locked increment code for 32 bit involves locked compare
exchange and potential retry. Probably don't care about performance on that 
platform
anymore.


Reply via email to