> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jer...@marvell.com>
> Sent: Wednesday, August 14, 2019 4:46 PM
> To: Phil Yang (Arm Technology China) <phil.y...@arm.com>;
> tho...@monjalon.net; gage.e...@intel.com; dev@dpdk.org
> Cc: hemant.agra...@nxp.com; Honnappa Nagarahalli
> <honnappa.nagaraha...@arm.com>; Gavin Hu (Arm Technology China)
> <gavin...@arm.com>; nd <n...@arm.com>
> Subject: RE: [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange
>
> > -----Original Message-----
> > From: Phil Yang <phil.y...@arm.com>
> > Sent: Wednesday, August 14, 2019 1:58 PM
> > To: tho...@monjalon.net; Jerin Jacob Kollanukkaran <jer...@marvell.com>;
> > gage.e...@intel.com; dev@dpdk.org
> > Cc: hemant.agra...@nxp.com; honnappa.nagaraha...@arm.com;
> > gavin...@arm.com; n...@arm.com
> > Subject: [EXT] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare
> > exchange
> >
> > +#define __HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) != __ATOMIC_RELEASE)
> > +#define __HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || (mo) == __ATOMIC_ACQ_REL || \
> > +		(mo) == __ATOMIC_SEQ_CST)
> > +
> > +#define __MO_LOAD(mo) (__HAS_ACQ((mo)) ? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED)
> > +#define __MO_STORE(mo) (__HAS_RLS((mo)) ? __ATOMIC_RELEASE : __ATOMIC_RELAXED)
> > +
> > +#if defined(__ARM_FEATURE_ATOMICS) || defined(RTE_ARM_FEATURE_ATOMICS)
> > +#define __ATOMIC128_CAS_OP(cas_op_name, op_string) \
> > +static __rte_noinline rte_int128_t \
>
> Could you check the cost of making it as __rte_noinline?
> If it is costly, How about having two versions, one with __rte_noinline
> to make compliance with arm64 procedure call standard for
> old gcc and clang.
> Other one without explicit register hardcoding + inline for latest
> gcc

Hi Jerin,

According to stack_lf_perf_autotest, making it __rte_noinline adds no
measurable overhead on ThunderX2 with GCC 8.3. The 'Average cycles per
object push/pop' numbers for the __rte_noinline and __rte_always_inline
versions are nearly the same.

Test results:

###### Two NUMA Node ######

#### __rte_noinline ####
RTE>>stack_lf_perf_autotest
<snip>
### Testing using two NUMA nodes ###
Average cycles per object push/pop (bulk size: 8): 24.10
Average cycles per object push/pop (bulk size: 32): 6.85
### Testing on all 18 lcores ###
Average cycles per object push/pop (bulk size: 8): 680.39
Average cycles per object push/pop (bulk size: 32): 146.38
Test OK

#### __rte_always_inline ####
RTE>>stack_lf_perf_autotest
<snip>
### Testing using two NUMA nodes ###
Average cycles per object push/pop (bulk size: 8): 24.29
Average cycles per object push/pop (bulk size: 32): 6.92
### Testing on all 18 lcores ###
Average cycles per object push/pop (bulk size: 8): 683.92
Average cycles per object push/pop (bulk size: 32): 145.11
Test OK

###### Single NUMA ######

#### __rte_always_inline ####
RTE>>stack_lf_perf_autotest
<snip>
### Testing on all 18 lcores ###
Average cycles per object push/pop (bulk size: 8): 582.92
Average cycles per object push/pop (bulk size: 32): 125.57
Test OK

#### __rte_noinline ####
RTE>>stack_lf_perf_autotest
<snip>
### Testing on all 18 lcores ###
Average cycles per object push/pop (bulk size: 8): 537.56
Average cycles per object push/pop (bulk size: 32): 122.98
Test OK

Thanks,
Phil Yang

> >
> > +cas_op_name(rte_int128_t *dst, rte_int128_t old, \
> > +		rte_int128_t updated) \
> > +{ \
> > +	/* caspX instructions register pair must start from even-numbered
> > +	 * register at operand 1.
> > +	 * So, specify registers for local variables here.
> > +	 */ \
> > +	register uint64_t x0 __asm("x0") = (uint64_t)old.val[0]; \
> > +	register uint64_t x1 __asm("x1") = (uint64_t)old.val[1]; \
> > +	register uint64_t x2 __asm("x2") = (uint64_t)updated.val[0]; \
> > +	register uint64_t x3 __asm("x3") = (uint64_t)updated.val[1]; \
> > +	asm volatile( \
> > +		op_string " %[old0], %[old1], %[upd0], %[upd1], [%[dst]]" \
> > +		: [old0] "+r" (x0), \
> > +		  [old1] "+r" (x1) \
> > +		: [upd0] "r" (x2), \
> > +		  [upd1] "r" (x3), \
> > +		  [dst] "r" (dst) \
> > +		: "memory"); \
> > +	old.val[0] = x0; \
> > +	old.val[1] = x1; \
> > +	return old; \
> > +}
> > +
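
For reference, the memory-order helper macros quoted at the top of the
patch can be exercised on their own, independent of the caspX code. The
sketch below copies the four macros as quoted and wraps them in two helper
functions, mo_load() and mo_store(). Those two names, and
mo_split_selfcheck(), are hypothetical and exist only for this
illustration; they are not part of the patch. It assumes a GCC- or
Clang-compatible compiler that predefines the __ATOMIC_* constants. The
macros appear to split one requested ordering into an ordering for the load
half and one for the store half of the operation; how the rest of the patch
consumes them is not shown in this excerpt.

```c
#include <assert.h>

/* The four macros exactly as quoted in the patch. */
#define __HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) != __ATOMIC_RELEASE)
#define __HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || (mo) == __ATOMIC_ACQ_REL || \
		(mo) == __ATOMIC_SEQ_CST)

#define __MO_LOAD(mo) (__HAS_ACQ((mo)) ? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED)
#define __MO_STORE(mo) (__HAS_RLS((mo)) ? __ATOMIC_RELEASE : __ATOMIC_RELAXED)

/* Hypothetical wrapper (not in the patch): ordering for the load half. */
static inline int mo_load(int mo)
{
	return __MO_LOAD(mo);
}

/* Hypothetical wrapper (not in the patch): ordering for the store half. */
static inline int mo_store(int mo)
{
	return __MO_STORE(mo);
}

/* Hypothetical self-check: walks the orderings and asserts the split. */
static inline void mo_split_selfcheck(void)
{
	/* Relaxed requests no ordering on either half. */
	assert(mo_load(__ATOMIC_RELAXED) == __ATOMIC_RELAXED);
	assert(mo_store(__ATOMIC_RELAXED) == __ATOMIC_RELAXED);

	/* Acquire only strengthens the load half. */
	assert(mo_load(__ATOMIC_ACQUIRE) == __ATOMIC_ACQUIRE);
	assert(mo_store(__ATOMIC_ACQUIRE) == __ATOMIC_RELAXED);

	/* Release only strengthens the store half. */
	assert(mo_load(__ATOMIC_RELEASE) == __ATOMIC_RELAXED);
	assert(mo_store(__ATOMIC_RELEASE) == __ATOMIC_RELEASE);

	/* Acq_rel and seq_cst strengthen both halves. */
	assert(mo_load(__ATOMIC_ACQ_REL) == __ATOMIC_ACQUIRE);
	assert(mo_store(__ATOMIC_ACQ_REL) == __ATOMIC_RELEASE);
	assert(mo_load(__ATOMIC_SEQ_CST) == __ATOMIC_ACQUIRE);
	assert(mo_store(__ATOMIC_SEQ_CST) == __ATOMIC_RELEASE);
}
```

Calling mo_split_selfcheck() from any test harness is enough to confirm the
mapping. Note that, as written, __HAS_ACQ() also treats __ATOMIC_CONSUME as
an acquire, so mo_load(__ATOMIC_CONSUME) yields __ATOMIC_ACQUIRE.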