Hi Konstantin,
>-----Original Message-----
>From: Ananyev, Konstantin <konstantin.anan...@intel.com>
>Sent: Thursday 8 October 2020 15:40
>To: Medvedkin, Vladimir <vladimir.medved...@intel.com>; Power, Ciara
><ciara.po...@intel.com>; dev@dpdk.org
>Cc: Richardson, Bruce <bruce.richard...@intel.com>; Jerin Jacob
><jer...@marvell.com>; Ruifeng Wang <ruifeng.w...@arm.com>
>Subject: RE: [dpdk-dev] [PATCH v3 18/18] lpm: choose vector path at runtime
>
>>
>> Hi Ciara,
>>
>>
>> On 30/09/2020 14:04, Ciara Power wrote:
>> > When choosing the vector path, max SIMD bitwidth is now checked to
>> > ensure a vector path is allowable. To do this, rather than the
>> > vector lookup functions being called directly from apps, a generic
>> > lookup function is called which will call the vector functions if suitable.
>> >
>> > Signed-off-by: Ciara Power <ciara.po...@intel.com>
>> > ---
>> >  lib/librte_lpm/rte_lpm.h         | 57 ++++++++++++++++++++++++++------
>> >  lib/librte_lpm/rte_lpm_altivec.h |  2 +-
>> >  lib/librte_lpm/rte_lpm_neon.h    |  2 +-
>> >  lib/librte_lpm/rte_lpm_sse.h     |  2 +-
>> >  4 files changed, 50 insertions(+), 13 deletions(-)
>> >
>> > diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h
>> > index 03da2d37e0..edba7cafd5 100644
>> > --- a/lib/librte_lpm/rte_lpm.h
>> > +++ b/lib/librte_lpm/rte_lpm.h
>> > @@ -397,8 +397,18 @@ rte_lpm_lookup_bulk_func(const struct rte_lpm *lpm, const uint32_t *ips,
>> >  /* Mask four results. */
>> >  #define RTE_LPM_MASKX4_RES UINT64_C(0x00ffffff00ffffff)
>> >
>> > +#if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64)
>> > +#include "rte_lpm_neon.h"
>> > +#elif defined(RTE_ARCH_PPC_64)
>> > +#include "rte_lpm_altivec.h"
>> > +#else
>> > +#include "rte_lpm_sse.h"
>> > +#endif
>> > +
>> >  /**
>> > - * Lookup four IP addresses in an LPM table.
>> > + * Lookup four IP addresses in an LPM table individually by calling the
>> > + * lookup function for each ip. This is used when lookupx4 is called but
>> > + * the vector path is not suitable.
>> >   *
>> >   * @param lpm
>> >   *   LPM object handle
>> > @@ -417,16 +427,43 @@ rte_lpm_lookup_bulk_func(const struct rte_lpm *lpm, const uint32_t *ips,
>> >   *   if lookup would fail.
>> >   */
>> >  static inline void
>> > -rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
>> > -	uint32_t defv);
>> > +rte_lpm_lookupx4_scalar(struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
>> > +	uint32_t defv)
>> > +{
>> > +	int i;
>> > +	for (i = 0; i < 4; i++)
>> > +		if (rte_lpm_lookup(lpm, ((uint32_t *) &ip)[i], &hop[i]) < 0)
>> > +			hop[i] = defv; /* lookupx4 expected to set on failure */
>> > +}
>> >
>> > -#if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64)
>> > -#include "rte_lpm_neon.h"
>> > -#elif defined(RTE_ARCH_PPC_64)
>> > -#include "rte_lpm_altivec.h"
>> > -#else
>> > -#include "rte_lpm_sse.h"
>> > -#endif
>> > +/**
>> > + * Lookup four IP addresses in an LPM table.
>> > + *
>> > + * @param lpm
>> > + *   LPM object handle
>> > + * @param ip
>> > + *   Four IPs to be looked up in the LPM table
>> > + * @param hop
>> > + *   Next hop of the most specific rule found for IP (valid on lookup hit only).
>> > + *   This is an 4 elements array of two byte values.
>> > + *   If the lookup was successful for the given IP, then least significant byte
>> > + *   of the corresponding element is the actual next hop and the most
>> > + *   significant byte is zero.
>> > + *   If the lookup for the given IP failed, then corresponding element would
>> > + *   contain default value, see description of then next parameter.
>> > + * @param defv
>> > + *   Default value to populate into corresponding element of hop[] array,
>> > + *   if lookup would fail.
>> > + */
>> > +static inline void
>> > +rte_lpm_lookupx4(struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
>> > +	uint32_t defv)
>> > +{
>> > +	if (rte_get_max_simd_bitwidth() >= RTE_MAX_128_SIMD)
>> > +		rte_lpm_lookupx4_vec(lpm, ip, hop, defv);
>> > +	else
>> > +		rte_lpm_lookupx4_scalar(lpm, ip, hop, defv);
>> > +}
>>
>> I'm afraid this will lead to a drop in performance. rte_lpm_lookupx4
>> is used in the hot path, and a bulk size is too small to amortize the
>> cost of adding this extra logic.
>
>I do share Vladimir's concern regarding performance here.
>As I said in other mail - it seems not much point to insert these checks into
>inline SSE specific function, as SSE is enabled by default for all x86 builds.
>

The performance impact is quite small; thanks, Vladimir, for providing these
results:

before patches:
LPM LookupX4: 25.1 cycles (fails = 12.5%)
LPM LookupX4: 25.2 cycles (fails = 12.5%)
LPM LookupX4: 25.2 cycles (fails = 12.5%)

v3:
LPM LookupX4: 26.2 cycles (fails = 12.5%)
LPM LookupX4: 26.2 cycles (fails = 12.5%)
LPM LookupX4: 26.2 cycles (fails = 12.5%)

v4:
Note: I haven't sent this publicly yet. It modifies v3 slightly to check the
bitwidth in LPM create and set a flag, which lookupx4 then uses to choose
either the vector or the scalar function; see the sketch below.

LPM LookupX4: 25.5 cycles (fails = 12.5%)
LPM LookupX4: 25.5 cycles (fails = 12.5%)
LPM LookupX4: 25.5 cycles (fails = 12.5%)
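In case it helps review, this is roughly the shape of the v4 change. It is
only a sketch, not the actual patch: the "lookupx4_vec_capable" field name is
made up for illustration (it needs <stdbool.h>), and the create-time function
omits all of the existing allocation and initialisation logic.

struct rte_lpm {
	/* ... existing fields ... */
	bool lookupx4_vec_capable; /* set once at create time */
};

struct rte_lpm *
rte_lpm_create(const char *name, int socket_id,
		const struct rte_lpm_config *config)
{
	struct rte_lpm *lpm;

	/* ... existing allocation and initialisation ... */

	/* One-off capability check, hoisted out of the lookup hot path. */
	lpm->lookupx4_vec_capable =
		(rte_get_max_simd_bitwidth() >= RTE_MAX_128_SIMD);

	return lpm;
}

static inline void
rte_lpm_lookupx4(struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
	uint32_t defv)
{
	/* Branch on the cached flag instead of calling
	 * rte_get_max_simd_bitwidth() on every lookup. */
	if (lpm->lookupx4_vec_capable)
		rte_lpm_lookupx4_vec(lpm, ip, hop, defv);
	else
		rte_lpm_lookupx4_scalar(lpm, ip, hop, defv);
}

The only difference from v3 is that rte_get_max_simd_bitwidth() is no longer
called on the hot path; lookupx4 just tests a field that is already in the
LPM structure.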
Thanks,
Ciara

>As another more generic thought - might be better to avoid these checks in
>other public SIMD-specific inline functions (if any).
>If such function get called from some .c, then at least such SIMD ISA is
>already enabled for that .c file and I think this check should be
>left for caller to do.
>
>> >
>> >  #ifdef __cplusplus
>> >  }
>> > diff --git a/lib/librte_lpm/rte_lpm_altivec.h b/lib/librte_lpm/rte_lpm_altivec.h
>> > index 228c41b38e..82142d3351 100644
>> > --- a/lib/librte_lpm/rte_lpm_altivec.h
>> > +++ b/lib/librte_lpm/rte_lpm_altivec.h
>> > @@ -16,7 +16,7 @@ extern "C" {
>> >  #endif
>> >
>> >  static inline void
>> > -rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
>> > +rte_lpm_lookupx4_vec(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
>> >  	uint32_t defv)
>> >  {
>> >  	vector signed int i24;
>> > diff --git a/lib/librte_lpm/rte_lpm_neon.h b/lib/librte_lpm/rte_lpm_neon.h
>> > index 6c131d3125..14b184515d 100644
>> > --- a/lib/librte_lpm/rte_lpm_neon.h
>> > +++ b/lib/librte_lpm/rte_lpm_neon.h
>> > @@ -16,7 +16,7 @@ extern "C" {
>> >  #endif
>> >
>> >  static inline void
>> > -rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
>> > +rte_lpm_lookupx4_vec(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
>> >  	uint32_t defv)
>> >  {
>> >  	uint32x4_t i24;
>> > diff --git a/lib/librte_lpm/rte_lpm_sse.h b/lib/librte_lpm/rte_lpm_sse.h
>> > index 44770b6ff8..cb5477c6cf 100644
>> > --- a/lib/librte_lpm/rte_lpm_sse.h
>> > +++ b/lib/librte_lpm/rte_lpm_sse.h
>> > @@ -15,7 +15,7 @@ extern "C" {
>> >  #endif
>> >
>> >  static inline void
>> > -rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
>> > +rte_lpm_lookupx4_vec(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
>> >  	uint32_t defv)
>> >  {
>> >  	__m128i i24;
>> >
>>
>> --
>> Regards,
>> Vladimir