> > -----Original Message----- > > From: Stokes, Ian <[email protected]> > > Sent: Wednesday, June 9, 2021 4:56 PM > > To: Ferriter, Cian <[email protected]>; [email protected]; Van > > Haaren, Harry <[email protected]> > > Cc: [email protected] > > Subject: RE: [ovs-dev] [v12 14/16] dpcls-avx512: Enable avx512 vector > popcount > > instruction. > > > > > This commit enables the AVX512-VPOPCNTDQ Vector Popcount > > > instruction. This instruction is not available on every CPU > > > that supports the AVX512-F Foundation ISA, hence it is enabled > > > only when the additional VPOPCNTDQ ISA check is passed. > > > > > > The vector popcount instruction is used instead of the AVX512 > > > popcount emulation code present in the avx512 optimized DPCLS today. > > > It provides higher performance in the SIMD miniflow processing > > > as that requires the popcount to calculate the miniflow block indexes. > > > > > > Signed-off-by: Harry van Haaren <[email protected]> > > > > Thanks for the patch Harry/Cian. > > > > A few comments inline below. > > > > > > --- > > > > > > v8: Add NEWS entry. > > > --- > > > NEWS | 3 + > > > lib/dpdk.c | 1 + > > > lib/dpif-netdev-lookup-avx512-gather.c | 84 ++++++++++++++++++++------ > > > 3 files changed, 70 insertions(+), 18 deletions(-) > > > > > > diff --git a/NEWS b/NEWS > > > index c71273ddd..d04dac746 100644 > > > --- a/NEWS > > > +++ b/NEWS > > > @@ -14,6 +14,9 @@ Post-v2.15.0 > > > * Enable AVX512 optimized DPCLS to search subtables with larger > > > miniflows. > > > * Add more specialized DPCLS subtables to cover common rules, > > > enhancing > > > the lookup performance. > > > + * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction > > > if the > > > + CPU supports it. This enhances performance by using the native > > > vpopcount > > > + instructions, instead of the emulated version of vpopcount. > > > - ovs-ctl: > > > * New option '--no-record-hostname' to disable hostname > > > configuration > > > in ovsdb on startup. > > > diff --git a/lib/dpdk.c b/lib/dpdk.c > > > index c883a4b8b..a9494a40f 100644 > > > --- a/lib/dpdk.c > > > +++ b/lib/dpdk.c > > > @@ -655,6 +655,7 @@ dpdk_get_cpu_has_isa(const char *arch, const char > > > *feature) > > > #if __x86_64__ > > > /* CPU flags only defined for the architecture that support it. */ > > > CHECK_CPU_FEATURE(feature, "avx512f", RTE_CPUFLAG_AVX512F); > > > + CHECK_CPU_FEATURE(feature, "avx512vpopcntdq", > > > RTE_CPUFLAG_AVX512VPOPCNTDQ); > > > CHECK_CPU_FEATURE(feature, "bmi2", RTE_CPUFLAG_BMI2); > > > #endif > > > > > > diff --git a/lib/dpif-netdev-lookup-avx512-gather.c > > > b/lib/dpif-netdev-lookup- > > > avx512-gather.c > > > index 7adf29914..c338c2fcd 100644 > > > --- a/lib/dpif-netdev-lookup-avx512-gather.c > > > +++ b/lib/dpif-netdev-lookup-avx512-gather.c > > > @@ -53,6 +53,15 @@ > > > > > > VLOG_DEFINE_THIS_MODULE(dpif_lookup_avx512_gather); > > > > > > + > > No need for the extra whitespace added above. > > Ack, can fix. > > > > > +/* Wrapper function required to enable ISA. */ > > > +static inline __m512i > > > +__attribute__((__target__("avx512vpopcntdq"))) > > > +_mm512_popcnt_epi64_wrapper(__m512i v_in) > > > +{ > > > + return _mm512_popcnt_epi64(v_in); > > > +} > > > + > > > static inline __m512i > > > _mm512_popcnt_epi64_manual(__m512i v_in) > > > { > > > @@ -126,7 +135,8 @@ avx512_blocks_gather(__m512i v_u0, /* reg of u64 > of > > > all u0 bits */ > > > __mmask64 u1_bcast_msk, /* mask of u1 lanes */ > > > const uint64_t pkt_mf_u0_pop, /* num bits in u0 of > > > pkt */ > > > __mmask64 zero_mask, /* maskz if pkt not have mf > > > bit */ > > > - __mmask64 u64_lanes_mask) /* total lane count to > > > use */ > > > + __mmask64 u64_lanes_mask, /* total lane count to > > > use */ > > > + const uint32_t use_vpop) /* use AVX512 vpopcntdq */ > > > { > > > /* Suggest to compiler to load tbl blocks ahead of gather(). */ > > > __m512i v_tbl_blocks = _mm512_maskz_loadu_epi64(u64_lanes_mask, > > > @@ -140,8 +150,15 @@ avx512_blocks_gather(__m512i v_u0, /* reg of u64 > > > of all u0 bits */ > > > tbl_mf_masks); > > > __m512i v_masks = _mm512_and_si512(v_pkt_bits, v_tbl_masks); > > > > > > - /* Manual AVX512 popcount for u64 lanes. */ > > > - __m512i v_popcnts = _mm512_popcnt_epi64_manual(v_masks); > > > + /* Calculate AVX512 popcount for u64 lanes using the native > instruction > > > + * if available, or using emulation if not available. > > > + */ > > > + __m512i v_popcnts; > > > + if (use_vpop) { > > > + v_popcnts = _mm512_popcnt_epi64_wrapper(v_masks); > > > + } else { > > > + v_popcnts = _mm512_popcnt_epi64_manual(v_masks); > > > + } > > > > > > /* Add popcounts and offset for u1 bits. */ > > > __m512i v_idx_u0_offset = _mm512_maskz_set1_epi64(u1_bcast_msk, > > > @@ -166,7 +183,8 @@ avx512_lookup_impl(struct dpcls_subtable > *subtable, > > > const struct netdev_flow_key *keys[], > > > struct dpcls_rule **rules, > > > const uint32_t bit_count_u0, > > > - const uint32_t bit_count_u1) > > > + const uint32_t bit_count_u1, > > > + const uint32_t use_vpop) > > > { > > > OVS_ALIGNED_VAR(CACHE_LINE_SIZE)uint64_t > > > block_cache[BLOCKS_CACHE_SIZE]; > > > uint32_t hashes[NETDEV_MAX_BURST]; > > > @@ -218,7 +236,8 @@ avx512_lookup_impl(struct dpcls_subtable > *subtable, > > > u1_bcast_mask, > > > pkt_mf_u0_pop, > > > zero_mask, > > > - bit_count_total_mask); > > > + bit_count_total_mask, > > > + use_vpop); > > > _mm512_storeu_si512(&block_cache[i * MF_BLOCKS_PER_PACKET], > > > v_blocks); > > > > > > if (bit_count_total > 8) { > > > @@ -239,7 +258,8 @@ avx512_lookup_impl(struct dpcls_subtable > *subtable, > > > u1_bcast_mask_gt8, > > > pkt_mf_u0_pop, > > > zero_mask_gt8, > > > - bit_count_gt8_mask); > > > + bit_count_gt8_mask, > > > + use_vpop); > > > _mm512_storeu_si512(&block_cache[(i * MF_BLOCKS_PER_PACKET) > > > + 8], > > > v_blocks_gt8); > > > } > > > @@ -288,7 +308,11 @@ avx512_lookup_impl(struct dpcls_subtable > > > *subtable, > > > return found_map; > > > } > > > > > > -/* Expand out specialized functions with U0 and U1 bit attributes. */ > > > +/* Expand out specialized functions with U0 and U1 bit attributes. As the > > > + * AVX512 vpopcnt instruction is not supported on all AVX512 capable > > > CPUs, > > > + * create two functions for each miniflow signature. This allows the > > > runtime > > > + * CPU detection in probe() to select the ideal implementation. > > > + */ > > > > I'm trying to think is there a cleaner way of implementing this rather than > having two > > functions but I'm not sure. > > > > On one hand the functions use the (mostly) same implementation except for > the > > vpop check. > > > > Was there any thoughts on just implementing the one function and having a > dynamic > > check within that? > > Or did that impact on the performance too much? > > > > On the other hand I do like the approach of the single variable vpop. > > Certainly > makes > > it clearer to myself at least of whether the instruction gets used or not > > and an > easy > > point to debug if required in the future. > > > > When selecting the vpop implementation, is it flagged to the user at any > > stage > that > > vpop will be used? > > The big part of the question here is "what will the compiler allow". > So a compiler will *not* insert the vpopcnt instruction into a function > that does not explicitly enable the instruction. > > The danger here is that if we *do* enable avx512-vpopcnt for the whole > function, > the compiler is *technically* allowed to just use the instruction regardless > of the > use_vpopcnt variable, as it could identify that the _manual() version > achieves the > same thing as the actual vpopcnt, and hence just always call vpopcnt. > > So the only way to have the compiler be happy, and get correctness, is to > ensure > that the compiler *does* have vpopcnt for one function, and *does not* have > that ISA available for the other implementation.
Understood, had a feeling there was more to this than met the eye 😊. > > There's some trickery going on with inlining functions with different ISAs, to > avoid > code-duplication in the generic code. The nice side-effect of this is that > indeed > the > function is branch-free on how it does its vpop-counting :) > > In my opinion this code is the best it can be. Regards, -Harry Agreed. Thanks for the detailed explanation. Ian _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
