> -----Original Message-----
> From: William Tu <[email protected]>
> Sent: Saturday, June 27, 2020 7:27 PM
> To: Van Haaren, Harry <[email protected]>
> Cc: ovs-dev <[email protected]>; Stokes, Ian <[email protected]>;
> Ilya Maximets <[email protected]>; Federico Iezzi <[email protected]>
> Subject: Re: [PATCH v4 5/7] dpif-lookup: add avx512 gather implementation.
>
> On Thu, Jun 18, 2020 at 9:53 AM Harry van Haaren
> <[email protected]> wrote:
> >
> > This commit adds an AVX-512 dpcls lookup implementation.
> > It uses the AVX-512 SIMD ISA to perform multiple miniflow
> > operations in parallel.
> >
> > To run this implementation, the "avx512f" and "bmi2" ISAs are
> > required. These ISA checks are performed at runtime while
> > probing the subtable implementation. If a CPU does not provide
> > both "avx512f" and "bmi2", then this code does not execute.
> >
> > The avx512 code is built as a seperate static library, with added
> > CFLAGS to enable the required ISA features. By building only this
> > static library with avx512 enabled, it is ensured that the main OVS
> > core library is *not* using avx512, and that OVS continues to run
> > as before on CPUs that do not support avx512.
> >
> > The approach taken in this implementation is to use the
> > gather instruction to access the packet miniflow, allowing
> > any miniflow blocks to be loaded into an AVX-512 register.
> > This maximises the usefulness of the register, and hence this
> > implementation handles any subtable with up to miniflow 8 bits.
> >
> > Note that specialization of these avx512 lookup routines
> > still provides performance value, as the hashing of the
> > resulting data is performed in scalar code, and compile-time
> > loop unrolling occurs when specialized to miniflow bits.
> >
> > Signed-off-by: Harry van Haaren <[email protected]>
> >
> > ---
> >
> > v4:
> > - Remove TODO comment on prio-set command (was accidentally
> > added to this commit in v3)
> > - Fixup v3 changlog to not include #warning comment (William Tu)
> > - Remove #define for debugging in lookup.h
> > - Fix builds on older gcc versions that don't support -mavx512f.
> > Solution involves CC_CHECK and #ifdefs in code (OVS Robot, William Tu)
> >
> > v3:
> > - Improve function name for _any subtable lookup
> > - Use "" include not <> for immintrin.h
> > - Add checks for SSE42 instructions in core OVS for CRC32 based hashing
> > If not available, disable AVX512 lookup implementation as it requires
> > uses CRC32 for hashing, and the hashing algorithm must match core OVS.
> > - Rework ovs_asserts() into function selection time check
> > - Add #define for magic number 8, number of u64 blocks in AVX512 register
> > - Add #if CHECKER around AVX code, sparse doesn't like checking it
> > - Simplify avx512 enabled building, fixes builds with --enable-shared
> > ---
> > configure.ac | 2 +
> > lib/automake.mk | 17 ++
> > lib/dpif-netdev-lookup-avx512-gather.c | 265 +++++++++++++++++++++++++
> > lib/dpif-netdev-lookup.c | 17 ++
> > lib/dpif-netdev-lookup.h | 4 +
> > 5 files changed, 305 insertions(+)
> > create mode 100644 lib/dpif-netdev-lookup-avx512-gather.c
> >
> > diff --git a/configure.ac b/configure.ac
> > index 81893e56e..1367c868b 100644
> > --- a/configure.ac
> > +++ b/configure.ac
> > @@ -178,6 +178,8 @@ OVS_ENABLE_OPTION([-Wno-null-pointer-arithmetic])
> > OVS_ENABLE_OPTION([-Warray-bounds-pointer-arithmetic])
> > OVS_CONDITIONAL_CC_OPTION([-Wno-unused], [HAVE_WNO_UNUSED])
> > OVS_CONDITIONAL_CC_OPTION([-Wno-unused-parameter],
> [HAVE_WNO_UNUSED_PARAMETER])
> > +OVS_CONDITIONAL_CC_OPTION([-mavx512f], [HAVE_AVX512F])
> > +OVS_CHECK_CC_OPTION([-mavx512f], [CFLAGS="$CFLAGS -DHAVE_AVX512F"])
>
> Do you need both checks?
> I thought the first one OVS_CONDITIONAL_CC_OPTION([-mavx512f],
> [HAVE_AVX512F])
> is good enough since at lib/automake.mk, you add the -mavx512f to CFLAGS.
>From testing during development, both are required.
CONDITIONAL_CC_OPTION adds a build-system flag, indicating its present, but
doesn't
seem to add a C #define for it, that can be used for conditional compilation?
The CHECK_CC_OPTION is used to manually add a #define via command-line -D
parameter, it is used to add the avx512_gather probe function in the available
lookup function struct.
There may be a more elegant way to achieve both in the same line, my AC-fu is
somewhat outdated, suggestions welcome if you know of a better method :)
<snip some patch contents>
> > +#include "immintrin.h"
> > +
> > +/* Each AVX512 register (zmm register in assembly notation) can contain up
> to
> > + * 512 bits, which is equivelent to 8 uint64_t variables. This is the
> > maximum
>
> typo: equivalent
Will fix.
> > + * number of miniflow blocks that can be processed in a single pass of the
> > + * AVX512 code at a time.
> > + */
> > +#define NUM_U64_IN_ZMM_REG (8)
> > +#define BLOCKS_CACHE_SIZE (NETDEV_MAX_BURST *
> NUM_U64_IN_ZMM_REG)
> > +
> > +
> > +VLOG_DEFINE_THIS_MODULE(dpif_lookup_avx512_gather);
> > +
> > +static inline __m512i
> > +_mm512_popcnt_epi64_manual(__m512i v_in)
> > +{
> > + static const uint8_t pop_lut[64] = {
> > + 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
> > + 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
> > + 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
> > + 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
> > + };
> > + __m512i v_pop_lut = _mm512_loadu_si512(pop_lut);
> > +
> > + __m512i v_in_srl8 = _mm512_srli_epi64(v_in, 4);
> > + __m512i v_nibble_mask = _mm512_set1_epi8(0xF);
> > + __m512i v_in_lo = _mm512_and_si512(v_in, v_nibble_mask);
> > + __m512i v_in_hi = _mm512_and_si512(v_in_srl8, v_nibble_mask);
> > +
> > + __m512i v_lo_pop = _mm512_shuffle_epi8(v_pop_lut, v_in_lo);
> > + __m512i v_hi_pop = _mm512_shuffle_epi8(v_pop_lut, v_in_hi);
> > + __m512i v_u8_pop = _mm512_add_epi8(v_lo_pop, v_hi_pop);
> > +
> > + return _mm512_sad_epu8(v_u8_pop, _mm512_setzero_si512());
> > +}
>
> I forgot whether you mentioned this or not.
> But why create this manual popcnt?
> Isn't there a _mm512_popcnt_* in the library?
To answer your question directly:
The vector popcount instruction requires AVX512VPOPCNTDQ. Skylake does not
include
the VPOPCNTDQ AVX512 extension. The "_manual" version enables the DPCLS to
execute
on all AVX512 CPUs available today. In future, support for the AVX512 vector
popcount can
be added with little effort.
The intrinsic guide for _mm512_popcnt_epi64() has more details:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=vpopcnt&expand=4368
Note that it lists "CPUID Flags: AVX512VPOPCNTDQ", indicating a requirement on
that ISA level.
It becomes available in the Ice Lake microarchitecture, more ISA details
available here for those interested:
https://software.intel.com/content/www/us/en/develop/download/10th-generation-intel-core-processor-instruction-throughput-and-latency-docs.html
> The rest looks good to me,
> Thanks
Thanks for review.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev