On Tue, Jun 30, 2020 at 3:01 AM Van Haaren, Harry
<[email protected]> wrote:
>
> > -----Original Message-----
> > From: William Tu <[email protected]>
> > Sent: Saturday, June 27, 2020 7:27 PM
> > To: Van Haaren, Harry <[email protected]>
> > Cc: ovs-dev <[email protected]>; Stokes, Ian <[email protected]>;
> > Ilya Maximets <[email protected]>; Federico Iezzi <[email protected]>
> > Subject: Re: [PATCH v4 5/7] dpif-lookup: add avx512 gather implementation.
> >
> > On Thu, Jun 18, 2020 at 9:53 AM Harry van Haaren
> > <[email protected]> wrote:
> > >
> > > This commit adds an AVX-512 dpcls lookup implementation.
> > > It uses the AVX-512 SIMD ISA to perform multiple miniflow
> > > operations in parallel.
> > >
> > > To run this implementation, the "avx512f" and "bmi2" ISAs are
> > > required. These ISA checks are performed at runtime while
> > > probing the subtable implementation. If a CPU does not provide
> > > both "avx512f" and "bmi2", then this code does not execute.
> > >
> > > The avx512 code is built as a seperate static library, with added
> > > CFLAGS to enable the required ISA features. By building only this
> > > static library with avx512 enabled, it is ensured that the main OVS
> > > core library is *not* using avx512, and that OVS continues to run
> > > as before on CPUs that do not support avx512.
> > >
> > > The approach taken in this implementation is to use the
> > > gather instruction to access the packet miniflow, allowing
> > > any miniflow blocks to be loaded into an AVX-512 register.
> > > This maximises the usefulness of the register, and hence this
> > > implementation handles any subtable with up to miniflow 8 bits.
> > >
> > > Note that specialization of these avx512 lookup routines
> > > still provides performance value, as the hashing of the
> > > resulting data is performed in scalar code, and compile-time
> > > loop unrolling occurs when specialized to miniflow bits.
> > >
> > > Signed-off-by: Harry van Haaren <[email protected]>
> > >
> > > ---
> > >
> > > v4:
> > > - Remove TODO comment on prio-set command (was accidentally
> > >   added to this commit in v3)
> > > - Fixup v3 changlog to not include #warning comment (William Tu)
> > > - Remove #define for debugging in lookup.h
> > > - Fix builds on older gcc versions that don't support -mavx512f.
> > >   Solution involves CC_CHECK and #ifdefs in code (OVS Robot, William Tu)
> > >
> > > v3:
> > > - Improve function name for _any subtable lookup
> > > - Use "" include not <> for immintrin.h
> > > - Add checks for SSE42 instructions in core OVS for CRC32 based hashing
> > >   If not available, disable AVX512 lookup implementation as it requires
> > >   uses CRC32 for hashing, and the hashing algorithm must match core OVS.
> > > - Rework ovs_asserts() into function selection time check
> > > - Add #define for magic number 8, number of u64 blocks in AVX512 register
> > > - Add #if CHECKER around AVX code, sparse doesn't like checking it
> > > - Simplify avx512 enabled building, fixes builds with --enable-shared
> > > ---
> > >  configure.ac                           |   2 +
> > >  lib/automake.mk                        |  17 ++
> > >  lib/dpif-netdev-lookup-avx512-gather.c | 265 +++++++++++++++++++++++++
> > >  lib/dpif-netdev-lookup.c               |  17 ++
> > >  lib/dpif-netdev-lookup.h               |   4 +
> > >  5 files changed, 305 insertions(+)
> > >  create mode 100644 lib/dpif-netdev-lookup-avx512-gather.c
> > >
> > > diff --git a/configure.ac b/configure.ac
> > > index 81893e56e..1367c868b 100644
> > > --- a/configure.ac
> > > +++ b/configure.ac
> > > @@ -178,6 +178,8 @@ OVS_ENABLE_OPTION([-Wno-null-pointer-arithmetic])
> > >  OVS_ENABLE_OPTION([-Warray-bounds-pointer-arithmetic])
> > >  OVS_CONDITIONAL_CC_OPTION([-Wno-unused], [HAVE_WNO_UNUSED])
> > >  OVS_CONDITIONAL_CC_OPTION([-Wno-unused-parameter],
> > [HAVE_WNO_UNUSED_PARAMETER])
> > > +OVS_CONDITIONAL_CC_OPTION([-mavx512f], [HAVE_AVX512F])
> > > +OVS_CHECK_CC_OPTION([-mavx512f], [CFLAGS="$CFLAGS -DHAVE_AVX512F"])
> >
> > Do you need both checks?
> > I thought the first one OVS_CONDITIONAL_CC_OPTION([-mavx512f],
> > [HAVE_AVX512F])
> > is good enough since at lib/automake.mk, you add the -mavx512f to CFLAGS.
>
> From testing during development, both are required.
> CONDITIONAL_CC_OPTION adds a build-system flag, indicating its present, but 
> doesn't
> seem to add a C #define for it, that can be used for conditional compilation?
>
> The CHECK_CC_OPTION is used to manually add a #define via command-line -D 
> parameter, it is used to add the avx512_gather probe function in the 
> available lookup function struct.
>
> There may be a more elegant way to achieve both in the same line, my AC-fu is 
> somewhat outdated, suggestions welcome if you know of a better method :)
>
I see, thanks. I don't know any better way.

> <snip some patch contents>
>
> > > +#include "immintrin.h"
> > > +
> > > +/* Each AVX512 register (zmm register in assembly notation) can contain 
> > > up
> > to
> > > + * 512 bits, which is equivelent to 8 uint64_t variables. This is the 
> > > maximum
> >
> > typo: equivalent
>
> Will fix.
>
>
> > > + * number of miniflow blocks that can be processed in a single pass of 
> > > the
> > > + * AVX512 code at a time.
> > > + */
> > > +#define NUM_U64_IN_ZMM_REG (8)
> > > +#define BLOCKS_CACHE_SIZE (NETDEV_MAX_BURST *
> > NUM_U64_IN_ZMM_REG)
> > > +
> > > +
> > > +VLOG_DEFINE_THIS_MODULE(dpif_lookup_avx512_gather);
> > > +
> > > +static inline __m512i
> > > +_mm512_popcnt_epi64_manual(__m512i v_in)
> > > +{
> > > +    static const uint8_t pop_lut[64] = {
> > > +        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
> > > +        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
> > > +        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
> > > +        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
> > > +    };
> > > +    __m512i v_pop_lut = _mm512_loadu_si512(pop_lut);
> > > +
> > > +    __m512i v_in_srl8 = _mm512_srli_epi64(v_in, 4);
> > > +    __m512i v_nibble_mask = _mm512_set1_epi8(0xF);
> > > +    __m512i v_in_lo = _mm512_and_si512(v_in, v_nibble_mask);
> > > +    __m512i v_in_hi = _mm512_and_si512(v_in_srl8, v_nibble_mask);
> > > +
> > > +    __m512i v_lo_pop = _mm512_shuffle_epi8(v_pop_lut, v_in_lo);
> > > +    __m512i v_hi_pop = _mm512_shuffle_epi8(v_pop_lut, v_in_hi);
> > > +    __m512i v_u8_pop = _mm512_add_epi8(v_lo_pop, v_hi_pop);
> > > +
> > > +    return _mm512_sad_epu8(v_u8_pop, _mm512_setzero_si512());
> > > +}
> >
> > I forgot whether you mentioned this or not.
> > But why create this manual popcnt?
> > Isn't there a _mm512_popcnt_* in the library?
>
> To answer your question directly:
> The vector popcount instruction requires AVX512VPOPCNTDQ. Skylake does not 
> include
> the VPOPCNTDQ AVX512 extension. The "_manual" version enables the DPCLS to 
> execute
> on all AVX512 CPUs available today. In future, support for the AVX512 vector 
> popcount can
> be added with little effort.
>
> The intrinsic guide for   _mm512_popcnt_epi64()  has more details:
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=vpopcnt&expand=4368
>
> Note that it lists "CPUID Flags: AVX512VPOPCNTDQ", indicating a requirement 
> on that ISA level.
> It becomes available in the Ice Lake microarchitecture, more ISA details 
> available here for those interested:
> https://software.intel.com/content/www/us/en/develop/download/10th-generation-intel-core-processor-instruction-throughput-and-latency-docs.html

Thanks
William
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to