> -----Original Message----- > From: Nathan Bossart <[email protected]> > Sent: Thursday, March 7, 2024 1:36 PM > Subject: Re: Popcount optimization using AVX512
I will be splitting the request into 2 patches. I am attaching the first patch
(refactoring only) and I updated the commitfest entry to match this patch. I
have a question however:
Do I need to wait for the refactor patch to be merged before I post the AVX
portion of this feature in this thread?
> > + PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
>
> I'm curious why we need both -mavx512vpopcntdq and -mavx512f. On my
> machine, -mavx512vpopcntdq alone is enough to pass this test, so if there are
> other instructions required that need -mavx512f, then we might need to
> expand the test.
First, nice catch on the required flags to build! When I changed my algorithm,
dependence on the -mavx512f flag was no longer needed, In the second patch (AVX
specific) I will fix this.
> I still think it's worth breaking this change into at least 2 patches. In
> particular,
> I think there's an opportunity to do the refactoring into pg_popcnt_choose.c
> and pg_popcnt_x86_64_accel.c prior to adding the AVX512 stuff. These
> changes are likely straightforward, and getting them out of the way early
> would make it easier to focus on the more interesting changes. IMHO there
> are a lot of moving parts in this patch.
As stated above I am doing this in 2 patches. :)
> > +#undef HAVE__GET_CPUID_COUNT
> > +
> > +/* Define to 1 if you have immintrin. */ #undef HAVE__IMMINTRIN
>
> Is this missing HAVE__CPUIDEX?
Yes I missed it, I will include in the second patch (AVX specific) of the 2
patches.
> > uint64
> > -pg_popcount(const char *buf, int bytes)
> > +pg_popcount_slow(const char *buf, int bytes)
> > {
> > uint64 popcnt = 0;
> >
> > -#if SIZEOF_VOID_P >= 8
> > +#if SIZEOF_VOID_P == 8
> > /* Process in 64-bit chunks if the buffer is aligned. */
> > if (buf == (const char *) TYPEALIGN(8, buf))
> > {
> > @@ -311,7 +224,7 @@ pg_popcount(const char *buf, int bytes)
> >
> > buf = (const char *) words;
> > }
> > -#else
> > +#elif SIZEOF_VOID_P == 4
> > /* Process in 32-bit chunks if the buffer is aligned. */
> > if (buf == (const char *) TYPEALIGN(4, buf))
> > {
>
> Apologies for harping on this, but I'm still not seeing the need for these
> SIZEOF_VOID_P changes. While it's unlikely that this makes any practical
> difference, I see no reason to more strictly check SIZEOF_VOID_P here.
I got rid of the second occurrence as I agree it is not needed but unless you
see something I don't how to know which function to call between a 32-bit and
64-bit architecture? Maybe I am missing something obvious? What exactly do you
suggest here? I am happy to always call either pg_popcount32() or
pg_popcount64() with the understanding that it may not be optimal, but I do
need to know which to use.
> > + /* Process any remaining bytes */
> > + while (bytes--)
> > + popcnt += pg_number_of_ones[(unsigned char) *buf++];
> > + return popcnt;
> > +#else
> > + return pg_popcount_slow(buf, bytes);
> > +#endif /* USE_AVX512_CODE */
>
> nitpick: Could we call pg_popcount_slow() in a common section for these
> "remaining bytes?"
Agreed, will fix in the second patch as well.
> > +#if defined(_MSC_VER)
> > + pg_popcount_indirect = pg_popcount512_fast; #else
> > + pg_popcount = pg_popcount512_fast; #endif
> These _MSC_VER sections are interesting. I'm assuming this is the
> workaround for the MSVC linking issue you mentioned above. I haven't
> looked too closely, but I wonder if the CRC32C code (see
> src/include/port/pg_crc32c.h) is doing something different to avoid this
> issue.
Using the latest master branch, I see what the needed changes are, I will
implement using PGDLLIMPORT macro in the second patch.
> Upthread, Alvaro suggested a benchmark [0] that might be useful. I scanned
> through this thread and didn't see any recent benchmark results for the latest
> form of the patch. I think it's worth verifying that we are still seeing the
> expected improvements.
I will get new benchmarks using the same process I used before (from Akash) so
I get apples to apples. These are pending completion of the second patch which
is still in progress.
Just a reminder, I asked questions above about 1) multi-part dependent patches
and, 2) What specifically to do about the SIZE_VOID_P checks. :)
Thanks,
Paul
v7-0001-Refactor-POPCNT.patch
Description: v7-0001-Refactor-POPCNT.patch
