On Mon, Feb 2, 2026, at 5:51 PM, Nathan Bossart wrote:
> On Mon, Feb 02, 2026 at 09:16:42PM +0700, John Naylor wrote:
>> It might be a good idea to do a little new testing, and I see a use
>> for a special 8-byte path independent of AVX512: v6 seems to regress a
>> little for single-words. But, it turns out that when gcc turns
>> __builtin_popcountl into a single instruction, it's inline, but if it
>> emits portable bitwise ops, it does so in a function called
>> __popcountdi2(). That can be avoided by hand-coding in C for normal
>> builds (and for 32-bit looks cleaner anyway), as in the attached 0005.
>
> Oh, interesting.  I looked into this a little more [0].  Both gcc and clang
> generate cnt instructions for aarch64, so we're good there.  However, clang
> on x86-64 generates the bit-twiddling version, and gcc on x86-64 generates
> a call to __popcountdi2() (which I imagine does something similar).  It's
> not until you provide a compiler flag like -march=x86-64-v2 that gcc/clang
> start generating popcnt instructions for x86-64, which makes sense.  0005
> seems like the correct move to me...
>
> [0] https://godbolt.org/z/he3WozG3E
>
> -- 
> nathan

Nathan, John,

Thanks for the focus on this area of the code.  I've been looking into what to
do with popcnt when building on Win11/ARM64/MSVC.  I know that when _MSC_VER
and _M_ARM64 are defined we can make use of the __popcnt(unsigned int) and
__popcnt64(unsigned __int64) intrinsics, which have been available since
VS 2022 17.11.  I checked that combination, and it turns out that the output is
identical to that of clang/gcc on that platform [0].

I'll wait for your work to land before proposing a patch to add these, unless
it is really easy to fit in and you feel like giving it a go. :)
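
For what it's worth, here is a rough sketch of the shape I have in mind.  It is
not a patch: the sketch_popcount32()/sketch_popcount64() names are made up for
illustration and are not from the tree, and the MSVC branch simply assumes the
intrinsics are usable under _MSC_VER && _M_ARM64 as described above.  The
fallback is the usual portable bit-twiddling count, hand-coded in C so that no
out-of-line helper call is involved.

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_ARM64)
#include <intrin.h>
#endif

/*
 * Hypothetical helpers for illustration only; the names are not from the
 * PostgreSQL source tree.
 */
static inline int
sketch_popcount32(uint32_t word)
{
#if defined(_MSC_VER) && defined(_M_ARM64)
	/* Assumption: __popcnt() is available for this compiler/target. */
	return (int) __popcnt(word);
#else
	/* Portable fallback: sum bits in 2-, 4-, then 8-bit groups. */
	word = word - ((word >> 1) & UINT32_C(0x55555555));
	word = (word & UINT32_C(0x33333333)) + ((word >> 2) & UINT32_C(0x33333333));
	word = (word + (word >> 4)) & UINT32_C(0x0F0F0F0F);
	/* The multiply accumulates the four byte counts into the top byte. */
	return (int) ((word * UINT32_C(0x01010101)) >> 24);
#endif
}

static inline int
sketch_popcount64(uint64_t word)
{
#if defined(_MSC_VER) && defined(_M_ARM64)
	/* Assumption: __popcnt64() is available for this compiler/target. */
	return (int) __popcnt64(word);
#else
	/* Same approach as the 32-bit fallback, widened to 64 bits. */
	word = word - ((word >> 1) & UINT64_C(0x5555555555555555));
	word = (word & UINT64_C(0x3333333333333333)) +
		((word >> 2) & UINT64_C(0x3333333333333333));
	word = (word + (word >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
	/* The multiply accumulates the eight byte counts into the top byte. */
	return (int) ((word * UINT64_C(0x0101010101010101)) >> 56);
#endif
}
```

On any other compiler/target the #else branch is what gets built, so the sketch
compiles as-is with gcc or clang.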

best.

-greg

[0] https://godbolt.org/z/TrxjzcPGE

