xroche wrote:

You're right -- I verified this and the vectorizer handles the loop case well.

With `-O2 -march=native` (GFNI + AVX2), the `uint64_t[4]` loop and the manually 
unrolled version both produce identical vectorized code: `vpxor ymm` + 
`vgf2p8affineqb` + `vpshufb` + `vpsadbw` + horizontal reduction. The 
`__uint256_t` version actually produces *worse* code: 4x scalar `popcntq` + 
`addl`, because the value lives in GPRs, not vector registers.

With AVX-512 VPOPCNTDQ, same story: the loop gets `vpxor ymm` + `vpopcntq ymm` 
+ `vpmovqb` + `vpsadbw` (8 instructions), while `__uint256_t` stays scalar (11 
instructions).

The 18% speedup I measured was a red herring -- scalar `popcntq` happened to be 
faster than the GFNI-based vector popcount path on the specific test CPU, not a 
real advantage of the type.

I'll remove the Hamming distance claim from the PR description. The stronger 
motivation for `__int256` is arithmetic ergonomics and performance vs 
`_BitInt(256)` (3x for add/sub/bitwise, 1.5x for division), not SIMD popcount.

https://github.com/llvm/llvm-project/pull/182733
_______________________________________________
lldb-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits