On Fri, Dec 05, 2025 at 03:07:07PM +0200, Andrew Pogrebnoi wrote: > I want to propose an optimization for pg_popcount32_slow() and > pg_popcount64_slow() where lookups into pg_number_of_ones[] are made > branchless. It shows speedup around 58% for uint64 and 35% for uint32 words > compared to the current, looped version. This is on x86. It is much more > significant on Arm64 (Apple M1): ~x2.6 for uint64 and ~x2.25 for uint32 > words. > > I probably have to guard the lookup in pg_popcount64_slow() with `#if > SIZEOF_LONG == 8` like it's done for __builtin_popcountl() in the same > function. What do you think? > > For benchmarks, I used functions with the copies of the lookup loop from > pg_popcount32/64_slow(), measuring them against branchless counterparts. > Then in C++ I pre generated two static vectors with random uint64's and > uint32's, each of 5M size and ran lookups using google/benchmark [0]. I've > run it on M1 and x86 Macs.
I don't think the proposed improvements are relevant for either of the machines you used for your benchmarks. For x86, we've optimized our popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it to use Neon or SVE. And for other systems, we still try to use __builtin_popcount() and friends in the fallback paths, which IIUC are available on both gcc and clang (and maybe elsewhere). IMHO we need to run the benchmarks on a compiler/architecture combination where it would actually be used in practice. -- nathan
