Re: Popcount optimization for the slow-path lookups

Nathan Bossart Fri, 05 Dec 2025 07:40:11 -0800

On Fri, Dec 05, 2025 at 03:07:07PM +0200, Andrew Pogrebnoi wrote:
> I want to propose an optimization for pg_popcount32_slow() and
> pg_popcount64_slow() where lookups into pg_number_of_ones[] are made
> branchless. It shows speedup around 58% for uint64 and 35% for uint32 words
> compared to the current, looped version. This is on x86. It is much more
> significant on Arm64 (Apple M1): ~x2.6 for uint64 and ~x2.25 for uint32
> words.
> 
> I probably have to guard the lookup in pg_popcount64_slow() with `#if
> SIZEOF_LONG == 8` like it's done for __builtin_popcountl() in the same
> function. What do you think?
> 
> For benchmarks, I used functions with the copies of the lookup loop from
> pg_popcount32/64_slow(), measuring them against branchless counterparts.
> Then in C++ I pre generated two static vectors with random uint64's and
> uint32's, each of 5M size and ran lookups using google/benchmark [0]. I've
> run it on M1 and x86 Macs.


I don't think the proposed improvements are relevant for either of the
machines you used for your benchmarks.  For x86, we've optimized our
popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
to use Neon or SVE.  And for other systems, we still try to use
__builtin_popcount() and friends in the fallback paths, which IIUC are
available on both gcc and clang (and maybe elsewhere).  IMHO we need to run
the benchmarks on a compiler/architecture combination where it would
actually be used in practice. 

-- 
nathan

Re: Popcount optimization for the slow-path lookups

Reply via email to