On Sat, Jan 31, 2026 at 4:33 AM Nathan Bossart <[email protected]> wrote:
>
> On Fri, Jan 30, 2026 at 03:22:45PM +0700, John Naylor wrote:
> > 0001 - I'm pretty sure this is comparable to HEAD if the optimized
> > function is pg_popcount_sse42(). Has the AVX512 version been tested
> > with 8-byte inputs? That seems to have a lot of pre- and
> > post-processing involved. The inline wrapper only bypasses for 7 or
> > less bytes.
>
> Here [0] is the latest perf data I see for the AVX-512 popcount patch,
> although that's comparing to v16, which IIRC lacks a few other inlining
> tricks. There's a chance the SSE4.2 version is faster at that particular
> length. I'm not sure we need to worry about that, but I can do a bit of
> testing if you'd like.

It might be a good idea to do a little new testing, and I see a use for
a special 8-byte path independent of AVX512: v6 seems to regress a
little for single words. It turns out that when gcc turns
__builtin_popcountl into a single instruction, the result is inline, but
when it emits portable bitwise ops, it does so through a call to a
function named __popcountdi2(). That call can be avoided by hand-coding
the bitwise version in C for normal builds (which for 32-bit looks
cleaner anyway), as in the attached 0005.

My laptop here is really too old to make decisions that are
micro-architecture dependent, but with that caveat, I dusted off the
popcount benchmark and added a test for counting bitmapsets (v7-0004,
applies on top of v6):

select drive_bms_num_members(10000000, 1);

master:      13.2 ticks per call
v6:          15.3
v6+v7-0005:  10.8

Again, take this with a grain of salt, but 0005 seems worth looking at.

--
John Naylor
Amazon Web Services

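
For readers unfamiliar with the fallback being discussed, here is a
minimal sketch of a hand-coded, inlinable 64-bit popcount using portable
bitwise ops. This is not the code in 0005: the function names are
hypothetical, and it uses the widely known SWAR bit-counting technique
purely as an illustration of how a build without a popcount instruction
could avoid the out-of-line __popcountdi2() call.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (name not taken from any patch): count set bits in
 * a 64-bit word with plain bitwise arithmetic, so the compiler can inline
 * it instead of emitting a call to a libgcc routine.
 */
static inline int
popcount64_portable(uint64_t x)
{
	/* Each 2-bit field now holds the number of bits set in that field. */
	x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
	/* Sum adjacent 2-bit counts into 4-bit fields. */
	x = (x & UINT64_C(0x3333333333333333)) +
		((x >> 2) & UINT64_C(0x3333333333333333));
	/* Sum adjacent 4-bit counts into bytes (each byte is at most 8). */
	x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
	/* The multiply accumulates all byte counts into the top byte. */
	return (int) ((x * UINT64_C(0x0101010101010101)) >> 56);
}

/* Hypothetical self-check showing expected results for a few inputs. */
void
popcount64_portable_selfcheck(void)
{
	assert(popcount64_portable(UINT64_C(0)) == 0);
	assert(popcount64_portable(UINT64_C(0xFF)) == 8);
	assert(popcount64_portable(UINT64_MAX) == 64);
}
```

Whether something like this beats the builtin depends on the compiler
flags and the micro-architecture, so it would need the same kind of
measurement as the numbers above.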
v7-0005-Bypass-function-call-on-x86.patch.nocfbot
v7-0004-Test-module-for-popcount-plus-bitmapset-RDTSC.patch.nocfbot
