On Thu, Jan 22, 2026 at 04:50:26PM +0700, John Naylor wrote:
> 1) Nowadays, the only global call sites of the word-sized functions
> are select_best_grantor() and in bitmapsets. The latter calls the
> word-sized functions in a loop (could be just one word). It may be
> more efficient to calculate the size in bytes and call pg_popcount().
Yeah, these seem like obvious places to use pg_popcount().  Note that
bms_member_index() does a final popcount on a masked version of the last
word.  We could swap that with pg_popcount(), too, but it might be slower
than just calling the word-sized function.  However, it could be hard to
tell the difference, as we'd be trading a function or function-pointer
call for an inlined loop over pg_number_of_ones.  And even if it is
slower, I'm not sure it matters all that much in the grand scheme of
things.  In any case, 0001 gets the easy ones out of the way.

> Then we could get rid of all the pointer indirection for the
> word-sized functions.

Do you mean that we'd just keep the portable ones around?  I see some
code in pgvector that might be negatively impacted by that, but if I
understand correctly it would require an unusual setup.

> 2) The x86 byte buffer variants expend a lot of effort to detect
> whether the buffer is aligned on both 64- and 32-bit platforms, with
> an optimized path for each. At least 64-bit doesn't care about
> alignment, and 32-bit doesn't warrant anything fancier than pure C.
> Simultaneously, the aarch64 equivalent doesn't seem to take care about
> alignment. (I think Nathan mentioned he didn't see a difference during
> testing, but I wonder how universal that is).

0002 makes these changes for pg_popcount_sse42() and
pg_popcount_masked_sse42().  It does seem strange to prefer a loop over
pg_number_of_ones instead of using POPCNTQ when unaligned, but perhaps
it's worth testing.  I do recall the alignment stuff in the AVX-512 code
showing benefits in tests because it avoids double-load overhead, so I
think we should keep that for now.

> 3) There is repeated code for the <8 bytes case, and the tail of the
> "optimized" functions. I'm also not sure why the small case is inlined
> everywhere.

This is intended to help avoid function call and SIMD instruction
overhead when it doesn't make sense to take it.
I recall this showing a rather big difference in benchmarks when we were
working on the AVX-512 versions.

Regarding the duplicated code, sure, we could add some static inline
functions or something.  I think the only reason I haven't done so is
because it's ~2 lines of code.

-- 
nathan
>From f0cfa4c535b6658bc6e5f63c068e606f28ef9acd Mon Sep 17 00:00:00 2001
From: Nathan Bossart <[email protected]>
Date: Thu, 22 Jan 2026 11:16:09 -0600
Subject: [PATCH v3 1/2] Make use of pg_popcount() in more places.

---
 src/backend/nodes/bitmapset.c | 27 ++++-----------------------
 src/include/lib/radixtree.h   |  4 ++--
 2 files changed, 6 insertions(+), 25 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index a4765876c31..23c91fdb6c9 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -553,14 +553,8 @@ bms_member_index(Bitmapset *a, int x)
 	bitnum = BITNUM(x);
 
 	/* count bits in preceding words */
-	for (int i = 0; i < wordnum; i++)
-	{
-		bitmapword	w = a->words[i];
-
-		/* No need to count the bits in a zero word */
-		if (w != 0)
-			result += bmw_popcount(w);
-	}
+	result += pg_popcount((const char *) a->words,
+						  wordnum * sizeof(bitmapword));
 
 	/*
 	 * Now add bits of the last word, but only those before the item.  We can
@@ -749,26 +743,13 @@ bms_get_singleton_member(const Bitmapset *a, int *member)
 int
 bms_num_members(const Bitmapset *a)
 {
-	int			result = 0;
-	int			nwords;
-	int			wordnum;
-
 	Assert(bms_is_valid_set(a));
 
 	if (a == NULL)
 		return 0;
 
-	nwords = a->nwords;
-	wordnum = 0;
-	do
-	{
-		bitmapword	w = a->words[wordnum];
-
-		/* No need to count the bits in a zero word */
-		if (w != 0)
-			result += bmw_popcount(w);
-	} while (++wordnum < nwords);
-	return result;
+	return pg_popcount((const char *) a->words,
+					   a->nwords * sizeof(bitmapword));
 }
 
 /*
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index b223ce10a2d..1425654a67c 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -2725,8 +2725,8 @@ RT_VERIFY_NODE(RT_NODE * node)
 
 				/* RT_DUMP_NODE(node); */
 
-				for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
-					cnt += bmw_popcount(n256->isset[i]);
+				cnt += pg_popcount((const char *) n256->isset,
+								   RT_NODE_MAX_SLOTS / BITS_PER_BYTE);
 
 				/*
 				 * Check if the number of
 				 * used chunk matches, accounting for
-- 
2.50.1 (Apple Git-155)
>From aa70ca356ce21cb1a3861a9f1c42ca9485b2dbfd Mon Sep 17 00:00:00 2001
From: Nathan Bossart <[email protected]>
Date: Thu, 22 Jan 2026 11:33:56 -0600
Subject: [PATCH v3 2/2] Remove unnecessary 32-bit optimizations and alignment
 checks.

---
 src/port/pg_popcount_x86.c | 63 ++++++++------------------------------
 1 file changed, 13 insertions(+), 50 deletions(-)

diff --git a/src/port/pg_popcount_x86.c b/src/port/pg_popcount_x86.c
index 245f0167d00..0e98f532552 100644
--- a/src/port/pg_popcount_x86.c
+++ b/src/port/pg_popcount_x86.c
@@ -382,33 +382,16 @@ pg_popcount_sse42(const char *buf, int bytes)
 	uint64		popcnt = 0;
 
 #if SIZEOF_VOID_P >= 8
-	/* Process in 64-bit chunks if the buffer is aligned. */
-	if (buf == (const char *) TYPEALIGN(8, buf))
-	{
-		const uint64 *words = (const uint64 *) buf;
-
-		while (bytes >= 8)
-		{
-			popcnt += pg_popcount64_sse42(*words++);
-			bytes -= 8;
-		}
+	/* Process in 64-bit chunks. */
+	const uint64 *words = (const uint64 *) buf;
 
-		buf = (const char *) words;
-	}
-#else
-	/* Process in 32-bit chunks if the buffer is aligned. */
-	if (buf == (const char *) TYPEALIGN(4, buf))
+	while (bytes >= 8)
 	{
-		const uint32 *words = (const uint32 *) buf;
-
-		while (bytes >= 4)
-		{
-			popcnt += pg_popcount32_sse42(*words++);
-			bytes -= 4;
-		}
-
-		buf = (const char *) words;
+		popcnt += pg_popcount64_sse42(*words++);
+		bytes -= 8;
 	}
+
+	buf = (const char *) words;
 #endif
 
 	/* Process any remaining bytes */
@@ -428,37 +411,17 @@ pg_popcount_masked_sse42(const char *buf, int bytes, bits8 mask)
 	uint64		popcnt = 0;
 
 #if SIZEOF_VOID_P >= 8
-	/* Process in 64-bit chunks if the buffer is aligned */
+	/* Process in 64-bit chunks. */
 	uint64		maskv = ~UINT64CONST(0) / 0xFF * mask;
+	const uint64 *words = (const uint64 *) buf;
 
-	if (buf == (const char *) TYPEALIGN(8, buf))
+	while (bytes >= 8)
 	{
-		const uint64 *words = (const uint64 *) buf;
-
-		while (bytes >= 8)
-		{
-			popcnt += pg_popcount64_sse42(*words++ & maskv);
-			bytes -= 8;
-		}
-
-		buf = (const char *) words;
+		popcnt += pg_popcount64_sse42(*words++ & maskv);
+		bytes -= 8;
 	}
-#else
-	/* Process in 32-bit chunks if the buffer is aligned. */
-	uint32		maskv = ~((uint32) 0) / 0xFF * mask;
-
-	if (buf == (const char *) TYPEALIGN(4, buf))
-	{
-		const uint32 *words = (const uint32 *) buf;
-
-		while (bytes >= 4)
-		{
-			popcnt += pg_popcount32_sse42(*words++ & maskv);
-			bytes -= 4;
-		}
-
-		buf = (const char *) words;
-	}
+	buf = (const char *) words;
 #endif
 
 	/* Process any remaining bytes */
-- 
2.50.1 (Apple Git-155)
