On Fri, Feb 20, 2026 at 03:21:05PM +0700, John Naylor wrote:
> On Thu, Feb 5, 2026 at 4:43 AM Nathan Bossart <[email protected]>
> wrote:
>> Yeah, the plain C version might be marginally slower than the built-in
>> version for that test, but it still seems quite a bit faster than HEAD.
>>
>>     HEAD    v8    v10
>>     40      25    29
>
> (for the following, numbers are nanoseconds per call from
> drive_bms_num_members())
>
> Seems similar on S390X / gcc 13.3 (last week I only tested a single
> bitmapword and feel don't like repeating):
>
> master (older): 4.1083 (call builtin)
> v8: 2.8889 (inline builtin)
> v10: 2.7961 (inline pure C)
Thanks for testing it.

> On ppc64le / gcc 8.5, without native popcount it suffers:
>
> words    master    v14
> 1        4.5       6.5
> 2        5.8       9.7
> 64       67.9      101
> 128      143       190
>
> So one up, one down among obscure platforms. There seems to be a
> fairly thin case for the builtin anymore, although it's not zero.

I spent some time looking at how clang/gcc compiled the plain-C version on
various architectures [0], and I was pleasantly surprised to discover that
at some point in recent history they started automatically converting it to
special popcount instructions.  I suspect that you'd see better results on
ppc64le if you upgraded the compiler...

[0] https://godbolt.org/z/v9vvx7E89

--
nathan
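
For reference, a minimal sketch of what a "plain-C" 64-bit popcount can look
like. This is an assumption for illustration only: the function name
popcount64_plain is hypothetical, and the formulation used in the patch
versions discussed above (and in the godbolt link) may differ. It shows the
kind of portable bit-twiddling code, with no compiler builtin, that the
message says newer compilers may convert to a popcount instruction on
architectures that have one.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: count the set bits in a
 * 64-bit word using only portable C (no __builtin_popcountll).
 */
static inline int
popcount64_plain(uint64_t x)
{
    /* Sum adjacent bits into 2-bit fields. */
    x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
    /* Sum adjacent 2-bit fields into 4-bit fields. */
    x = (x & UINT64_C(0x3333333333333333)) +
        ((x >> 2) & UINT64_C(0x3333333333333333));
    /* Sum adjacent 4-bit fields into per-byte counts. */
    x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
    /* Add all byte counts together; the total lands in the top byte. */
    return (int) ((x * UINT64_C(0x0101010101010101)) >> 56);
}
```

Whether a given compiler turns this into a single instruction depends on the
compiler version and target flags, so checking the generated assembly is the
only reliable way to tell.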
