On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote:
> On 2024-Mar-31, Nathan Bossart wrote:
>> +	popcnt = _mm512_reduce_add_epi64(accum);
>> +	return popcnt + pg_popcount_fast(buf, bytes);
>
> Hmm, doesn't this arrangement cause an extra function call to
> pg_popcount_fast to be used here?  Given the level of micro-optimization
> being used by this code, I would have thought that you'd have tried to
> avoid that.  (At least, maybe avoid the call if bytes is 0, no?)
Yes, it does.  I did another benchmark on very small arrays and can see
the overhead.  This is the time in milliseconds to run pg_popcount() on
an array 1 billion times:

	size (bytes)      HEAD    AVX512-POPCNT
	1             1707.685         3480.424
	2             1926.694         4606.182
	4             3210.412         5284.506
	8             1920.703         3640.968
	16            2936.910         4045.586
	32            3627.956         5538.418
	64            5347.213         3748.212

I suspect that anything below 64 bytes will see this regression, as that
is the earliest point where there are enough bytes for ZMM registers.

We could avoid the call if there are no remaining bytes, but the numbers
for the smallest arrays probably wouldn't improve much, and that might
actually add some overhead due to branching.  The other option to avoid
this overhead is to put most of pg_bitutils.c into its header file so
that we can inline the call.

Reviewing the current callers of pg_popcount(), IIUC the only ones that
are passing very small arrays are the bit_count() implementations and a
call in the syslogger for a single byte.  I don't know how much to worry
about the overhead for bit_count() since there's presumably a bunch of
other overhead, and the syslogger one could probably be fixed via an
inline function that pulled the value from pg_number_of_ones (which would
probably be an improvement over the status quo, anyway).  But this is all
to save a couple of nanoseconds...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com