Hi John, Oops this email has been sitting in my outbox for 3 days... On Wed, Mar 4, 2020 at 1:46 AM John Naylor <john.nay...@2ndquadrant.com> wrote: > On Tue, Mar 3, 2020 at 4:46 AM Jesse Zhang <sbje...@gmail.com> wrote: > > I've quickly put together a PoC patch on top of yours, which > > re-implements ceil_log2 using LZCNT coupled with a CPUID check. > > Thoughts? > > This patch seems to be making an assumption that an indirect function > call is faster than taking a branch (in inlined code) that the CPU > will almost always predict correctly. It would be nice to have some > numbers to compare. (against pg_count_leading_zeros_* using the "slow" > versions but statically inlined). >
Ah, how could I forget that... I ran a quick benchmark on my laptop, and indeed, even though the GCC-generated code takes a hit on zero input (Clang generates slightly different code that gives indistinguishable runtime for zero and non-zero inputs), the inlined code (the function input in my benchmark is never a constant literal so the branch does get exercised at runtime) is still more than twice as fast as the function call. ------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------ BM_pfunc/0 1.57 ns 1.56 ns 447127265 BM_pfunc/1 1.56 ns 1.56 ns 449618696 BM_pfunc/8 1.57 ns 1.57 ns 443013856 BM_pfunc/64 1.57 ns 1.57 ns 448784369 BM_slow/0 0.602 ns 0.600 ns 1000000000 BM_slow/1 0.391 ns 0.390 ns 1000000000 BM_slow/8 0.392 ns 0.391 ns 1000000000 BM_slow/64 0.391 ns 0.390 ns 1000000000 BM_fast/0 1.47 ns 1.46 ns 477513921 BM_fast/1 1.47 ns 1.46 ns 473992040 BM_fast/8 1.46 ns 1.46 ns 474895755 BM_fast/64 1.47 ns 1.46 ns 477215268 For your amusement, I've attached the meat of the benchmark. To build the code you can grab the repository at https://github.com/d/glowing-chainsaw/tree/pfunc > Stylistically, "8 * sizeof(num)" is a bit overly formal, since the > hard-coded number we want is in the name of the function. Oh yeah, overly generic code is indicative of the remnants of my C++ brain, will fix. Cheers, Jesse
#include "lzcnt.h" static bool go = lzcnt_available(); __attribute__((target("lzcnt"))) inline int lzcnt_fast(uint64_t word) { return __builtin_clzll(word); } static int lzcnt_choose(uint64_t word) { if (lzcnt_available()) lzcnt_pfunc = lzcnt_fast; else lzcnt_pfunc = lzcnt_slow; return lzcnt_pfunc(word); } int (*lzcnt_pfunc)(uint64_t word) = lzcnt_choose;
#ifndef LZCNT_H #define LZCNT_H #include <cpuid.h> #include <cstdint> static inline bool lzcnt_available() { uint32_t exx[4] = {0, 0, 0, 0}; __get_cpuid(0x80000001, exx + 0, exx + 1, exx + 2, exx + 3); return (exx[2] & bit_ABM) != 0; } extern int (*lzcnt_pfunc)(uint64_t word); __attribute__((target("no-lzcnt"))) inline int lzcnt_slow(uint64_t word) { if (word == 0) return 8 * sizeof(word); return __builtin_clzll(word); } int lzcnt_fast(uint64_t word); #endif
#include "benchmark/benchmark.h" #include "lzcnt.h" static void BM_pfunc(benchmark::State& state) { auto word = state.range(0); for (auto _ : state) { lzcnt_pfunc(word); } } static void BM_slow(benchmark::State& state) { auto word = state.range(0); for (auto _ : state) { benchmark::DoNotOptimize(lzcnt_slow(word)); } } static void BM_fast(benchmark::State& state) { auto word = state.range(0); for (auto _ : state) { lzcnt_fast(word); } } BENCHMARK(BM_pfunc)->Arg(0)->Range(1, 64); BENCHMARK(BM_slow)->Arg(0)->Range(1, 64); BENCHMARK(BM_fast)->Arg(0)->Range(1, 64); BENCHMARK_MAIN();