Hi John,
Oops this email has been sitting in my outbox for 3 days...

On Wed, Mar 4, 2020 at 1:46 AM John Naylor <john.nay...@2ndquadrant.com> wrote:
> On Tue, Mar 3, 2020 at 4:46 AM Jesse Zhang <sbje...@gmail.com> wrote:
> > I've quickly put together a PoC patch on top of yours, which
> > re-implements ceil_log2 using LZCNT coupled with a CPUID check.
> > Thoughts?
>
> This patch seems to be making an assumption that an indirect function
> call is faster than taking a branch (in inlined code) that the CPU
> will almost always predict correctly. It would be nice to have some
> numbers to compare. (against pg_count_leading_zeros_* using the "slow"
> versions but statically inlined).
>

Ah, how could I forget that... I ran a quick benchmark on my laptop, and
indeed, even though the GCC-generated code takes a hit on zero input
(Clang generates slightly different code that gives indistinguishable
runtime for zero and non-zero inputs), the inlined code (the function
input in my benchmark is never a constant literal so the branch does get
exercised at runtime) is still more than twice as fast as the function
call.

------------------------------------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
BM_pfunc/0        1.57 ns         1.56 ns    447127265
BM_pfunc/1        1.56 ns         1.56 ns    449618696
BM_pfunc/8        1.57 ns         1.57 ns    443013856
BM_pfunc/64       1.57 ns         1.57 ns    448784369
BM_slow/0        0.602 ns        0.600 ns   1000000000
BM_slow/1        0.391 ns        0.390 ns   1000000000
BM_slow/8        0.392 ns        0.391 ns   1000000000
BM_slow/64       0.391 ns        0.390 ns   1000000000
BM_fast/0         1.47 ns         1.46 ns    477513921
BM_fast/1         1.47 ns         1.46 ns    473992040
BM_fast/8         1.46 ns         1.46 ns    474895755
BM_fast/64        1.47 ns         1.46 ns    477215268


For your amusement, I've attached the meat of the benchmark. To build
the code you can grab the repository at
https://github.com/d/glowing-chainsaw/tree/pfunc

> Stylistically, "8 * sizeof(num)" is a bit overly formal, since the
> hard-coded number we want is in the name of the function.

Oh yeah, overly generic code is indicative of the remnants of my C++
brain, will fix.

Cheers,
Jesse
#include "lzcnt.h"

static bool go = lzcnt_available();

__attribute__((target("lzcnt"))) inline int lzcnt_fast(uint64_t word) {
  return __builtin_clzll(word);
}

static int lzcnt_choose(uint64_t word) {
  if (lzcnt_available())
    lzcnt_pfunc = lzcnt_fast;
  else
    lzcnt_pfunc = lzcnt_slow;
  return lzcnt_pfunc(word);
}

int (*lzcnt_pfunc)(uint64_t word) = lzcnt_choose;
#ifndef LZCNT_H
#define LZCNT_H

#include <cpuid.h>
#include <cstdint>

static inline bool lzcnt_available() {
  uint32_t exx[4] = {0, 0, 0, 0};

  __get_cpuid(0x80000001, exx + 0, exx + 1, exx + 2, exx + 3);

  return (exx[2] & bit_ABM) != 0;
}

extern int (*lzcnt_pfunc)(uint64_t word);

__attribute__((target("no-lzcnt"))) inline int lzcnt_slow(uint64_t word) {
  if (word == 0)
    return 8 * sizeof(word);
  return __builtin_clzll(word);
}

int lzcnt_fast(uint64_t word);

#endif
#include "benchmark/benchmark.h"

#include "lzcnt.h"

static void BM_pfunc(benchmark::State& state) {
  auto word = state.range(0);
  for (auto _ : state) {
    lzcnt_pfunc(word);
  }
}

static void BM_slow(benchmark::State& state) {
  auto word = state.range(0);
  for (auto _ : state) {
    benchmark::DoNotOptimize(lzcnt_slow(word));
  }
}

static void BM_fast(benchmark::State& state) {
  auto word = state.range(0);
  for (auto _ : state) {
    lzcnt_fast(word);
  }
}

BENCHMARK(BM_pfunc)->Arg(0)->Range(1, 64);
BENCHMARK(BM_slow)->Arg(0)->Range(1, 64);
BENCHMARK(BM_fast)->Arg(0)->Range(1, 64);

BENCHMARK_MAIN();

Reply via email to