When running benchmarks on Arm64 servers, I found that some of them are
extremely slow when built with clang.
E.g., "ModeKernelNarrow<BooleanType>/1048576/10000" takes 90s to finish.
Almost all of that time is spent generating random bits (preparing the test
data)[1], not in the test itself.

The sample code below reproduces the issue. I tested it on Arm64 with clang-10
and gcc-7.5, built with -O3.
With gcc, the code finishes in 0.1s. With clang, it takes 11s, roughly 100x
slower.
The issue does not occur on Apple M1 with the Apple clang-12 arm64 compiler.
On x86, the clang build is also much slower than the gcc build, but the gap is
much smaller.

As std::default_random_engine is implementation-defined[2], its behavior
(randomness, speed) is not guaranteed to be the same across toolchains.
Maybe there are better ways to generate random bits?

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/testing/random.cc#L101-L112
[2] https://en.cppreference.com/w/cpp/numeric/random

#include <random>
int main() {
  std::default_random_engine rng(42);
  std::bernoulli_distribution d(0.25);

  int s = 0;
  for (int i = 0; i < 8 * 1024 * 1024; ++i) {
    s += d(rng);
  }

  return s;
}
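
As one possible answer to the question above, here is a minimal sketch. I have
not measured it on the machines mentioned, so whether it closes the gap is an
assumption. It names the engine explicitly (std::mt19937_64) instead of relying
on std::default_random_engine, and replaces std::bernoulli_distribution with an
integer threshold compare. CountBitsThreshold is a hypothetical helper for
illustration, not an existing Arrow function.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper (not part of Arrow): draws n Bernoulli(p) samples and
// returns how many of them were true. Each sample compares the upper 32 bits
// of the raw engine output against an integer threshold, so no floating-point
// distribution is evaluated inside the loop. The probability is quantized to
// a resolution of 2^-32.
template <typename Engine>
int64_t CountBitsThreshold(Engine& rng, double p, int64_t n) {
  // The shift below assumes the engine yields uniform 64-bit values.
  static_assert(Engine::min() == 0 && Engine::max() == UINT64_MAX,
                "engine must produce the full 64-bit range");
  assert(p >= 0.0 && p <= 1.0);

  // p = 1.0 maps to 2^32, which is greater than every 32-bit value.
  const uint64_t threshold = static_cast<uint64_t>(p * 4294967296.0);

  int64_t count = 0;
  for (int64_t i = 0; i < n; ++i) {
    count += (static_cast<uint64_t>(rng()) >> 32) < threshold;
  }
  return count;
}
```

With std::mt19937_64 rng(42), calling CountBitsThreshold(rng, 0.25, 8 * 1024 *
1024) draws the same number of samples as the loop above. The generated bit
sequence differs from the one std::default_random_engine produces, so any test
that depends on exact values for a given seed would need to be checked.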
