Yibo Cai created ARROW-12533:
--------------------------------
Summary: Some benchmarks are slow on Arm64 Linux when built with clang
Key: ARROW-12533
URL: https://issues.apache.org/jira/browse/ARROW-12533
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Yibo Cai
Many benchmarks run very slowly on Arm64 Linux when built with clang.
Most of the time is spent preparing test data, not running the test itself.
Per my investigation, it boils down to the poor performance of
`std::uniform_real_distribution`, which uses software-emulated `long double`
arithmetic on Arm64 [1].
Apple M1 doesn't have this issue. Clang on aarch64 sets the `long double` size
to 64 bits on macOS, but to 128 bits on Linux [2].
GCC on aarch64 doesn't have this issue: it doesn't use `long double` to
generate random reals [1]. My guess is that clang uses an algorithm with
better randomness.
The clang `-ffast-math` option removes the `long double` arithmetic (and applies
other simplifications to floating-point arithmetic); it makes generating random
reals about 100x faster on Arm64.
It may be worth studying whether `long double` is really necessary, and
whether `-ffast-math` is acceptable for generating test data.
[1] [https://godbolt.org/z/Y3Tc6MTME]
[2] [https://en.wikipedia.org/wiki/Long_double]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)