Yibo Cai created ARROW-12533:
--------------------------------

             Summary: Some benchmarks are slow on Arm64 Linux when built with clang
                 Key: ARROW-12533
                 URL: https://issues.apache.org/jira/browse/ARROW-12533
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Yibo Cai


Many benchmarks run very slowly on Arm64 Linux when built with clang.
Most of the time is spent preparing test data, not running the test itself.

My investigation shows that it boils down to the poor performance of 
`std::uniform_real_distribution`, which uses software-emulated `long double` 
arithmetic on Arm64 [1].
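
A minimal sketch of how the data preparation cost could be measured in
isolation. The helper name `TimeUniformRealFill` is hypothetical and is not
taken from the Arrow code base; the generator (`std::mt19937_64`) is also an
assumption, since the issue does not say which engine the benchmarks use.

```cpp
#include <cassert>
#include <chrono>
#include <random>
#include <vector>

// Hypothetical helper (not Arrow code): fills `out` with doubles drawn from
// std::uniform_real_distribution over [lo, hi) and returns the elapsed
// wall-clock time in seconds.
double TimeUniformRealFill(std::vector<double>& out, double lo, double hi) {
  assert(lo < hi);
  std::mt19937_64 rng(42);  // fixed seed so runs are comparable
  std::uniform_real_distribution<double> dist(lo, hi);

  const auto start = std::chrono::steady_clock::now();
  for (double& v : out) {
    v = dist(rng);
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

Calling this on a large vector, once in a clang build and once in a gcc build
on the same Arm64 Linux machine, should make any difference in data generation
time visible without running the benchmarks themselves.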

Apple M1 doesn't have this issue. Clang aarch64 sets the `long double` size to 
64 bits on macOS, but to 128 bits on Linux [2].
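
The platform difference can be checked directly on a given target. The sketch
below uses only standard facilities; the function names are hypothetical and
chosen for illustration.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Storage size of long double on the current target, in bits.
int LongDoubleStorageBits() {
  return static_cast<int>(sizeof(long double) * CHAR_BIT);
}

// Number of significand (mantissa) bits of long double on the current target.
int LongDoubleMantissaBits() {
  return std::numeric_limits<long double>::digits;
}

// True when long double is more precise than double. If it is not, long
// double arithmetic is the same as double arithmetic on this target.
bool LongDoubleIsWiderThanDouble() {
  // The C++ standard requires long double to be at least as precise as double.
  assert(LongDoubleMantissaBits() >= std::numeric_limits<double>::digits);
  return LongDoubleMantissaBits() > std::numeric_limits<double>::digits;
}
```

Based on the sizes quoted above, `LongDoubleIsWiderThanDouble()` would be
expected to return false on Apple M1 and true on Arm64 Linux.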

GCC aarch64 doesn't have this issue: it doesn't use `long double` to generate 
random reals [1]. My guess is that clang uses an algorithm with better 
randomness.

The clang `-ffast-math` option removes the `long double` arithmetic (and 
applies other simplifications to floating-point arithmetic), which makes 
generating random reals 100x faster on Arm64.

It may be worth investigating whether `long double` is really necessary, and 
whether `-ffast-math` is acceptable for generating test data.
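
For illustration only, one way to produce random doubles without touching
`long double` at all is to scale the top 53 bits of a 64-bit random integer.
This is a sketch of a commonly used technique, not what Arrow or any standard
library does; `RandomDouble` is a hypothetical name, and the resulting values
are not guaranteed to match those of `std::uniform_real_distribution`.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper (not Arrow code): returns a double that is uniform over
// [lo, hi) up to rounding in the final scaling step, using only integer and
// double arithmetic.
double RandomDouble(std::mt19937_64& rng, double lo, double hi) {
  assert(lo < hi);
  // Keep the top 53 bits: an integer in [0, 2^53), exactly representable
  // as a double.
  const std::uint64_t bits = rng() >> 11;
  // Scale by 2^-53 to get a value in [0, 1).
  const double unit = static_cast<double>(bits) * (1.0 / 9007199254740992.0);
  return lo + unit * (hi - lo);
}
```

Whether this is good enough for test data depends on how much the benchmarks
care about the exact distribution of the generated values.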

[1] [https://godbolt.org/z/Y3Tc6MTME]
[2] [https://en.wikipedia.org/wiki/Long_double]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
