[
https://issues.apache.org/jira/browse/ARROW-12533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yibo Cai updated ARROW-12533:
-----------------------------
Description:
Many benchmarks run very slowly on Arm64 Linux when built with clang.
Most of the time is spent preparing test data, not running the test itself.
Per my investigation, it boils down to poor performance of
`std::uniform_real_distribution`, which uses software emulated `long double`
arithmetic on Arm64 [1].
Apple M1 doesn't have this issue. Clang aarch64 sets the `long double` size to
64 bits on macOS, but to 128 bits on Linux [2].
GCC aarch64 doesn't have this issue. It doesn't use `long double` to generate
random reals [1].
The clang `-ffast-math` option removes the `long double` arithmetic (and applies
other simplifications to floating point arithmetic). It makes generating random
reals 100x faster on Arm64.
It may be worth studying whether `long double` is really necessary, and
whether `-ffast-math` is acceptable for generating test data.
[1] [https://godbolt.org/z/Y3Tc6MTME]
[2] [https://en.wikipedia.org/wiki/Long_double]
was:
Many benchmarks run very slow on Arm64 Linux when built with clang.
Most time is spent in preparing test data, not the test itself.
Per my investigation, it boils down to poor performance of
`std::uniform_real_distribution`, which uses software emulated `long double`
arithmetic on Arm64 [1].
Apple M1 doesn't have this issue. Clang aarch64 sets `long double` size to 64
bits on MacOS, but 128 on Linux [2].
Gcc aarch64 doesn't have this issue. It doesn't use `long double` to generate
random reals [1]. Guess clang uses algorithms with better randomness.
clang `-ffast-math` option removes the `long double` arithmetic (and adds other
simplifications to floating point arithmetic), it improves speed 100x on Arm64
in generating random reals.
It may deserve some effort to study if `long double` is really necessary, and
if `-ffast-math` is acceptable for generating test bits.
[1] [https://godbolt.org/z/Y3Tc6MTME]
[2] [https://en.wikipedia.org/wiki/Long_double]
> [C++] Random real generator is slow on Arm64 Linux when built with clang
> ------------------------------------------------------------------------
>
> Key: ARROW-12533
> URL: https://issues.apache.org/jira/browse/ARROW-12533
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Yibo Cai
> Priority: Minor
>
> Many benchmarks run very slowly on Arm64 Linux when built with clang.
> Most of the time is spent preparing test data, not running the test itself.
> Per my investigation, it boils down to poor performance of
> `std::uniform_real_distribution`, which uses software emulated `long double`
> arithmetic on Arm64 [1].
> Apple M1 doesn't have this issue. Clang aarch64 sets the `long double` size to
> 64 bits on macOS, but to 128 bits on Linux [2].
> GCC aarch64 doesn't have this issue. It doesn't use `long double` to generate
> random reals [1].
> The clang `-ffast-math` option removes the `long double` arithmetic (and
> applies other simplifications to floating point arithmetic). It makes
> generating random reals 100x faster on Arm64.
> It may be worth studying whether `long double` is really necessary, and
> whether `-ffast-math` is acceptable for generating test data.
> [1] [https://godbolt.org/z/Y3Tc6MTME]
> [2] [https://en.wikipedia.org/wiki/Long_double]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)