For the curious, it matters quite a bit that the arithmetic is floating point, because 
the "big" optimization in question relates to FP vectorization. There are more 
details in the @xyz32, @aedt, @cblake, and @oyster comments in 
[https://forum.nim-lang.org/t/1779](https://forum.nim-lang.org/t/1779) (which 
is otherwise mostly about D vs Nim). At least 3 other people reproduced that 
large 5x+ performance delta from the optimization/vectorization/fewer-function-calls 
trick. So, it's probably not hypersensitive to gcc optimization 
flags, but I do think PGO builds cause this particular optimization to be 
missed.

And... UPDATE: years later, the vectorization/call-elimination optimization 
still works on gcc-10.1 and still requires the indirect call structure 
(for both my reduced C and, implicitly, the Nim program at the top of the 
thread). clang-10, AFAIK, still cannot do the optimization. (At least, `clang 
-Ofast` on that reduced C program remains 10x slower than gcc.)
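
To make the discussion concrete, here is a minimal sketch of the kind of reduced 
C program in question. The exact program from the thread may differ; in particular, 
the placement of the function pointer (the "indirect call structure") and the use 
of float32 are my assumptions:

```c
/* fib.c - hedged sketch of a reduced float-Fibonacci program.
 * The recursive function is reached through a function pointer; gcc -Ofast
 * reportedly vectorizes this shape and eliminates many recursive calls,
 * while clang -Ofast does not. */
#include <stdio.h>
#include <stdlib.h>

static float fib(float n) {
    /* naive exponentially recursive Fibonacci, in float32 on purpose */
    return n < 2.0f ? n : fib(n - 1.0f) + fib(n - 2.0f);
}

int main(int argc, char **argv) {
    float n = argc > 1 ? (float)atof(argv[1]) : 40.0f;
    float (*f)(float) = fib;  /* indirect call; assumed trigger for the optimization */
    printf("fib(%g) = %g\n", n, f(n));
    return 0;
}
```

Comparing `gcc -Ofast` against `clang -Ofast` on something shaped like this is 
the quickest way to see whether the speedup reproduces on your machine.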

In my test just now, the Nim version took 2x the time of the C that was 
specialized to activate the optimization (my best guess is that float32 in the C 
vs. float64 in the Nim changes the vectorization stride), but both are still 
5..10x better than other compilers just because they do so much less work 
(because of the **_exponential sensitivity of Fibonacci work_** ). Why, maybe 
someone with AVX-512 can get gcc to do some 16-way or 32-way version that does 
**_36x or 200x_** less work or something! If you don't like reading x64 
assembly, you can build with `gcc -pg`-style profiling and use `gprof` to confirm 
how many fewer function calls happen.
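
For instance, a profiling run might look like the below (the program name and 
flags are illustrative; note that `-pg` instrumentation can itself perturb 
optimization, so it's worth sanity-checking timings with and without it):

```sh
gcc -Ofast -pg fib.c -o fib   # optimized build with gprof instrumentation
./fib 30                      # running it writes gmon.out
gprof fib gmon.out            # the "calls" column shows how often fib ran
```

If the call-elimination trick fired, the call count for `fib` should be far 
below the 2*fib(n+1)-1 invocations the naive recursion would otherwise make.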

It is probably hard-to-near-impossible to convey just how fraught with 
apples-to-oranges peril the Fibonacci benchmark has become (especially to 
people like the author of that article comparing debug builds to optimized builds).
