For the curious, it matters quite a bit that the arithmetic is floating point, because the "big" optimization in question relates to FP vectorization. There are more details in the @xyz32, @aedt, @cblake and @oyster comments in [https://forum.nim-lang.org/t/1779](https://forum.nim-lang.org/t/1779) (which is otherwise mostly about D vs Nim). At least 3 other people reproduced that large 5+x performance delta from the optimization/vectorization/fewer-function-calls trick. So it's probably not too hypersensitive to gcc optimization flags, but I do think PGO builds cause this particular optimization to be missed.
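To make the shape concrete, here is a minimal sketch of the kind of reduced C I mean: a naive doubly recursive Fibonacci on float32, invoked through a function pointer. The actual reduced program is in the linked thread; names and constants here are mine, and this version only illustrates the indirect-call structure that (per the discussion) gcc needs to fire the optimization.

```c
#include <stdio.h>
#include <stdlib.h>

/* Naive doubly recursive Fibonacci on float32 - exponential work in n. */
static float fib(float n) {
    return n < 2.0f ? n : fib(n - 1.0f) + fib(n - 2.0f);
}

int main(int argc, char **argv) {
    /* Take n from the command line so the compiler cannot fold it away. */
    float n = argc > 1 ? (float)atof(argv[1]) : 40.0f;
    /* The indirect call structure the optimization appears to key on. */
    float (*f)(float) = fib;
    printf("fib(%.0f) = %.1f\n", n, f(n));
    return 0;
}
```

Compiled with `gcc -Ofast`, this is the setup where the 5+x delta showed up; whether your gcc fires the optimization is easiest to check by timing it against `clang -Ofast` on the same file.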
And... UPDATE - Years later, the vectorization/call-elimination optimization still works on gcc-10.1 and still requires the indirect call structure (for both my reduced C and, implicitly, the Nim program at the top of the thread). clang-10, AFAIK, still cannot do the optimization (at least `clang -Ofast` on that reduced C program remains 10x slower than gcc). In my test just now, the Nim version took 2x the time of the C specialized to activate the optimization (best guess: float32 in the C vs float64 in the Nim changes the vectorization stride), but both are still 5..10x better than other compilers simply because they do so much less work (because of the **_exponential sensitivity of Fibonacci work_**). Why, maybe someone with AVX-512 can get gcc doing some 16-way or 32-way thing that does **_36x or 200x_** less work!

If you don't like reading x64 assembly, you can add `gcc -pg` style profiling and use `gprof` to confirm how many fewer function calls happen (see the example below). It is probably hard to near impossible to convey just how fraught with apples-to-oranges peril the Fibonacci benchmark has become (especially to people like that article's author comparing debug builds to optimized builds).
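Assuming the sketch above is saved as `fib.c`, the profiling route looks something like this (note that `-pg` instrumentation can itself perturb inlining, so treat the counts as a sanity check rather than a precise measurement):

```
$ gcc -Ofast -pg fib.c -o fib    # instrument function entries for profiling
$ ./fib 35                      # running the binary writes gmon.out
$ gprof fib gmon.out | head     # flat profile includes the call count for fib
```

When the optimization fires, the call count `gprof` reports for `fib` should be dramatically lower than the roughly 2*fib(n) calls a naive compilation performs.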