With the gcc backend, "profile-guided optimization" (PGO) can often help
(especially by giving the compiler measurements to drive inlining choices).
E.g., using just @Stefan_Salewski's command line, I get:
julia1: 84 ms (sum of pixels: 27677748)
julia2: 83 ms (sum of pixels: 27677748)
julia3: 82 ms (sum of pixels: 27677748)
whereas doing this:
nim c -d:danger --panics:on -c t.nim
gcc -O3 -flto -fprofile-generate -I/usr/lib/nim/lib ~/.cache/nim/r/t/*.c -o pg
./pg
gcc -O3 -flto -fprofile-use -I/usr/lib/nim/lib ~/.cache/nim/r/t/*.c -o t-final
I get:
julia1: 82 ms (sum of pixels: 27677748)
julia2: 82 ms (sum of pixels: 27677748)
julia3: 82 ms (sum of pixels: 27677748)
So PGO "flattened" the performance across the three variants a bit more. In
this example the PGO speed boost was close to zero (within measurement error),
but I have seen speed-ups as high as 2x for more complicated programs. So it's
worth having a little "nim-pgo" wrapper script to automate the above if you are
writing programs that have an easy "benchmark run".
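
For what it's worth, here is a rough sketch of what such a wrapper might look like. It just strings together the same four steps as above. The script name, the `NIM_LIB` and `DRY_RUN` variables, the output names (`NAME-pg`, `NAME-final`), and the private `--nimcache` directory are all my own assumptions for illustration; the include path is the one from my machine and will likely differ on yours, and some setups may need extra link flags.

```shell
#!/bin/sh
# nim-pgo (hypothetical name): build a Nim program with gcc profile-guided
# optimization. Usage: nim-pgo prog.nim [args for the benchmark run...]
# Assumptions: gcc backend; Nim's lib dir is $NIM_LIB; output names are made up.

NIM_LIB="${NIM_LIB:-/usr/lib/nim/lib}"   # path from the post; adjust per install
DRY_RUN="${DRY_RUN:-0}"                  # DRY_RUN=1 only prints the commands

# Echo a command, then execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  [ "$DRY_RUN" = 1 ] || "$@"
}

nim_pgo() {
  src=$1; shift
  name=$(basename "$src" .nim)
  cache="./nimcache-pgo-$name"   # private cache dir so the *.c glob is predictable

  # 1. Generate C only (-c), with the same Nim flags as in the post.
  run nim c -d:danger --panics:on --nimcache:"$cache" -c "$src" || return 1
  # 2. Build an instrumented binary.
  run gcc -O3 -flto -fprofile-generate -I"$NIM_LIB" "$cache"/*.c -o "$name-pg" || return 1
  # 3. The benchmark run; this is what writes the profile data.
  run "./$name-pg" "$@" || return 1
  # 4. Rebuild, letting gcc use the recorded profile.
  run gcc -O3 -flto -fprofile-use -I"$NIM_LIB" "$cache"/*.c -o "$name-final" || return 1
}

# Only build when a source file is given on the command line.
if [ $# -gt 0 ]; then nim_pgo "$@"; fi
```

Something like `DRY_RUN=1 ./nim-pgo t.nim` should print the four commands without running them, which is a cheap way to check the paths before a real build. Any arguments after the source file are passed to the instrumented binary, so the "benchmark run" can be given a representative input.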