I got a 1.76x speed-up (86/49, relative to `-d:danger`) with PGO on gcc-10.2, 
Linux, 4.7GHz Skylake (default GC): 
    
    
    120 ms -d:release
     86 ms -d:danger
     49 ms PGO
    
    

The full range of performance (120/49 = 2.45x) is comparable to @jrfondren's 
198/78 = 2.54x, so I suspect clang PGO would be similar (I do not have a script 
set up for that, but [see here](https://forum.nim-lang.org/t/6295)).

@apardes reported a full 4.0x ratio. So, it's possible there is still about 
1.6x (4.0/2.45) to be explained, and/or some Nim-level optimization that could 
be done. It is also possible that different compilation settings cover the gap 
for him. `vdivsd` showed up at the top of a quick profile for me, so 
multiplying by the reciprocal may be faster than dividing in 
`proc /(v: Vector, c: float64)`. Maybe Rust is smart enough to do that here? 
Or maybe some other micro-optimization work is needed?
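
To make the reciprocal idea concrete, here is a minimal sketch. I am assuming a 
three-component `Vector` of `float64` fields; the type and the proc names 
`divByScalar` and `divByRecip` are made up for illustration and are not taken 
from the benchmark source.

```nim
# Hypothetical stand-in for the benchmark's Vector type (an assumption).
type Vector = object
  x, y, z: float64

# Division-based version: one floating-point divide per component.
proc divByScalar(v: Vector, c: float64): Vector {.inline.} =
  Vector(x: v.x / c, y: v.y / c, z: v.z / c)

# Reciprocal-based version: a single divide, then three multiplies.
proc divByRecip(v: Vector, c: float64): Vector {.inline.} =
  let r = 1.0 / c
  Vector(x: v.x * r, y: v.y * r, z: v.z * r)

when isMainModule:
  let v = Vector(x: 1.0, y: 2.0, z: 3.0)
  echo divByScalar(v, 4.0)
  echo divByRecip(v, 4.0)
```

The two versions are not bit-identical in general, since `x * (1.0 / c)` can 
round differently from `x / c`. That is why a C compiler normally only makes 
this rewrite itself under relaxed floating-point settings (for gcc, 
`-ffast-math` or `-freciprocal-math`), so doing it by hand changes results 
slightly and is worth measuring before adopting.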
