Thanks so much for all the replies. Using clang and `-d:danger` does give me close to a 2x speedup, bringing the run time down to about 160ms. I haven't tried PGO yet. Inline pragmas and `-ffast-math` don't seem to make any significant difference for me.
Here's my Rust implementation, as requested: <https://github.com/alexpardes/benchmarks/blob/master/rust/src/main.rs> In Rust I am building with `cargo build --release`.

I also wrote a version in D after posting yesterday: <https://github.com/alexpardes/benchmarks/blob/master/d/source/app.d> When built with LDC (which is LLVM-based), this runs at essentially the same speed as the Rust version (usually <75ms).

It's perhaps also worth noting that I have an Ivy Bridge CPU, which I assume is older than what most of you are testing with.

Thanks again for taking a look at this. Hearing that eliminating the performance gap is probably just a matter of playing with compiler flags is enough to make me feel confident about Nim's speed.