Just to close out my take on this: on the same CPU that got 31.5 ms with PGO & -ffast-math, I got 29.0 ms with that Rust version (1.48.0). So the ratio is only about 1.09x (31.5 / 29.0) and likely highly variable from CPU to CPU, which doesn't seem like much of a real problem. (With a gdc-10.2 PGO build I got 57 ms, while ldc is another LLVM-based compiler, IIRC. More evidence that LLVM's default non-PGO choices seem better for this benchmark.)
@apardes - PGO is pretty easy to script, and I predict that once you set up such a script you will use it a lot. I routinely get 1.5-2.0x speed-ups with gcc PGO on Nim-generated C code.
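
For anyone who wants a starting point, here is a rough sketch of what such a script could look like. It is not my actual script. The file names (`bench.c`, `bench`, `input.txt`) and the helper names `pgo_commands` and `run_pgo` are made up for illustration, and it assumes a single C source file compiled directly with gcc. The only real pieces are gcc's `-fprofile-generate` and `-fprofile-use` flags. For Nim, the same flags would typically be handed to the C compiler and linker via `--passC` and `--passL` instead.

```python
import subprocess
from typing import List, Sequence


def pgo_commands(source: str, exe: str,
                 train_args: Sequence[str] = ()) -> List[List[str]]:
    """Return the three commands of a gcc PGO cycle: instrument, train, rebuild.

    All names here are hypothetical; adjust flags and paths to your project.
    """
    base = ["gcc", "-O3", source, "-o", exe]
    return [
        base + ["-fprofile-generate"],  # 1. build an instrumented binary
        ["./" + exe, *train_args],      # 2. run a representative workload (writes .gcda profile data)
        base + ["-fprofile-use"],       # 3. rebuild, letting gcc use the collected profile
    ]


def run_pgo(source: str, exe: str, train_args: Sequence[str] = ()) -> None:
    """Run the three steps in order, stopping at the first failure."""
    for cmd in pgo_commands(source, exe, train_args):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: only print the commands, so nothing is compiled or executed.
    for cmd in pgo_commands("bench.c", "bench", ["input.txt"]):
        print(" ".join(cmd))
```

The training run in step 2 should resemble the real workload, since gcc optimizes for whatever paths the profile shows as hot. Calling `run_pgo("bench.c", "bench", ["input.txt"])` would actually execute the three steps.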