https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81616
--- Comment #18 from Andrew Roberts <andrewm.roberts at sky dot com> ---
Ok, trying an entirely different algorithm, same results:

Using the Mersenne Twister algorithm from here:
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/emt19937ar.html

Alter the main program to comment out the original test harness, and replace main with:

    #include <time.h>   /* for clock_t and clock() */

    int main(void)
    {
        int i;
        unsigned long init[4]={0x123, 0x234, 0x345, 0x456}, length=4;
        init_by_array(init, length);
        clock_t e, s=clock();
        int j=genrand_int32();
        for(i=0; i<100000000; i++)
        {
            j ^= genrand_int32();
        }
        e=clock();
        if (j != -549769613) printf("Error j != -549769613 (%d)\n", j);
        printf("mt19937ar took %ld clocks ", (long)(e-s));
        return 0;
    }

So nothing complicated.

On Ryzen:
--------
Top 5:
mt19937ar took 354877 clocks -march=amdfam10 -mtune=k8
mt19937ar took 356203 clocks -march=bdver2 -mtune=eden-x2
mt19937ar took 356534 clocks -march=nano-x2 -mtune=nano-1000
mt19937ar took 357321 clocks -march=athlon-fx -mtune=nano-x4
mt19937ar took 357634 clocks -march=bdver3 -mtune=nano-x2

Bottom 5:
mt19937ar took 675052 clocks -march=nano -mtune=btver1
mt19937ar took 679826 clocks -march=k8 -mtune=nocona
mt19937ar took 681118 clocks -march=opteron -mtune=atom
mt19937ar took 689604 clocks -march=core2 -mtune=broadwell
mt19937ar took 699840 clocks -march=skylake -mtune=generic

Top -mtune=znver1:
mt19937ar took 369722 clocks -march=nano-x2 -mtune=znver1

Top -march=znver1:
mt19937ar took 375286 clocks -march=znver1 -mtune=silvermont

-march=znver1 -mtune=znver1 (aka native):
mt19937ar took 430875 clocks -march=znver1 -mtune=znver1

-march=haswell -mtune=haswell:
mt19937ar took 402963 clocks -march=haswell -mtune=haswell

-march=k8 -mtune=k8:
mt19937ar took 367890 clocks -march=k8 -mtune=k8

So -march=znver1 -mtune=znver1 is:
7% slower than tuning for haswell
17% slower than tuning for k8

Again, -mtune=znver1, -mtune=bdverX and -mtune=btverX all cluster at the bottom.

On Haswell:
----------
Top 5:
mt19937ar took 290000 clocks -march=amdfam10 -mtune=barcelona
mt19937ar took 290000 clocks -march=amdfam10 -mtune=bdver1
mt19937ar took 290000 clocks -march=amdfam10 -mtune=bdver2
mt19937ar took 290000 clocks -march=amdfam10 -mtune=bdver3
mt19937ar took 290000 clocks -march=amdfam10 -mtune=bdver4

Bottom 5:
mt19937ar took 370000 clocks -march=znver1 -mtune=bdver3
mt19937ar took 370000 clocks -march=znver1 -mtune=bdver4
mt19937ar took 370000 clocks -march=znver1 -mtune=btver2
mt19937ar took 370000 clocks -march=znver1 -mtune=znver1
mt19937ar took 380000 clocks -march=knl -mtune=bdver1

Top -mtune=haswell:
mt19937ar took 300000 clocks -march=bdver4 -mtune=haswell

Top -march=haswell:
mt19937ar took 300000 clocks -march=haswell -mtune=broadwell

-march=haswell -mtune=haswell (aka native):
mt19937ar took 300000 clocks -march=haswell -mtune=haswell

Best performing pair:
mt19937ar took 290000 clocks -march=barcelona -mtune=barcelona

So the haswell options are pretty much optimal on that hardware, as in the other test.
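
The 7% and 17% figures quoted for Ryzen follow directly from the clock counts listed for the znver1, haswell and k8 pairs. As a minimal sketch of that arithmetic (the helper name slowdown_percent is made up for illustration and is not part of the test harness):

```c
#include <assert.h>

/* Hypothetical helper, not part of the original harness:
 * percentage by which `clocks` exceeds `baseline`.
 * A positive result means the measured run was slower than the baseline. */
double slowdown_percent(double clocks, double baseline)
{
    assert(baseline > 0.0);   /* guard against division by zero */
    return (clocks / baseline - 1.0) * 100.0;
}
```

Calling slowdown_percent(430875, 402963) gives roughly 6.9 (znver1 vs. haswell tuning on Ryzen), and slowdown_percent(430875, 367890) gives roughly 17.1 (znver1 vs. k8 tuning on Ryzen), matching the rounded figures in the comment.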