Possibly harmful to result interpretation, especially across CPUs with things like RPi/ARM: the minimum over `101` runs in your `bench` template is good for reducing noise from the CPU spinning up to higher clock rates, BUT the two `sleep()` calls are bad since you are probably giving the CPU/OS time to drop the CPU back into a lower power mode. Essentially, you are doing one thing to make results _less_ sensitive to other work happening on the system and another to make them _more_ sensitive. Unless you have a very specific workload in mind, it's usually best to pick a direction.
Your comment in that `bench` template mentions thermal throttling, but this highly local CPU benchmark (i.e. heating up at most the center part of one core) can probably fruitfully be run in mere milliseconds, which should be fast enough to not overheat anything and still give many hundreds of iterations for branch predictors to warm up. So, I think that is the better direction to go here, and I changed the `bench` template to:

```nim
template bench(scale: float; repeat, ms1, ms2: int; label: string; init, body) =
  var minTime = float.high
  for i in 1 .. repeat:
    init
    let start = getMonoTime().ticks
    body
    let finish = getMonoTime().ticks
    minTime = min(minTime, scale*(finish.float - start.float))
    if ms1 > 0: sleep(ms1)        # Maybe prevent thermal throttling.
  echo formatFloat(minTime, ffDecimal, 2), label
  if ms2 > 0: sleep(ms2)
```

and then later pass in `1.0/n` as the scale & such, but maybe most importantly 0 sleeps, coalescing your highly duplicated code like:

```nim
var
  rand: Rand
  res0, res1, res2, res3, res4, res5, res6: int

template timeIt(it, label, res; n = 10000) =
  stdout.write label, ": "
  bench(1.0/n.float, 10, 0, 0, " ns/pair"):
    rand = seed.initRand()
  do:
    for i in 0 .. n:
      var (x, y) = (rand.rand(1 .. int.high), rand.rand(1 .. int.high))
      if x < y: swap x, y
      res = res xor it(x, y)

timeIt(gcd    , "gcdSL  ", res0)
timeIt(gcdLAR , "gcdLAR ", res1)
timeIt(gcdLAR2, "gcdLAR2", res2)
timeIt(gcdLAR3, "gcdLAR3", res3)
timeIt(gcdLAR4, "gcdLAR4", res4)
timeIt(gcdSub , "gcdSub ", res5)
timeIt(gcdSub2, "gcdSub2", res6)
```

With these amendments I get results that are sort of all over the map: `gcdSub`, `gcdLAR4`, or `gcdSub2` winning depending on the setup, plus some large-ish profile-guided optimization (PGO) effects { other backend compilers like `clang`, as well as compilation modes/flags, may also show interesting variation }:

```
SkyLake_i7-6700k
gcdSL  : 357.67 ns/pair   gcdSL  : 357.72 ns/pair
gcdLAR : 339.32 ns/pair   gcdLAR : 339.24 ns/pair
gcdLAR2: 281.59 ns/pair   gcdLAR2: 281.57 ns/pair
gcdLAR3: 264.93 ns/pair   gcdLAR3: 264.95 ns/pair
gcdLAR4: 280.62 ns/pair   gcdLAR4: 280.59 ns/pair
gcdSub : 179.25 ns/pair   gcdSub : 180.16 ns/pair  <--
gcdSub2: 260.33 ns/pair   gcdSub2: 260.85 ns/pair

SkyLake_i7-6700k gcc-PGO
gcdSL  : 333.78 ns/pair   gcdSL  : 333.88 ns/pair
gcdLAR : 304.13 ns/pair   gcdLAR : 304.07 ns/pair
gcdLAR2: 256.13 ns/pair   gcdLAR2: 256.16 ns/pair
gcdLAR3: 240.19 ns/pair   gcdLAR3: 240.29 ns/pair
gcdLAR4: 256.57 ns/pair   gcdLAR4: 256.57 ns/pair
gcdSub : 140.40 ns/pair   gcdSub : 140.56 ns/pair  <--
gcdSub2: 236.12 ns/pair   gcdSub2: 236.09 ns/pair

AlderLake_i7-1370P - same exact binary executable
gcdSL  : 149.03 ns/pair   gcdSL  : 149.03 ns/pair
gcdLAR : 164.41 ns/pair   gcdLAR : 164.27 ns/pair
gcdLAR2: 134.78 ns/pair   gcdLAR2: 134.84 ns/pair
gcdLAR3: 149.74 ns/pair   gcdLAR3: 149.57 ns/pair
gcdLAR4: 126.55 ns/pair   gcdLAR4: 126.65 ns/pair  <--
gcdSub : 149.52 ns/pair   gcdSub : 149.50 ns/pair
gcdSub2: 118.77 ns/pair   gcdSub2: 118.95 ns/pair

AlderLake_i7-1370P - new executable w/march=native
gcdSL  : 149.95 ns/pair   gcdSL  : 149.96 ns/pair
gcdLAR : 165.65 ns/pair   gcdLAR : 165.70 ns/pair
gcdLAR2: 133.97 ns/pair   gcdLAR2: 133.76 ns/pair
gcdLAR3: 145.80 ns/pair   gcdLAR3: 146.17 ns/pair
gcdLAR4: 126.92 ns/pair   gcdLAR4: 126.71 ns/pair
gcdSub : 149.01 ns/pair   gcdSub : 148.38 ns/pair
gcdSub2: 118.38 ns/pair   gcdSub2: 119.52 ns/pair  <--

AlderLake_i7-1370P gcc-PGO
gcdSL  : 138.54 ns/pair   gcdSL  : 139.66 ns/pair
gcdLAR : 156.41 ns/pair   gcdLAR : 155.98 ns/pair
gcdLAR2: 124.87 ns/pair   gcdLAR2: 124.76 ns/pair
gcdLAR3: 134.76 ns/pair   gcdLAR3: 134.70 ns/pair
gcdLAR4: 115.09 ns/pair   gcdLAR4: 114.69 ns/pair
gcdSub : 136.81 ns/pair   gcdSub : 136.13 ns/pair
gcdSub2: 107.85 ns/pair   gcdSub2: 107.75 ns/pair  <--
```

With shorter times to not worry about thermals, noise may be a problem. This is why I paired up two "shell up-arrow" runs above. You might also try pinning the benchmark to one CPU (which I did for all my runs above, just with `taskset` and `chrt` on Linux; Linux also has `perf` to track things like CPU migrations & IPC & such). Of course, to even assess whether you are reducing noise, you need to measure noise (like the pairing above).

If all that above pairing (& reading!) seems like a pain and an excessively manual way to measure run-to-run variability - I agree. See [bu/tim](https://github.com/c-blake/bu/blob/main/doc/tim.md) for nicer ideas { mostly warm-up runs to drop, principled sample-min extension, and principled error bars; of course that page also mentions that error bars alone are inadequate due to kurtosis, but something is better than nothing }. You do need _some_ kind of indication of that variability for readers to make principled comparisons. My guess is you just fiddled with your `101` and `n` until things seemed stable, but that stability on one CPU/OS may not transfer to another, and further, "we" have no way to know what you did / how stable your results were. That is all just commentary about time measurement & reporting on one CPU arch, but that kind of matters for these sorts of minor-refinement comparisons.
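For concreteness, the pinning I describe above looks something like this on Linux (the binary name `./gcd_bench` is just a stand-in for your compiled benchmark; `chrt -f` typically needs elevated privileges or an appropriate rtprio limit):

```shell
# Pin the benchmark to CPU 3 and request FIFO real-time scheduling to
# reduce noise from CPU migrations and preemption by other processes.
# `./gcd_bench` is a hypothetical name; substitute your own executable.
taskset -c 3 chrt -f 99 ./gcd_bench
```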
For this kind of very small, tight arithmetic loop, I do not think you should expect differing CPUs to give you similar rankings, or expect "operation count" (your `gcdCount`) to translate cleanly to "real time". The version control history (or even the current state) of the GMP library would probably be informative here, although also likely very ugly.
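Since the `gcdSub`/`gcdLAR*` definitions under test are not shown in this excerpt, here is only a rough, hypothetical sketch of what a subtraction-based variant in the spirit of `gcdSub` might look like (the actually benchmarked bodies may well differ):

```nim
# Hypothetical sketch: one plausible subtraction-based GCD. The real
# `gcdSub` being benchmarked in this thread is not reproduced here.
func gcdSubSketch(a, b: int): int =
  var (x, y) = (a, b)
  while y != 0:
    while x >= y:      # repeated subtraction stands in for `mod`
      x -= y
    swap x, y          # keep the larger value in `x`
  x

echo gcdSubSketch(48, 18)   # prints 6
```

The per-pair cost of such a loop is dominated by hard-to-predict branches and data-dependent iteration counts, which is consistent with why the rankings above move around between SkyLake and AlderLake.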