Mostly I wrote all that to show how big an impact the "table's corner" effect you mention can have, and to encourage moving away from anti-thermal `sleep()`s in favor of small run times, which require more care. Just to be clear, that last batch has 9*9 inner loops over 500 pairs, so 40_500 gcd pairs, which should take 4..14 ms and (naively) be plenty to warm up - but evidently it is not enough, and/or the exact way in which things warm up varies. There are various other dynamic resources, like μOp caches. You might enjoy the [sushi_roll](https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll.html) article linked in the first footnote of my aforementioned [tim.md](https://github.com/c-blake/bu/blob/main/doc/tim.md). Also, as mentioned, for a _specific_ workload the cold-cache time (meant here "generally", for all dynamic CPU resources) might be the most relevant anyway.
As to "what to use": cross-CPU variation with hot caches seems comparable to algorithmic differences. I don't think `gcdSub` is bad, but it all depends on how much those last 1.4X's matter. E.g., `gcdSub2` on AlderLake seemed "about" as much faster as `gcdSub2` seemed relative to the stdlib algo on SkyLake, and that's just a couple of Intel ISA generations. (GMP was just an example in the general topic space where you will see a lot of #ifdef soup / cpuid CPU-diversity switches, not meant to refer to this exact fixed-precision algo; I agree cleaner examples may exist.) If you are worried about those last 1.2-2X factors, though, starting with march=native + PGO gcc/clang builds is the low-tech/effort first step, which seemed neglected here (but maybe that's reflective of how Nimmers compile their code? I dunno. To be clear, I definitely wasn't trying to create advice for the Nim stdlib - more just to highlight/reinforce the oft-overlooked subtlety of the measurement situation.)
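For the algorithmic side, a minimal sketch of a binary gcd in the spirit of `gcdSub2` (the name `gcd_bin` and the exact shift/subtract structure here are my illustration, not the benchmarked Nim code): it trades Euclid's data-dependent divisions for shifts and subtracts, and the relative payoff of that trade is exactly the kind of thing that moves across CPU generations like SkyLake vs. AlderLake.

```c
#include <stdint.h>

/* Illustrative binary (Stein's) gcd: strip common factors of 2,
 * then subtract and re-strip; relies on a fast count-trailing-zeros
 * instruction via the gcc/clang builtin. */
static uint64_t gcd_bin(uint64_t a, uint64_t b) {
    if (a == 0) return b;
    if (b == 0) return a;
    int shift = __builtin_ctzll(a | b);  /* shared powers of 2 */
    a >>= __builtin_ctzll(a);            /* make a odd */
    do {
        b >>= __builtin_ctzll(b);        /* make b odd */
        if (a > b) { uint64_t t = a; a = b; b = t; }
        b -= a;                          /* odd - odd = even */
    } while (b != 0);
    return a << shift;                   /* restore shared 2s */
}
```

Whether this beats a division-based Euclid depends on the latency/throughput of the divider vs. `tzcnt` on the given microarchitecture, which is the cross-CPU variation point above.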