Mostly I wrote all that to show how big an impact the "table's corner" effect 
you mention can be, and to encourage moving away from anti-thermal `sleep()`s in 
favor of small timing runs, which require more care. Just to be clear, there are 
9*9 inner loops by 500 pairs in that last batch, so 40_500 gcd pairs, which 
should take 4..14 ms and (naively) be plenty to warm up, but evidently that is 
not enough and/or the exact way in which things warm up varies. There are 
various other dynamic resources like μOp caches. You might enjoy the 
[sushi_roll](https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll.html) 
article linked to in the first footnote of my aforementioned 
[tim.md](https://github.com/c-blake/bu/blob/main/doc/tim.md). Also, as 
mentioned, for a _specific_ workload the cold-cache time ("cache" meant here 
generally, covering all dynamic CPU resources) might be the most relevant 
number anyway.

As to "what to use": hot-cache variation across CPUs seems comparable to the 
algorithmic differences. I don't think `gcdSub` is bad, but it all depends on 
how much those last 1.4X factors matter. E.g., on AlderLake, `gcdSub2` seemed 
"about" as much faster than `gcdSub` as `gcdSub2` was over the stdlib algo on 
SkyLake, and that's just a couple of ISA generations of Intel. (GMP was just an 
example from the general topic space where you will see a lot of #ifdef 
soup/cpuid CPU-diversity switches, not meant to refer to this exact 
fixed-precision algo; I agree cleaner examples may exist.)

If you are worried about the last 1.20..2X factors, though, starting with 
`-march=native` + PGO gcc/clang builds is the low-tech/low-effort first step, 
which seemed neglected here (but maybe that's reflective of how Nimmers compile 
their code? I dunno. To be clear, I definitely wasn't trying to craft advice 
for the Nim stdlib, just to highlight/reinforce the oft-overlooked subtlety of 
the measurement situation.)
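For reference, that low-tech first step is just the standard three-stage gcc/clang recipe (file names here are illustrative; the training run should resemble the real workload):

```shell
# 1. Instrumented build
gcc -O3 -march=native -fprofile-generate gcd_bench.c -o gcd_bench
# 2. Training run on a representative workload (writes *.gcda profile files)
./gcd_bench
# 3. Optimized rebuild guided by the collected profile
gcc -O3 -march=native -fprofile-use gcd_bench.c -o gcd_bench
```

(Nim users get the same effect by passing these through `--passC`/`--passL` or a config file, since Nim compiles via C.)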
