According to the following resource:

> Instruction tables: Lists of instruction latencies, throughputs and 
> micro-operation breakdowns for Intel, AMD and VIA CPUs 
> <https://agner.org/optimize/instruction_tables.pdf>

CPU| Latency of `idiv` with a 64-bit register (clock cycles)  
---|---  
Ivy Bridge| 28-103  
Skylake| 42-95  
Ice/Tiger Lake| 15  
  
On older CPUs (my Ivy Bridge and cblake's Skylake), the latency of `idiv` (the 
instruction that computes the remainder) is high compared to other 
instructions, so calling `idiv` fewer times makes gcd faster. gcd with LAR is 
faster than the gcd in Nim's stdlib on older CPUs because it executes fewer 
`idiv` instructions. gcdSub is the fastest because it doesn't use `idiv` at all.
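
To make the "fewer `idiv`" argument concrete, here is a minimal sketch (not the 
code benchmarked in this thread). The names `gcdEuclid`, `gcdLarSketch` and 
`gcdNoDiv` are hypothetical, inputs are assumed to be non-negative, and I assume 
`mod` on `int` compiles to `idiv` on x86-64. The first two procs count how many 
remainder operations they run; the third is one possible division-free variant 
based on shifts and subtraction, which may differ from the actual gcdSub.

```nim
import std/bitops

proc gcdEuclid(a, b: int): tuple[g, divisions: int] =
  ## Plain Euclid: one `mod` per loop iteration.
  var x = a
  var y = b
  var n = 0
  while y != 0:
    x = x mod y
    swap x, y
    inc n
  result = (g: x, divisions: n)

proc gcdLarSketch(a, b: int): tuple[g, divisions: int] =
  ## Least absolute remainder: when r > y - r, continue with y - r.
  ## gcd(y, r) == gcd(y, y - r), so the result is unchanged, but the
  ## remainder at least halves on every step.
  var x = a
  var y = b
  var n = 0
  while y != 0:
    var r = x mod y
    inc n
    if r > y - r:        # the extra branch that LAR adds
      r = y - r
    x = y
    y = r
  result = (g: x, divisions: n)

proc gcdNoDiv(a, b: int): int =
  ## Division-free variant: only shifts, compares and subtraction.
  if a == 0: return b
  if b == 0: return a
  var x = a
  var y = b
  let shift = countTrailingZeroBits(x or y)  # common factor 2^shift
  x = x shr countTrailingZeroBits(x)         # make x odd
  while y != 0:
    y = y shr countTrailingZeroBits(y)       # make y odd
    if x > y: swap x, y
    y -= x                                   # odd - odd is even (or zero)
  result = x shl shift

when isMainModule:
  echo gcdEuclid(48, 18)
  echo gcdLarSketch(48, 18)
  echo gcdNoDiv(48, 18)
```

For `(48, 18)`, plain Euclid needs 3 remainder steps while the LAR sketch needs 
2, at the cost of one extra compare and subtraction per step.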

But on newer CPUs (dlesnoff's Tiger Lake and cblake's Alder Lake), the latency 
of `idiv` is not so large compared to other instructions, so the gcd in stdlib 
is not much slower than gcdLAR4 or gcdSub. Some gcd variants with LAR even 
become slower than the gcd in stdlib because of the added branches and other 
instructions.
