According to the following resource:

> Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs
> <https://agner.org/optimize/instruction_tables.pdf>


CPU| Latency of `idiv` with 64-bit register (clock cycles)
---|---
Ivy Bridge| 28-103
Skylake| 42-95
Ice/Tiger Lake| 15

On older CPUs (my Ivy Bridge and cblake's Skylake), the latency of `idiv` (the instruction that computes the remainder) is large compared to other instructions, so calling `idiv` fewer times makes gcd faster. gcd with LAR is faster than the gcd in Nim's stdlib on older CPUs because it executes fewer `idiv` instructions than the stdlib gcd. And gcdSub is the fastest because it doesn't use `idiv` at all.

But on newer CPUs (dlesnoff's Tiger Lake and cblake's Alder Lake), the latency of `idiv` is not so large compared to other instructions, so the gcd in stdlib is not much slower than gcdLAR4 or gcdSub. Some gcd variants with LAR even become slower than the stdlib gcd because of the added branches and other instructions.
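
To make the "fewer `idiv`" point concrete, here is a minimal sketch that counts how many `mod` operations a plain Euclid loop needs compared with a least-absolute-remainder step. The names `euclidSteps` and `larSteps` are made up for this illustration; they are not the stdlib gcd or the gcdLAR4/gcdSub procs benchmarked in this thread, and I am assuming LAR here means "replace the remainder `r` by `y - r` whenever that is smaller".

```nim
# Hypothetical helpers for illustration only; not the benchmarked procs.

proc euclidSteps(a, b: int64): tuple[g: int64, divs: int] =
  ## Plain Euclid: one `mod` (one division instruction) per iteration.
  var x = abs(a)
  var y = abs(b)
  var divs = 0
  while y != 0:
    let r = x mod y
    inc divs
    x = y
    y = r
  result = (g: x, divs: divs)

proc larSteps(a, b: int64): tuple[g: int64, divs: int] =
  ## Least absolute remainder (assumed definition): since
  ## gcd(y, r) == gcd(y, y - r), take whichever of r and y - r is smaller.
  var x = abs(a)
  var y = abs(b)
  var divs = 0
  while y != 0:
    var r = x mod y
    inc divs
    if r > y - r:      # the extra branch that LAR adds to every iteration
      r = y - r
    x = y
    y = r
  result = (g: x, divs: divs)

when isMainModule:
  # Consecutive Fibonacci numbers are the worst case for plain Euclid.
  echo "euclid: ", euclidSteps(89, 55)
  echo "lar:    ", larSteps(89, 55)
```

For 89 and 55 the plain loop does 9 divisions and the LAR loop does 5, both returning 1. Whether that saving pays for the extra compare and branch depends on how expensive `idiv` is on the CPU in question, which matches the difference between the older and newer machines above.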