I recently found several ways to implement the Euclidean algorithm. This code has 7 different gcd funcs that implement the Euclidean algorithm and measures their runtime speed:
<https://gist.github.com/demotomohiro/9680df5b8fdd48471af9378dc1fa56ad>

* gcd: the gcd copied from the math module in Nim's stdlib
* gcdLAR: a simple gcd using the least absolute remainder
* gcdLAR2: a gcd using the least absolute remainder, with many branches
* gcdLAR3: converts `mod` so it returns a remainder r such that -y/2 < r <= y/2
* gcdLAR4: a simpler gcd using the least absolute remainder
* gcdSub: almost the same as the gcd in <https://github.com/nim-lang/bigints>
* gcdSub2: the same as gcdSub except it uses `mod` instead of subtraction to reduce the number of iterations

I read this wiki article: <https://en.wikipedia.org/wiki/Euclidean_algorithm#Method_of_least_absolute_remainders>

I wanted to see whether a Euclidean algorithm that uses smaller remainders is faster than the `gcd` func in the math module in Nim's stdlib. In Nim (and in most programming languages I know), `a mod b` returns `r` such that `a = b * q + r (0 <= r < b)`. If `a mod b` instead returned `r` such that `a = b * q + r (-b/2 <= r < b/2)`, the number of iterations of the Euclidean algorithm can decrease, because the absolute values of the remainders tend to become smaller. But when a remainder is already smaller than b/2 with the default `mod`, it doesn't reduce the number of iterations. For example:

If 0 <= r < b:

```
104 mod 66 = 38
66 mod 38 = 28
38 mod 28 = 10
28 mod 10 = 8
10 mod 8 = 2
8 mod 2 = 0
```

If -b/2 <= r < b/2:

```
104 mod 66 = -28
66 mod -28 = 10
-28 mod 10 = 2
10 mod 2 = 0
```

(A sketch of this least-absolute-remainder idea is at the end of this post.)

My code runs each gcd func with random int values and counts the number of iterations when the following consts at the top of my code are set:

```nim
const
  EnableTests = false
  EnableCount = not EnableTests and true
  EnableBench = not EnableTests
```

Output of each counter:

```
gcdCount = 36707036
gcdLARCount = 25940133
gcdLAR2Count = 25940133
gcdLAR3Count = 25940133
gcdLAR4Count = 25940133
gcdSubCount = 43399700
gcdSub2Count = 22028403
```

gcdLAR reduced the number of iterations to about 71% of the stdlib gcd. And counting trailing zero bits and removing them (as `gcdSub` and `gcdSub2` do) removes even more iterations than least absolute remainders do (a sketch of that technique is also at the end of this post).

My code runs the benchmark with the following settings:

```nim
const
  EnableTests = false
  EnableCount = not EnableTests and false
  EnableBench = not EnableTests
```

```
nim c -r -d:release testgcd.nim
gcd in stdlib: 11068 micro second
gcdLAR: 10558 micro second
gcdLAR2: 10481 micro second
gcdLAR3: 7852 micro second
gcdLAR4: 9055 micro second
gcdSub: 5221 micro second
gcdSub2: 8204 micro second
```

On my Intel Ivy Bridge CPU, the least-absolute-remainder variants are faster than the gcd in the stdlib, but gcdSub is the fastest.

On a Raspberry Pi 3 (aarch64), however, I got a very different result:

```
gcd in stdlib: 3132 micro second
gcdLAR: 3992 micro second
gcdLAR2: 4049 micro second
gcdLAR3: 5015 micro second
gcdLAR4: 3416 micro second
gcdSub: 3889 micro second
gcdSub2: 3165 micro second
```

Here the gcd in the stdlib is the fastest and the gcdLARx variants are slower. So it seems gcdLARx and gcdSub are not good enough to replace the gcd in the stdlib.

I think this is because `mod` (idiv/div) is slower than other instructions on the Intel Ivy Bridge CPU, so reducing the number of times `mod` is issued made it faster even though some instructions were added, and `gcdSub` is the fastest because it doesn't use `mod` at all. On the Raspberry Pi 3 CPU, `mod` is not slower than other instructions, so the instructions added in gcdLAR make the gcd slower.

Do you have any ideas to speed up the Euclidean algorithm beyond my code (except using SIMD, multi-threading or a GPU to run multiple gcds simultaneously)? Which gcd func is fastest on your machine?
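To make the least-absolute-remainder idea concrete, here is a minimal sketch built on Nim's default `mod`. It is not the code from the gist; the name `gcdLeastAbsRem` is just illustrative. It folds the remainder into the range (-y/2, y/2] by replacing it with y - r, which has the same gcd and keeps everything non-negative:

```nim
# Sketch of a Euclidean gcd using the least absolute remainder.
# Not the gist's gcdLARx; just an illustration of the idea.
proc gcdLeastAbsRem(a, b: int): int =
  var (x, y) = (abs(a), abs(b))
  while y != 0:
    var r = x mod y        # default remainder, 0 <= r < y
    if r > y div 2:
      r = y - r            # gcd(y, r) == gcd(y, y - r), so this is safe
    x = y
    y = r
  result = x

when isMainModule:
  doAssert gcdLeastAbsRem(104, 66) == 2   # runs in 4 iterations, as in the example above
  doAssert gcdLeastAbsRem(0, 5) == 5
```

Using y - r instead of the negative remainder r - y avoids having to handle negative values inside the loop while giving the same iteration count.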
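And here is a minimal sketch of the trailing-zero-bits technique that gcdSub is based on, i.e. a binary (Stein-style) gcd. Again, this is not the gist's gcdSub; `gcdBinary` and the `uint64` parameter type are my own choices for the illustration:

```nim
import std/bitops

# Sketch of a subtraction-based binary gcd that strips trailing zero
# bits each round instead of using `mod`.
proc gcdBinary(a, b: uint64): uint64 =
  var (x, y) = (a, b)
  if x == 0: return y
  if y == 0: return x
  # Common power of two shared by x and y; restored at the end.
  let shift = countTrailingZeroBits(x or y)
  x = x shr countTrailingZeroBits(x)
  while true:
    y = y shr countTrailingZeroBits(y)   # make y odd
    if x > y:
      swap x, y
    y -= x            # both odd here, so the difference is even
    if y == 0:
      break
  result = x shl shift

when isMainModule:
  doAssert gcdBinary(104'u64, 66'u64) == 2
  doAssert gcdBinary(12'u64, 18'u64) == 6
```

Because both operands are odd before the subtraction, the difference is always even, so at least one trailing zero bit is stripped per round and no division instruction is ever issued.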