I recently found several ways to implement the Euclidean algorithm. This code has 7 different gcd funcs that implement the Euclidean algorithm and measures their runtime speed:
<https://gist.github.com/demotomohiro/9680df5b8fdd48471af9378dc1fa56ad>

* gcd: the gcd copied from the math module in Nim's stdlib
* gcdLAR: a simple gcd using the least absolute remainder
* gcdLAR2: a gcd using the least absolute remainder, with many branches
* gcdLAR3: converts `mod` so it returns a remainder r such that -y/2 < r <= y/2
* gcdLAR4: a simpler gcd using the least absolute remainder
* gcdSub: almost the same as the gcd in <https://github.com/nim-lang/bigints>
* gcdSub2: the same as gcdSub except it uses `mod` instead of subtraction to reduce the number of iterations

I read this wiki article: <https://en.wikipedia.org/wiki/Euclidean_algorithm#Method_of_least_absolute_remainders>

I wanted to see whether a Euclidean algorithm that uses smaller remainders is faster than the `gcd` func in the math module in Nim's stdlib. In Nim (and in most programming languages I know), `a mod b` returns `r` such that `a = b * q + r (0 <= r < b)`. If `a mod b` instead returned `r` such that `a = b * q + r (-b/2 <= r < b/2)`, the number of iterations of the Euclidean algorithm can decrease, because the absolute values of the remainders tend to become smaller. But when a remainder is already smaller than b/2 with the default `mod`, it doesn't reduce the number of iterations. For example:

If 0 <= r < b:

```
104 mod 66 = 38
66 mod 38 = 28
38 mod 28 = 10
28 mod 10 = 8
10 mod 8 = 2
8 mod 2 = 0
```

If -b/2 <= r < b/2:

```
104 mod 66 = -28
66 mod -28 = 10
-28 mod 10 = 2
10 mod 2 = 0
```

(A sketch of this least-absolute-remainder idea is at the end of this post.)

My code runs each gcd func with random int values and counts the number of iterations when the following consts at the top of my code are set:

```nim
const
  EnableTests = false
  EnableCount = not EnableTests and true
  EnableBench = not EnableTests
```

Output of each counter:

```
gcdCount = 36707036
gcdLARCount = 25940133
gcdLAR2Count = 25940133
gcdLAR3Count = 25940133
gcdLAR4Count = 25940133
gcdSubCount = 43399700
gcdSub2Count = 22028403
```

gcdLAR reduced the number of iterations to about 71% of the stdlib gcd. And counting trailing zero bits and removing them (as `gcdSub` and `gcdSub2` do) removes even more iterations than least absolute remainders do (a sketch of that technique is also at the end of this post).

My code runs the benchmark with the following settings:

```nim
const
  EnableTests = false
  EnableCount = not EnableTests and false
  EnableBench = not EnableTests
```

```
nim c -r -d:release testgcd.nim
gcd in stdlib: 11068 micro second
gcdLAR: 10558 micro second
gcdLAR2: 10481 micro second
gcdLAR3: 7852 micro second
gcdLAR4: 9055 micro second
gcdSub: 5221 micro second
gcdSub2: 8204 micro second
```

On my Intel Ivy Bridge CPU, the least-absolute-remainder variants are faster than the gcd in the stdlib, but gcdSub is the fastest.

On a Raspberry Pi 3 (aarch64), however, I got a very different result:

```
gcd in stdlib: 3132 micro second
gcdLAR: 3992 micro second
gcdLAR2: 4049 micro second
gcdLAR3: 5015 micro second
gcdLAR4: 3416 micro second
gcdSub: 3889 micro second
gcdSub2: 3165 micro second
```

Here the gcd in the stdlib is the fastest and the gcdLARx variants are slower. So it seems gcdLARx and gcdSub are not good enough to replace the gcd in the stdlib.

I think this is because `mod` (idiv/div) is slower than other instructions on the Intel Ivy Bridge CPU, so reducing the number of times `mod` is issued made it faster even though some instructions were added, and `gcdSub` is the fastest because it doesn't use `mod` at all. On the Raspberry Pi 3 CPU, `mod` is not slower than other instructions, so the instructions added in gcdLAR make the gcd slower.

Do you have any ideas to speed up the Euclidean algorithm beyond my code (except using SIMD, multi-threading or a GPU to run multiple gcds simultaneously)? Which gcd func is fastest on your machine?
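To make the least-absolute-remainder idea concrete, here is a minimal sketch built on Nim's default `mod`. It is not the code from the gist; the name `gcdLeastAbsRem` is just illustrative. It folds the remainder into the range (-y/2, y/2] by replacing it with y - r, which has the same gcd and keeps everything non-negative:

```nim
# Sketch of a Euclidean gcd using the least absolute remainder.
# Not the gist's gcdLARx; just an illustration of the idea.
proc gcdLeastAbsRem(a, b: int): int =
  var (x, y) = (abs(a), abs(b))
  while y != 0:
    var r = x mod y        # default remainder, 0 <= r < y
    if r > y div 2:
      r = y - r            # gcd(y, r) == gcd(y, y - r), so this is safe
    x = y
    y = r
  result = x

when isMainModule:
  doAssert gcdLeastAbsRem(104, 66) == 2   # runs in 4 iterations, as in the example above
  doAssert gcdLeastAbsRem(0, 5) == 5
```

Using y - r instead of the negative remainder r - y avoids having to handle negative values inside the loop while giving the same iteration count.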
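And here is a minimal sketch of the trailing-zero-bits technique that gcdSub is based on, i.e. a binary (Stein-style) gcd. Again, this is not the gist's gcdSub; `gcdBinary` and the `uint64` parameter type are my own choices for the illustration:

```nim
import std/bitops

# Sketch of a subtraction-based binary gcd that strips trailing zero
# bits each round instead of using `mod`.
proc gcdBinary(a, b: uint64): uint64 =
  var (x, y) = (a, b)
  if x == 0: return y
  if y == 0: return x
  # Common power of two shared by x and y; restored at the end.
  let shift = countTrailingZeroBits(x or y)
  x = x shr countTrailingZeroBits(x)
  while true:
    y = y shr countTrailingZeroBits(y)   # make y odd
    if x > y:
      swap x, y
    y -= x            # both odd here, so the difference is even
    if y == 0:
      break
  result = x shl shift

when isMainModule:
  doAssert gcdBinary(104'u64, 66'u64) == 2
  doAssert gcdBinary(12'u64, 18'u64) == 6
```

Because both operands are odd before the subtraction, the difference is always even, so at least one trailing zero bit is stripped per round and no division instruction is ever issued.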