Possibly harmful to result interpretation, esp. across CPUs like RPi/ARM: taking the minimum over `101` runs in your `bench` template is good for reducing noise from the CPU spinning up to higher clock rates, BUT the two `sleep()`s are bad since you are probably giving the OS time to drop the CPU back into a lower power mode. Essentially, you are doing one thing to make results _less_ sensitive to other work happening on the system and another to make them _more_ sensitive. Unless you have a very specific workload in mind, it's usually best to pick a direction.

Your comment in that `bench` template mentions thermal throttling, but this highly local CPU benchmark (i.e. heating up at most the center part of one core) can probably fruitfully be run in mere milliseconds, which should be fast enough not to overheat anything and still give many hundreds of iterations for branch predictors to warm up. So, I think that is the better direction to go here, and I changed the `bench` template to:
    
    
import std/[monotimes, os, strutils]  # getMonoTime, sleep, formatFloat

    template bench(scale: float; repeat, ms1, ms2: int; label: string;
                   init, body) =
      var minTime = float.high
      for i in 1 .. repeat:
        init                            # Reset state (e.g. re-seed the RNG).
        let start = getMonoTime().ticks
        body
        let finish = getMonoTime().ticks
        minTime = min(minTime, scale*(finish.float - start.float))
        if ms1 > 0: sleep(ms1)          # Maybe prevent thermal throttling.
      echo formatFloat(minTime, ffDecimal, 2), label
      if ms2 > 0: sleep(ms2)
    
    

and then later pass in `1.0/n` as the scale & such, but maybe most importantly 0 sleeps, coalescing your highly duplicated code like:
    
    
import std/random                 # Rand, initRand, rand

    var
      rand: Rand
      res0, res1, res2, res3, res4, res5, res6: int
    template timeIt(it, label, res; n = 10000) =
      stdout.write label, ": "
      bench(1.0/n.float, 10, 0, 0, " ns/pair"):
        rand = seed.initRand()        # `seed` & the gcd variants are defined
      do:                             #  earlier in the full program.
        for i in 0 ..< n:             # n iterations, matching the 1.0/n scale
          var (x, y) = (rand.rand(1 .. int.high), rand.rand(1 .. int.high))
          if x < y:
            swap x, y
          res = res xor it(x, y)      # xor keeps the work from being elided
    timeIt(gcd    , "gcdSL  ", res0)
    timeIt(gcdLAR , "gcdLAR ", res1)
    timeIt(gcdLAR2, "gcdLAR2", res2)
    timeIt(gcdLAR3, "gcdLAR3", res3)
    timeIt(gcdLAR4, "gcdLAR4", res4)
    timeIt(gcdSub , "gcdSub ", res5)
    timeIt(gcdSub2, "gcdSub2", res6)
    
    
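For readers without the parent thread: the `gcd*` names above are the implementations under test there and are not defined in this excerpt. Purely as a hypothetical stand-in (my sketch, not the thread's actual code), a subtraction-based binary GCD in the spirit of `gcdSub` might look like:


    import std/bitops                 # countTrailingZeroBits

    proc gcdSubSketch(x, y: int): int =
      ## Stein's binary GCD via shifts & subtraction; assumes x, y >= 1.
      var (u, v) = (x, y)
      let shift = countTrailingZeroBits(u or v)  # shared factors of two
      u = u shr countTrailingZeroBits(u)         # make u odd
      while v != 0:
        v = v shr countTrailingZeroBits(v)       # make v odd again
        if u > v: swap u, v                      # keep u <= v
        v -= u                                   # difference stays >= 0
      result = u shl shift                       # restore shared twos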

With these amendments I get results that are sort of all over the map: Sub, LAR4, or Sub2 wins depending on the CPU, with some large-ish profile-guided optimization (PGO) effects { other backend compilers like `clang`, as well as compilation modes/flags, may also show interesting variation }:
    
    
    SkyLake_i7-6700k
    gcdSL  : 357.67 ns/pair  gcdSL  : 357.72 ns/pair
    gcdLAR : 339.32 ns/pair  gcdLAR : 339.24 ns/pair
    gcdLAR2: 281.59 ns/pair  gcdLAR2: 281.57 ns/pair
    gcdLAR3: 264.93 ns/pair  gcdLAR3: 264.95 ns/pair
    gcdLAR4: 280.62 ns/pair  gcdLAR4: 280.59 ns/pair
    gcdSub : 179.25 ns/pair  gcdSub : 180.16 ns/pair <--
    gcdSub2: 260.33 ns/pair  gcdSub2: 260.85 ns/pair
    
    SkyLake_i7-6700k gcc-PGO
    gcdSL  : 333.78 ns/pair  gcdSL  : 333.88 ns/pair
    gcdLAR : 304.13 ns/pair  gcdLAR : 304.07 ns/pair
    gcdLAR2: 256.13 ns/pair  gcdLAR2: 256.16 ns/pair
    gcdLAR3: 240.19 ns/pair  gcdLAR3: 240.29 ns/pair
    gcdLAR4: 256.57 ns/pair  gcdLAR4: 256.57 ns/pair
    gcdSub : 140.40 ns/pair  gcdSub : 140.56 ns/pair <--
    gcdSub2: 236.12 ns/pair  gcdSub2: 236.09 ns/pair
    
    AlderLake_i7-1370P - same exact binary executable
    gcdSL  : 149.03 ns/pair  gcdSL  : 149.03 ns/pair
    gcdLAR : 164.41 ns/pair  gcdLAR : 164.27 ns/pair
    gcdLAR2: 134.78 ns/pair  gcdLAR2: 134.84 ns/pair
    gcdLAR3: 149.74 ns/pair  gcdLAR3: 149.57 ns/pair
    gcdLAR4: 126.55 ns/pair  gcdLAR4: 126.65 ns/pair <--
    gcdSub : 149.52 ns/pair  gcdSub : 149.50 ns/pair
    gcdSub2: 118.77 ns/pair  gcdSub2: 118.95 ns/pair
    
    AlderLake_i7-1370P - new executable w/march=native
    gcdSL  : 149.95 ns/pair  gcdSL  : 149.96 ns/pair
    gcdLAR : 165.65 ns/pair  gcdLAR : 165.70 ns/pair
    gcdLAR2: 133.97 ns/pair  gcdLAR2: 133.76 ns/pair
    gcdLAR3: 145.80 ns/pair  gcdLAR3: 146.17 ns/pair
    gcdLAR4: 126.92 ns/pair  gcdLAR4: 126.71 ns/pair
    gcdSub : 149.01 ns/pair  gcdSub : 148.38 ns/pair
    gcdSub2: 118.38 ns/pair  gcdSub2: 119.52 ns/pair <--
    
    AlderLake_i7-1370P gcc-PGO
    gcdSL  : 138.54 ns/pair  gcdSL  : 139.66 ns/pair
    gcdLAR : 156.41 ns/pair  gcdLAR : 155.98 ns/pair
    gcdLAR2: 124.87 ns/pair  gcdLAR2: 124.76 ns/pair
    gcdLAR3: 134.76 ns/pair  gcdLAR3: 134.70 ns/pair
    gcdLAR4: 115.09 ns/pair  gcdLAR4: 114.69 ns/pair
    gcdSub : 136.81 ns/pair  gcdSub : 136.13 ns/pair
    gcdSub2: 107.85 ns/pair  gcdSub2: 107.75 ns/pair <--
    
    

With times short enough not to worry about thermals, noise may be a problem. This is why I paired up two "shell up-arrow" runs above. To reduce noise, you might also pin the process to one CPU, as sketched below (which I did for all my runs above, just with `taskset` and `chrt` on Linux; Linux also has `perf` to track things like CPU migrations & IPC & such).
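A minimal sketch of that pinning, assuming a hypothetical benchmark binary `./gcdbench` and a free logical CPU 3 (`chrt` needs suitable privileges):


    # Pin to logical CPU 3 & run under the SCHED_FIFO real-time scheduler:
    taskset -c 3 chrt -f 99 ./gcdbench
    # Count CPU migrations & compute IPC from instructions/cycles:
    perf stat -e cpu-migrations,instructions,cycles ./gcdbench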

Of course, to even assess whether you are reducing noise, you need to measure it (as with the pairing above). If all that pairing (& reading!) seems like a painfully manual way to measure run-to-run variability, I agree. See 
[bu/tim](https://github.com/c-blake/bu/blob/main/doc/tim.md) for nicer ideas { mostly warm-up runs to drop, principled sample-min extension, and principled error bars; of course, that page also mentions that error bars alone are inadequate due to kurtosis, but something is better than nothing }. You do need _some_ kind of indication of that variability for readers to make principled comparisons; one cheap, in-program gesture at it is sketched below. My guess is you just fiddled with your `101` and `n` until things seemed stable, but that stability on one CPU/OS may not transfer to another, and besides, "we" have no way to know what you did or how stable it looked.
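As one such cheap gesture (a sketch under the same assumptions as `bench` above; my naming, and not `bu/tim`'s method), the template can keep every repeat's time and report the min together with the median:


    import std/[algorithm, monotimes, strutils]

    template benchSpread(scale: float; repeat: int; label: string; init, body) =
      var ts = newSeq[float](repeat)    # keep all samples, not just the min
      for i in 0 ..< repeat:
        init
        let start = getMonoTime().ticks
        body
        let finish = getMonoTime().ticks
        ts[i] = scale*(finish.float - start.float)
      ts.sort                           # ts[0] = min; middle element ~ median
      echo formatFloat(ts[0], ffDecimal, 2), " min, ",
           formatFloat(ts[repeat div 2], ffDecimal, 2), " median", label


The min-to-median gap is a crude, kurtosis-blind spread indicator, but as above, something is better than nothing.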

That is all just commentary about time measurement & reporting on one CPU arch, since that kind of thing matters for the sorts of minor-refinement comparisons in play. For a very small, tight arithmetic loop like this, I do not think you should expect differing CPUs to give similar rankings, or expect "operation count" (your `gcdCount`) to translate cleanly to "real time". The version control history (or even current state) of the GMP library would probably be informative, although also likely very ugly.
