For reference, I adapted that Araq benchmark to monitor run-to-run variability:
    
    
    import times, strutils, stats, cligen
    
    type Msg = seq[byte]
    type Buffer = seq[Msg]
    
    proc mkMessage(n: int): Msg =
      # Allocate a fresh 1 KiB message filled with the low byte of `n`.
      result = newSeq[byte](1024)
      for i in 0 ..< result.len:
        result[i] = byte(n)
    
    proc pushMsg(b: var Buffer, highID: int,
                 worst: var float, window: int) =
      # Time one allocate+store into the ring buffer; track the worst pause.
      let t0 = epochTime() * 1e9
      
      let m = mkMessage(highID)
      when not defined(gcDestructors):
        shallowCopy(b[highID mod window], m)
      else:
        # new runtime can move this, no need for 'shallowCopy'
        b[highID mod window] = m
      
      #GC_step(500) # 0.5ms
      worst = max(worst, epochTime()*1e9 - t0)
    
    proc gcLat*(sizeKiB=1024, nMsg=1_000_000, avgN=5,
                minN=10, window=200_000, count=1_000_000) =
      # Note: sizeKiB and nMsg are accepted but currently unused; `count`
      # drives the inner loop and messages are fixed at 1024 bytes.
      proc fmt(x = 0.0, n = 3): auto = formatFloat(x, ffDecimal, n)
      #GC_disable()
      when not defined(gcDestructors):
        when declared(GC_setMaxPause):
          GC_setMaxPause(100)
        GC_disableMarkAndSweep()
      
      var dtMins: RunningStat
      for avgIt in 1..avgN:
        var dtMin = float.high
        for minIt in 1..minN:
          
          var worst = 0.0
          var b = newSeq[Msg](window)
          for i in 0 ..< count:
            pushMsg(b, i, worst, window)
          
          dtMin = min(dtMin, worst)
        dtMins.push dtMin
      echo "min(", minN, ")Worst(", count, "): ",
           fmt(dtMins.mean), " +- ",
           fmt(dtMins.standardDeviationS), " ns"
      echo "min(", avgN, ")min(", minN, ")Worst(", count,
           "): ", fmt(dtMins.min), " ns"
      echo GC_getStatistics()
    
    dispatch(gcLat)
    
    

I was careful to turn `--opt:speed` on explicitly, since my nim.cfg seemed to 
imply `--opt:speed` for `-d:danger` but not for `-d:release`. Indeed, what I ran was: 
    
    
    nim c --verbosity:2 -d:release --cc:gcc --opt:speed --gc:arc --out:gcLat-release gcLat.nim
    
    

and similarly with `-d:release -d:danger` so I could be visually sure from the 
command lines that gcc was running with `-O3`.

Then I got these results. I had to run it a few times to get standard 
deviations below 50% of the measured number. (Basically, a **_lot_** of 
things on any multi-tasking system take 3 microseconds here and there, many 
of them happen semi-regularly, and all hope in the reliability of these 
numbers rests on that "semi-" and/or on those things staying desynchronized 
from the benchmark.) 
    
    
    $ ./gcLat-danger
    min(10)Worst(1000000): 2921.769 +- 45.925 ns
    min(5)min(10)Worst(1000000): 2857.093 ns
    [GC] total memory: 284450816
    [GC] occupied memory: 214422056
    
    $ ./gcLat-release
    min(10)Worst(1000000): 2973.351 +- 155.380 ns
    min(5)min(10)Worst(1000000): 2744.125 ns
    [GC] total memory: 284450816
    [GC] occupied memory: 214422056
    
    

In this test anyway, the `-d:danger` vs `-d:release` gap for `--gc:arc` was only 
2857/2744 = 1.04X, i.e. within noise (and the release build actually posted the 
lower minimum). It's best to use the min(min(worst)) to compare; the mean +- 
sdev is more a tool to decide how well you suppressed system noise. I 
definitely confirmed it's not easy to suppress that noise on my mostly idle 
system. If you can reboot into single-user mode and do that `chrt` stuff, then 
you may need fewer trials to get consistent behavior, and with `isolcpus` even 
fewer.
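For concreteness, the pinning/priority recipe I mean is roughly the following (a sketch, assuming Linux with util-linux; the core number 3 and priority 99 are arbitrary choices, and `SCHED_FIFO` needs root or `CAP_SYS_NICE`):
    
    
    # Pin the benchmark to one core and give it SCHED_FIFO real-time priority.
    sudo taskset -c 3 chrt -f 99 ./gcLat-release
    
    # For still less jitter, boot with the kernel parameter
    #   isolcpus=3
    # so the scheduler keeps other tasks off that core, then taskset onto it.
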

Also, 3 microseconds seems pretty good: equivalent to about 40 DIMM accesses on 
the machine I was testing on. I also timed the old `-d:useRealTimeGC` modes 
(without and with the max pause), but in my multi-user, network-on, 
firefox-running setup I got much more inconsistent numbers: 
    
    
    $ ./gcLat-releaseRT
    min(10)Worst(1000000): 47747507.585 +- 90770361.271 ns
    min(5)min(10)Worst(1000000): 9039.414 ns
    [GC] total memory: 68478275584
    [GC] occupied memory: 52880335344
    [GC] collections: 0
    [GC] max threshold: 0
    [GC] freed objects: 0
    [GC] max stack size: 0
    
    $ ./gcLat-dangerRT
    min(10)Worst(1000000): 98322406.115 +- 207979237.712 ns
    min(5)min(10)Worst(1000000): 14035.128 ns
    [GC] total memory: 68478275584
    [GC] occupied memory: 52880335344
    [GC] collections: 0
    [GC] max threshold: 0
    [GC] freed objects: 0
    [GC] max stack size: 0
    
    $ ./gcLat-dangerRTmp
    min(10)Worst(1000000): 124940125.135 +- 188809651.190 ns
    min(5)min(10)Worst(1000000): 5578.240 ns
    [GC] total memory: 68478275584
    [GC] occupied memory: 52880335344
    [GC] collections: 0
    [GC] max threshold: 0
    [GC] freed objects: 0
    [GC] max stack size: 0
    
    $ ./gcLat-releaseRTmp
    min(10)Worst(1000000): 180474627.575 +- 254640731.447 ns
    min(5)min(10)Worst(1000000): 8694.845 ns
    [GC] total memory: 68478275584
    [GC] occupied memory: 52880335344
    [GC] collections: 0
    [GC] max threshold: 0
    [GC] freed objects: 0
    [GC] max stack size: 0
    
    

Something to note is that the `--gc:arc` runs take about 7 seconds for the 50 
outer-loop min trials while the `--gc:useRealTimeGC` runs take more like 90 
seconds. So the time window for "system shenanigans" to happen inside the max 
loop is about 13X shorter for `--gc:arc`: 7 s/50 = 140 ms vs 90 s/50 = 1.8 s. 
So the crazier variability may just come from that.

As mentioned before, shrinking the inner max test so it takes no more than 
10-50 ms instead of 1.8 s would probably help a lot. The longer (and the more 
times) you dare your system not to intervene with microsecond-scale activity 
noise, the more it will take you up on your dare and the worse your numbers 
will be. :-) @disruptek's `golden` package iterates until the sdev is really 
small, but in this case that could take almost literally forever. Better to 
make the system-quiescence challenge easier. (But if you make it too easy you 
may be testing in pure L3/L2/L1-type memory hierarchy levels.)
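Since the program uses cligen's `dispatch`, the knobs are already command-line flags, so the inner loop can be shrunk without recompiling. The flag values below are just illustrative guesses at hitting a 10-50 ms trial on my machine:
    
    
    # cligen's dispatch() generates --count, --minN, --window, etc. from
    # gcLat's parameters; a smaller count shortens each worst-case window.
    ./gcLat-release --count=20000 --window=20000 --minN=50

(Shrinking `window` along with `count` keeps the ring buffer getting fully recycled, but per the caveat above, too small a working set means you are really benchmarking the cache hierarchy.)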

Also, it seems the units for `GC_setMaxPause` are microseconds, while the 
actual worst cases are nowhere near 100 microseconds (except for system noise). 
So maybe all 4 of those numbers are really just sampling "the same thing", and 
they vary from 5.6 to 14.0 microseconds, almost a 3X range even after taking 
the min over 50 trials, backing up how tricky a measurement problem this can be 
(not that this is disputed). But the main worry seemed to be about `-d:release` 
vs `-d:danger` in `--gc:arc` mode, and I think that was a noise-induced phantom 
concern (or possibly down to C compiler options).
