For reference I adapted that Araq benchmark to monitor run-to-run variability:
import times, strutils, stats, cligen

type
  Msg = seq[byte]
  Buffer = seq[Msg]

proc mkMessage(n: int): Msg =
  result = newSeq[byte](1024)
  for i in 0 ..< result.len:
    result[i] = byte(n)

proc pushMsg(b: var Buffer, highID: int,
             worst: var float, window: int) =
  let t0 = epochTime() * 1e9
  let m = mkMessage(highID)
  when not defined(gcDestructors):
    shallowCopy(b[highID mod window], m)
  else:
    # new runtime can move this; no need for 'shallowCopy'
    b[highID mod window] = m
  #GC_step(500) # 0.5ms
  worst = max(worst, epochTime()*1e9 - t0)

proc gcLat*(sizeKiB=1024, nMsg=1_000_000, avgN=5,
            minN=10, window=200_000, count=1_000_000) =
  proc fmt(x=0.0, n=3): auto = formatFloat(x, ffDecimal, n)
  #GC_disable()
  when not defined(gcDestructors):
    when declared(GC_setMaxPause):
      GC_setMaxPause(100)
    GC_disableMarkAndSweep()
  var dtMins: RunningStat
  for avgIt in 1..avgN:
    var dtMin = float.high
    for minIt in 1..minN:
      var worst = 0.0
      var b = newSeq[Msg](window)
      for i in 0 ..< count:
        pushMsg(b, i, worst, window)
      dtMin = min(dtMin, worst)
    dtMins.push dtMin
  echo "min(", minN, ")Worst(", count, "): ",
       fmt(dtMins.mean), " +- ",
       fmt(dtMins.standardDeviationS), " ns"
  echo "min(", avgN, ")min(", minN, ")Worst(", count,
       "): ", fmt(dtMins.min), " ns"
  echo GC_getStatistics()

dispatch(gcLat)
I was careful to pass `--opt:speed` explicitly, since my nim.cfg seemed to imply
`--opt:speed` for `-d:danger` but not for `-d:release`. What I actually ran was:
nim c --verbosity:2 -d:release --cc:gcc --opt:speed --gc:arc \
  --out:gcLat-release gcLat.nim
and similarly with `-d:release -d:danger`, to be visually sure from the command
lines that gcc was running with -O3.
Then I got these results. I had to run it a few times to get standard deviations
under 50% of the measured number. (Basically, a **_lot_** of things on any
multi-tasking system take 3 microseconds here and there, many of them
semi-regularly, and any hope in the reliability of these numbers rests on that
"semi-" and/or on those things being desynchronized from the benchmark.)
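For anyone puzzling over the `min(10)Worst` / `min(5)min(10)Worst` labels in the
output below, here is the estimator nesting sketched in Python, with a
hypothetical `one_trial_worst` simulating the inner `count`-loop of `pushMsg`
calls (the real program measures, it does not simulate):

```python
import random
import statistics

def one_trial_worst():
    # Stand-in for one inner trial: the worst (max) single-push latency
    # over `count` pushes. Simulated as a base cost plus noise spikes.
    return max(2800 + random.expovariate(1 / 50) for _ in range(1000))

def gc_lat(avg_n=5, min_n=10):
    dt_mins = []                 # one min-of-worst per outer iteration
    for _ in range(avg_n):
        # "min(minN)Worst": the best (lowest) worst-case of minN trials
        dt_min = min(one_trial_worst() for _ in range(min_n))
        dt_mins.append(dt_min)
    # reported as mean +- sdev over the avgN min-of-worst values,
    # plus the overall "min(avgN)min(minN)Worst" min-of-min
    return statistics.mean(dt_mins), statistics.stdev(dt_mins), min(dt_mins)

mean, sdev, best = gc_lat()
assert best <= mean  # min-of-min can never exceed the mean of the mins
```

The min-of-min is the number to compare across builds; the mean +- sdev over
the outer iterations mostly indicates how well system noise was suppressed.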
$ ./gcLat-danger
min(10)Worst(1000000): 2921.769 +- 45.925 ns
min(5)min(10)Worst(1000000): 2857.093 ns
[GC] total memory: 284450816
[GC] occupied memory: 214422056
$ ./gcLat-release
min(10)Worst(1000000): 2973.351 +- 155.380 ns
min(5)min(10)Worst(1000000): 2744.125 ns
[GC] total memory: 284450816
[GC] occupied memory: 214422056
In this test anyway, danger vs. release was only a 2857/2744 = 1.04X difference
for `--gc:arc`. It's best to use the min(min(worst)) to compare; the mean +-
sdev is more a tool to decide how well you suppressed system noise. I
definitely confirmed it's not easy to suppress that noise on my mostly idle
system. If you can reboot into single-user mode and do that `chrt` stuff, then
you may need fewer trials to get consistent behavior, and with `isolcpus` fewer
still.
Also, 3 microseconds seems pretty good, equivalent to about 40 DIMM accesses on
the machine I was testing on. I also timed the old `-d:useRealTimeGC` modes
(without and with the max pause), but in my multi-user, network-on,
firefox-running setup I got much more inconsistent numbers:
$ ./gcLat-releaseRT
min(10)Worst(1000000): 47747507.585 +- 90770361.271 ns
min(5)min(10)Worst(1000000): 9039.414 ns
[GC] total memory: 68478275584
[GC] occupied memory: 52880335344
[GC] collections: 0
[GC] max threshold: 0
[GC] freed objects: 0
[GC] max stack size: 0
$ ./gcLat-dangerRT
min(10)Worst(1000000): 98322406.115 +- 207979237.712 ns
min(5)min(10)Worst(1000000): 14035.128 ns
[GC] total memory: 68478275584
[GC] occupied memory: 52880335344
[GC] collections: 0
[GC] max threshold: 0
[GC] freed objects: 0
[GC] max stack size: 0
$ ./gcLat-dangerRTmp
min(10)Worst(1000000): 124940125.135 +- 188809651.190 ns
min(5)min(10)Worst(1000000): 5578.240 ns
[GC] total memory: 68478275584
[GC] occupied memory: 52880335344
[GC] collections: 0
[GC] max threshold: 0
[GC] freed objects: 0
[GC] max stack size: 0
$ ./gcLat-releaseRTmp
min(10)Worst(1000000): 180474627.575 +- 254640731.447 ns
min(5)min(10)Worst(1000000): 8694.845 ns
[GC] total memory: 68478275584
[GC] occupied memory: 52880335344
[GC] collections: 0
[GC] max threshold: 0
[GC] freed objects: 0
[GC] max stack size: 0
Something to note is that the `--gc:arc` runs take about 7 seconds for the 50
outer-loop min trials, while the `--gc:useRealTimeGC` runs take more like 90
seconds. So the time window for "system shenanigans" to land inside the max
loop is about 13X shorter for `--gc:arc` \-- 7s/50 = 140 ms vs 90s/50 = 1.8
seconds. The crazier variability may just come from that.
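As a quick sanity check on that arithmetic (50 = avgN * minN = 5 * 10 outer
trials):

```python
# Per-trial exposure window = total wall time / number of inner max-trials
trials = 5 * 10             # avgN * minN
arc_window = 7.0 / trials   # --gc:arc: ~7 s total  -> 0.14 s per trial
rt_window = 90.0 / trials   # realtime GC: ~90 s    -> 1.8 s per trial
print(arc_window, rt_window, rt_window / arc_window)  # ratio ~12.9, i.e. ~13X
```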
As mentioned before, shrinking the inner max test so it takes no more than
10-50 ms instead of 1.8 s would probably help a lot. The longer (and more
often) you dare your system not to intervene with microsecond-scale activity
noise, the more it will take you up on your dare and the worse your numbers
will be. :-) @disruptek's `golden` package iterates until the sdev is really
small, but in this case that could take almost literally forever. Better to
make the system-quiescence challenge easier. (Though if you make it too easy
you may just be testing pure L3/L2/L1 memory-hierarchy levels.)
Also, it seems the units for `GC_setMaxPause` are microseconds, while the
actual worst cases are nowhere near 100 microseconds (but for system noise).
So maybe all 4 of those numbers are really just sampling "the same thing", and
they vary from 5.6 to 14.0 microseconds, roughly a 2.5X range even with a min
of 50, backing up how tricky a measurement problem this can be (not that this
is disputed).
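Pulling the four min-of-min numbers out of the `useRealTimeGC` runs above makes
that spread concrete:

```python
# min(5)min(10)Worst values (ns) from the four useRealTimeGC runs above
mins_ns = {
    "releaseRT":   9039.414,
    "dangerRT":   14035.128,
    "dangerRTmp":  5578.240,
    "releaseRTmp": 8694.845,
}
lo, hi = min(mins_ns.values()), max(mins_ns.values())
print(lo / 1e3, hi / 1e3, hi / lo)  # 5.578 us .. 14.035 us, ~2.5X spread
```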
But the main worry seemed to be about `-d:release` vs `-d:danger` in `--gc:arc`
mode, and I think that was a noise-induced phantom concern (or possibly an
artifact of C compiler options).