Unfortunately, creating good benchmarks is hard. The benchmark above has some
subtle faults that lessen its effectiveness.
First off, MyBuffer is a tuple type:
type
  MyBuffer = tuple
    d: array[128, int]
    len: int
Tuple types behave like object types: they are value types, so they are
allocated on the stack by default. They only end up on the heap when they are
part of a reference type (or explicitly allocated with new).
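For example, the same tuple moves to the heap as soon as it sits behind a ref.
This is a small illustrative sketch, not part of the benchmark; MyBufferRef is
a name I'm introducing here:

```nim
type
  MyBuffer = tuple
    d: array[128, int]
    len: int
  MyBufferRef = ref MyBuffer  # hypothetical ref wrapper, for illustration

proc demo() =
  var onStack: MyBuffer  # value type: lives in demo's stack frame
  var onHeap: MyBufferRef
  new(onHeap)            # ref type: goes through Nim's heap allocator
  onStack.len = 1
  onHeap.len = 1         # fields are accessed through the ref implicitly

demo()
```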
Because MyBuffer is a tuple, all tx() has to do is allocate ~129 integers'
worth of memory on the stack, which is a simple bump allocation: the program
just adjusts the current stack pointer by a fixed amount. A better way to test
Nim's allocator is to compare it with the system malloc implementation.
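To see why stack allocation is so cheap: a bump allocator amounts to advancing
an offset into pre-reserved memory, roughly like the sketch below. This is
illustrative only, not how the stack (or Nim's allocator) is actually
implemented:

```nim
var
  arena: array[4096, byte]  # pre-reserved block, standing in for the stack
  offset = 0

proc bumpAlloc(size: int): pointer =
  # "Allocation" is just bumping the offset: no bookkeeping, no searching.
  result = addr arena[offset]
  offset += size

let buf = bumpAlloc(129 * sizeof(int))  # ~129 ints, for the cost of one addition
```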
Below is my version of the above benchmark. I haven't tested it on Windows;
the worst that might happen is that you end up with a rather large file called
'nul' full of numbers.
import times, random

proc malloc(size: uint): pointer {.header: "<stdlib.h>", importc: "malloc".}
proc free(p: pointer) {.header: "<stdlib.h>", importc: "free".}

const bufferSize = 128

type
  MyBuffer = array[bufferSize, int]

proc testStackAllocation(): int =
  var s: MyBuffer
  result = cast[int](addr s)

proc testSequenceAllocation(): int =
  var s = newSeqOfCap[int](bufferSize)
  result = cast[int](addr s)

proc testMallocAllocation(): int =
  var res = malloc(uint(sizeof(MyBuffer)))
  free(res)
  result = cast[int](res)

proc testRandom(): int =
  result = random(7)
proc main =
  # Use writing to /dev/null to prevent compiler optimizations
  when defined(posix):
    let nullfh = open("/dev/null", fmReadWrite)
  else:
    let nullfh = open("nul") # untested!

  var baseline: float

  # Establish a baseline time of a really simple operation + writing to stdout.
  # This way we can essentially measure how fast stdout can be written to, and
  # factor that out of other measurements.
  let z = cpuTime()
  for i in 0..10_000_000:
    nullfh.write(i)
  baseline = cpuTime() - z
  echo "Baseline time: ", baseline

  # Template to run test procedures.
  template runProc(testProc: typed, testName: string): untyped =
    let t = cpuTime()
    for _ in 0..10_000_000:
      var i = testProc()
      nullfh.write(i)
    echo "Time for ", testName, ": ", (cpuTime() - t) - baseline

  # Test:
  # - Stack allocation (which is usually a bump allocator)
  # - Malloc allocation (system defined)
  # - Nim allocation
  # - Random number generation
  runProc(testStackAllocation, "stack allocation test")
  runProc(testSequenceAllocation, "sequence allocation test")
  runProc(testMallocAllocation, "malloc allocation test")
  runProc(testRandom, "random number generation test")

main()
main()
Output using various GC backends (macOS Sierra, 2.5 GHz Intel Core i7):
# nim c -d:release --passC:"-flto" --passL:"-flto" --gc:markAndSweep benchmark.nim && ./benchmark
Baseline time: 1.072926
Time for stack allocation test: 0.1643110000000001
Time for sequence allocation test: 1.142116
Time for malloc allocation test: 0.8133139999999999
Time for random number generation test: -0.1466420000000002

# nim c -d:release --passC:"-flto" --passL:"-flto" benchmark.nim && ./benchmark
Baseline time: 1.053752
Time for stack allocation test: 0.1456770000000001
Time for sequence allocation test: 1.227938
Time for malloc allocation test: 0.8247639999999994
Time for random number generation test: -0.1094160000000002

# Version of the benchmark with mark and sweep cycle collection disabled
# nim c -d:release --passC:"-flto" --passL:"-flto" benchmark.nim && ./benchmark
Baseline time: 1.075432
Time for stack allocation test: 0.146002
Time for sequence allocation test: 1.189319
Time for malloc allocation test: 0.7623239999999996
Time for random number generation test: -0.1765280000000002
As you can see, Nim's allocator is slower than the system malloc allocator, but
not by much (I chalk some of this up to the compiler being able to use better
inlining and intrinsics for malloc). Neither comes close to stack allocation,
but again, that's expected.