Here is what I looked at before I got bored: 
<https://github.com/treeform/raytracer/blob/master/nim/story.md>

Nim slower than C? How is that possible? Let's see:
    
    
    CPU time [ms] 2018.0

Oh, the Nim was compiled in debug mode, while the C got -O3... that will not do. 
Debug mode inserts huge stack-trace bookkeeping into every function call - much 
easier to debug, but so slow!

**WOW x10 improvement!**
    
    
    nim c -r -d:release
    CPU time [ms] 188.0

But really, you know what is better than release mode? **Danger mode** - got to 
live dangerously!
    
    
    nim c -r -d:danger
    CPU time [ms] 196.0

Wait, danger mode is slower? Let's try running it a couple more times...
    
    
    CPU time [ms] 198.0
    CPU time [ms] 209.0
    CPU time [ms] 242.0
    CPU time [ms] 181.0
    CPU time [ms] 195.0

Wow, there is so much variance in this test. You can't really learn anything from 
a single benchmark run... bench, benchy? That's right, I wrote a benchmarking 
library exactly for this reason!
    
    
    import benchy

And then let's put the work into a function. Did you know that code inside a 
function can be better optimized because it is more isolated from global 
state?
    
    
    import std/times # for cpuTime()
    
    proc main(): float =
      let t1 = cpuTime()
      var scene  = CreateScene()
      let width  = 500
      let height = 500
      let stride = width * 4
      var bitmapData = newSeq[RgbColor](width * height)
    
      RenderScene(scene, bitmapData, stride, width, height)
      let t2 = cpuTime()
      return (t2 - t1) * 1000 # render time in milliseconds
    
    timeIt "ray trace":
      keep main()
    
    
    name ............................... min time      avg time    std dv   runs
    ray trace ........................ 181.237 ms    191.066 ms   ±10.801    x26

OK, now we can actually measure this. Let's look at what vTune says the 
bottleneck is. Don't forget to add `--debugger:native` so that we get symbols in vTune.

Wow, ObjectIntersect... wait, what? Why are we setting `result` to something, 
just to clobber it again in the case statement?
    
    
    proc ObjectIntersect(obj: Thing, ray: Ray): Intersection =
      result = Intersection(thing: nil, ray: ray, dist: 0) # <---- slow part
      case obj.objectType:
        of Sphere:
          ...
          result.thing = obj
          result.ray   = ray
          result.dist  = dist
        of Plane:
          ...
          result.thing = obj
          result.ray   = ray
    
    

We can just, like, not do that? By default Nim inits objects to all zeros 
anyway.
    
    
    # result = Intersection(thing: nil, ray: ray, dist: 0)
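Here is a tiny sketch of that rule, on a made-up toy type (not the raytracer's actual code): `result` starts out zero-initialized, so you only need to set the fields that differ from zero.

```nim
type Foo = object
  a: int
  b: float

proc makeFoo(): Foo =
  # `result` is already all zeros here; no explicit constructor needed
  result.a = 1

# b was never touched, yet it is a well-defined 0.0
doAssert makeFoo().a == 1
doAssert makeFoo().b == 0.0
```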

Let's run it:
    
    
    name ............................... min time      avg time    std dv   runs
    ray trace ........................ 160.613 ms    164.691 ms    ±6.493    x30

Wow, saved 21 ms just on that line! That's huge. What's next, vTune? Fight me, bru!

ObjectIntersect is still at the top, but much better now. What parts of 
ObjectIntersect are slow?

My fear is that those functions are not getting inlined properly. Let's throw 
`{.inline.}` in there. The rule: if a function is small enough and is called 
often enough, we can inline it.
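For reference, the pragma just goes on the proc signature - a sketch on a hypothetical helper, not the raytracer's actual code:

```nim
# {.inline.} hints the Nim compiler (and the C compiler underneath) to
# inline this proc at its call sites, avoiding call overhead in hot loops.
proc dot(ax, ay, az, bx, by, bz: float): float {.inline.} =
  ax * bx + ay * by + az * bz

doAssert dot(1.0, 0.0, 0.0, 1.0, 0.0, 0.0) == 1.0
```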
    
    
    name ............................... min time      avg time    std dv   runs
    ray trace ........................ 160.544 ms    162.430 ms    ±3.049    x31

No change, I guess the compiler was smart enough to inline it all.

Let's try SIMD? We can just add `--passC:"-march=native"` and change nothing else:
    
    
    name ............................... min time      avg time    std dv   runs
    ray trace ........................ 157.290 ms    164.497 ms    ±8.950    x30

Oh great, a 3 ms win. Now we are faster than C. Great! Job done.

Next steps would be to review the algorithm, and maybe hand-roll the SIMD 
instructions. But I am happy with the speedups.
