If you have instructions to reproduce the benchmark in Nim and C++, I can help.

I just need a repo to clone, the scripts to run and the dataset. I already have 
Vapoursynth working.

Ideally you have a profiler like Intel VTune or Apple Instruments to dive into 
assembly.

For example, this is my approach to debugging a performance issue: 
[https://github.com/nim-lang/Nim/issues/9514](https://github.com/nim-lang/Nim/issues/9514)

For memory bottlenecks it's a bit different: I use the roofline model, as 
mentioned in my convolution optimization resources: 
[https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md#computational-complexity](https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md#computational-complexity)

For example, I know that matrix multiplication and convolution can reach 90% of 
the peak CPU GFlops because their arithmetic intensity is high (i.e. you do over 
10 operations (add/mul) per byte), so if you don't reach that performance it's 
because you are spending too much time moving bytes instead of computing with them.

The theoretical peak of your CPU is easy to compute:

  * single threaded:

> `CpuGhz * VectorWidth * InstrCycle * FlopInstr`

> for a CPU that supports AVX2 on float32 (so packing 8 float32) and can issue 
> 2 fused multiply-adds per cycle at 3 GHz, we have

`3 (GHz) * 8 (packed float32 in AVX) * 2 (FMA per cycle) * 2 (FMA = 1 add + 1 mul) = 96 GFlops`

  * multithreaded: just multiply the single-threaded result by the number of 
cores. For example, 10 cores would give 960 GFlops or 0.96 TFlops



And then the usual way to benchmark a numerical algorithm is: you know the 
number of operations your algorithm requires, you divide that by the time spent 
doing them, and you get your actual flops. Then you compare your actual flops 
with the theoretical peak. If you only reach 20% of the peak, you have a memory 
bottleneck and probably need to repack data before processing to optimize cache 
usage; if not, you need to look into SIMD vectorization, prefetching, ...

All of that is quite complex, so what I can do is match the performance of the 
naive C++ implementation.

Going beyond that is something I want to do, but it's time-consuming, and I feel 
my time would be better spent on an image processing compiler similar to what's 
discussed here: 
[https://github.com/mratsim/Arraymancer/issues/347#issuecomment-459351890](https://github.com/mratsim/Arraymancer/issues/347#issuecomment-459351890) 
and with a proof of concept here:

  * [https://github.com/numforge/laser/tree/master/laser/lux_compiler](https://github.com/numforge/laser/tree/master/laser/lux_compiler)
  * [https://github.com/numforge/laser/tree/master/laser/lux_compiler/core](https://github.com/numforge/laser/tree/master/laser/lux_compiler/core)

