I've updated my Laser benchmarks to include Manu:
* bench: [https://github.com/numforge/laser/blob/e660eeeb/benchmarks/gemm/gemm_bench_float64.nim#L215-L248](https://github.com/numforge/laser/blob/e660eeeb/benchmarks/gemm/gemm_bench_float64.nim#L215-L248)
Unfortunately, Manu is about **250x slower** than OpenBLAS and about 200x slower than
Laser when multiplying two 960x960 matrices. My machine has 18 cores, so even
single-threaded the difference would still be over 10x.
That absolute gap grows with matrix size, since GEMM work scales as n^3.
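To make the cubic-growth claim concrete, here is a minimal sketch (in Python for illustration; the benchmarks themselves are Nim) of the FLOP count for a square GEMM:

```python
def gemm_flops(n: int) -> int:
    """FLOPs for multiplying two n x n matrices: one multiply plus one
    add per inner-loop step, i.e. 2 * n^3 total."""
    return 2 * n ** 3

# The 960x960 case from the benchmark: 1_769_472_000 FLOPs,
# i.e. the "1769.472 millions" reported below.
print(gemm_flops(960))

# Doubling n multiplies the work (and thus the absolute time gap
# between a slow and a fast implementation) by 8.
print(gemm_flops(1920) // gemm_flops(960))
```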
# Run 1: OpenBLAS vs Manu
```
A matrix shape: (M: 960, N: 960)
B matrix shape: (M: 960, N: 960)
Output shape: (M: 960, N: 960)
Required number of operations: 1769.472 millions
Required bytes: 14.746 MB
Arithmetic intensity: 120.000 FLOP/byte
Theoretical peak single-core: 112.000 GFLOP/s
Theoretical peak multi: 2016.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.
Due to strange OpenMP interferences, separate the run of code-sections
using OpenMP, see https://github.com/numforge/laser/issues/40

OpenBLAS benchmark
Collected 10 samples in 0.033 seconds
Average time: 3.256 ms
Stddev time: 0.567 ms
Min time: 2.910 ms
Max time: 4.715 ms
Perf: 543.396 GFLOP/s

Display output[0] to make sure it's not optimized away
232.3620566397699

Manu implementation
Collected 10 samples in 8.477 seconds
Average time: 847.700 ms
Stddev time: 10.644 ms
Min time: 842.805 ms
Max time: 877.909 ms
Perf: 2.087 GFLOP/s

Display output[0] to make sure it's not optimized away
237.8399578000516
```
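The header numbers and the ~250x figure can be sanity-checked from the run above (a quick Python sketch; the timings are copied from the benchmark output):

```python
M = N = K = 960
flops = 2 * M * N * K          # 1_769_472_000 -> "1769.472 millions"
in_bytes = 2 * M * K * 8       # two float64 input matrices -> 14.746 MB
intensity = flops / in_bytes   # 120 FLOP/byte, as reported

openblas_ms = 3.256            # average time, OpenBLAS
manu_ms = 847.700              # average time, Manu
slowdown = manu_ms / openblas_ms          # ~260x
gflops = flops / (openblas_ms / 1000) / 1e9  # ~543 GFLOP/s

print(intensity, round(slowdown), round(gflops))
```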
# Run 2: Laser vs Manu
```
Laser production implementation
Collected 10 samples in 0.041 seconds
Average time: 4.008 ms
Stddev time: 5.121 ms
Min time: 2.232 ms
Max time: 18.579 ms
Perf: 441.537 GFLOP/s

Display output[0] to make sure it's not optimized away
232.36205663977

Manu implementation
Collected 10 samples in 8.490 seconds
Average time: 848.983 ms
Stddev time: 0.997 ms
Min time: 847.062 ms
Max time: 850.112 ms
Perf: 2.084 GFLOP/s

Display output[0] to make sure it's not optimized away
237.8399578000516
```
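For context, the ~2 GFLOP/s numbers are what a textbook triple-loop multiply typically achieves. Below is a minimal Python sketch of that pattern (Manu's actual Nim code may differ in detail); optimized kernels like OpenBLAS and Laser add tiling, packing, SIMD, and multithreading on top of this same arithmetic:

```python
def naive_matmul(a, b):
    """Textbook i-j-p triple loop over row-major lists of lists.
    Does 2*n^3 FLOPs with no blocking or vectorization."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out
```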
I expect the original Jama to be even worse, probably around 750x to 1250x
slower than a proper BLAS, given the usual 3x to 5x speed difference between
Java and C/Nim.