For some reason it seems like no one cares about optimizing matrix multiplication for integers. I started writing my own so that my tensor library isn't painfully slow and there is no need to cast to float.
For future optimizations I'd like to know the following:

1. It doesn't seem possible to pass a specific GCC/Clang path to the Nim compiler, right? Clang on OSX doesn't support OpenMP, so I'd like to configure compilation to use the default clang-omp installed by Homebrew.
2. I saw the unroll pragma, and the docs say it is parsed but ignored. Is that still true? It would be very helpful.
3. Is Nim compiled with `-march=native` automatically? (I can use `--passC` otherwise, I guess.)
4. How do I force a specific memory alignment?
5. Is there a way to get the L1/L2 cache sizes at compile time?
6. Is there a way to check the number of registers and their size at compile time? (Bonus points if I can feed that to the unroll pragma.)

I've based my code on [pure C ulmBLAS](http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/page02/index.html#toc4). The big difference is that instead of pointer arithmetic, I'm passing a var array and an offset. I might change to pointers because the current approach is cumbersome: I need to write `array[i + offset]` and `offset += increment` everywhere in my code.

Last thing: I'm currently using global var arrays, [here](https://github.com/mratsim/Arraymancer/blob/63560bc5c55cae33f023c57fa7b2077d56b03d0b/src/arraymancer/fallback/blas_l3_gemm.nim#L49). If I instead declare them directly in the proc, [here](https://github.com/mratsim/Arraymancer/blob/63560bc5c55cae33f023c57fa7b2077d56b03d0b/src/arraymancer/fallback/blas_l3_gemm.nim#L81), the program compiles but I get "Segmentation fault 11" at runtime. I will try to get a small test case.
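For questions 1 and 3, here is a nim.cfg sketch of what I have in mind — this is a guess, assuming Homebrew installs the binary under the name `clang-omp` and that `clang.exe`/`clang.linkerexe` let me point Nim at it:

```
# nim.cfg — hypothetical setup (binary name is an assumption)
cc = clang
clang.exe = "clang-omp"
clang.linkerexe = "clang-omp"
passC = "-march=native -fopenmp"
passL = "-fopenmp"
```

The same should be expressible on the command line with `--cc:clang --clang.exe:"clang-omp" --passC:"-march=native -fopenmp"`.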
