For some reason it seems like no one cares about optimizing matrix 
multiplication for integers. I started writing my own so that my tensor library 
isn't painfully slow and doesn't need to cast to float.

For future optimizations I'd like to know the following:

  1. It doesn't seem possible to pass a specific GCC/Clang path to the Nim 
compiler, right? Clang on OSX doesn't support OpenMP, so I'd like to configure 
compilation to use the clang-omp installed by Homebrew.
  2. I saw the unroll pragma, and the docs say it's parsed but ignored. Is that 
still true? Having it work would be very helpful.
  3. Is Nim automatically compiled with -march=native? (Otherwise I guess I can 
use --passC.)
  4. How can I force a specific memory alignment?
  5. Is there a way to get the L1/L2 cache size at compile time?
  6. Is there a way to check the number of registers and their size at compile 
time? (Bonus if I can feed that to the unroll pragma.)
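
For questions 1 and 3, this is the kind of configuration I have in mind — a 
hypothetical nim.cfg sketch, assuming Homebrew put clang-omp on the PATH (I 
haven't verified these exact keys do what I want):

```
# nim.cfg — hypothetical sketch, not tested
cc = clang
clang.exe = "clang-omp"        # point Nim at the OpenMP-capable clang
clang.linkerexe = "clang-omp"
passC = "-march=native -fopenmp"
passL = "-fopenmp"
```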



I've based my code on [pure C 
ulmBLAS](http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/page02/index.html#toc4).
 The big difference is that instead of pointer arithmetic, I'm passing a var 
array and an offset.

I might switch to pointers because the current approach is cumbersome: I need 
to write array[i + offset] and offset += increment everywhere in my code.
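
To make the tradeoff concrete, here's a minimal sketch of the two styles 
(hypothetical proc and parameter names, not the actual Arraymancer code):

```nim
# Style 1 (current): var array + explicit offset bookkeeping.
proc scaleColumn(a: var seq[int], offset, incRow, n: int) =
  var off = offset
  for i in 0 ..< n:
    a[off] *= 2          # every access goes through the running offset
    off += incRow

# Style 2 (considered): raw pointer arithmetic — unsafe, but no
# `i + offset` noise at each access.
proc scaleColumnPtr(a: ptr int, incRow, n: int) =
  var p = a
  for i in 0 ..< n:
    p[] *= 2
    p = cast[ptr int](cast[int](p) + incRow * sizeof(int))
```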

Last thing: I'm currently using global var arrays 
[here](https://github.com/mratsim/Arraymancer/blob/63560bc5c55cae33f023c57fa7b2077d56b03d0b/src/arraymancer/fallback/blas_l3_gemm.nim#L49).
 If I instead declare them directly in the proc 
[here](https://github.com/mratsim/Arraymancer/blob/63560bc5c55cae33f023c57fa7b2077d56b03d0b/src/arraymancer/fallback/blas_l3_gemm.nim#L81),
 the program compiles but I get a segmentation fault (signal 11) at runtime. 
I'll try to reduce this to a small test case.
