> For example my SIMD definition for SSE2 and AVX512 matrix multiplication which allows me, in thousand of lines of Nim code to be as fast as 50x more pure assembly lines in OpenBLAS
Can you please elaborate on that? I looked at your code, and since you're using intrinsics, you're at the mercy of the compiler to schedule everything right. Is the assembly code you're talking about worse than what the compiler achieves, or do you use a faster algorithm to perform the matrix multiplication?
