> For example my SIMD definition for SSE2 and AVX512 matrix multiplication which 
> allows me, in thousand of lines of Nim code to be as fast as 50x more pure 
> assembly lines in OpenBLAS

Can you please elaborate on that? I looked into your code, and since you're 
using intrinsics you're at the mercy of the compiler to schedule 
everything right. Is the assembly code you're talking about worse than what 
the compiler achieves? Or do you use some faster algorithm to perform the 
matrix multiplication?
