That's true, and therefore not in Julia either, unless you use some mechanism 
for inline assembly. However, in C it might be possible to get within a factor 
of 2 of BLAS speed. This might be sufficient if you want to implement something 
slightly different from matrix multiplication (like maybe this case), where 
reformulating it in terms of BLAS matrix multiplication might create extra 
overhead.
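
To illustrate what I mean by "slightly different from matrix multiplication", 
here is a minimal sketch in C. The operation is only an assumed example, not 
the one from this thread: it has the same loop structure as a matrix product 
but replaces a*b with |a - b|, so it cannot be passed directly to a BLAS gemm 
call. The function name absdiff_product and the row-major layout are my own 
choices for illustration.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical kernel (an assumption for illustration):
 *   C[i][j] = sum over k of |A[i][k] - B[k][j]|
 * A is n x m, B is m x p, C is n x p, all stored row-major. */
void absdiff_product(size_t n, size_t m, size_t p,
                     const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; ++i) {
        double *c_row = C + i * p;

        /* Clear the output row before accumulating into it. */
        for (size_t j = 0; j < p; ++j)
            c_row[j] = 0.0;

        /* i-k-j loop order: the innermost loop walks one row of B and
         * one row of C contiguously, which is friendly to the cache and
         * gives the compiler a simple loop to vectorize. */
        for (size_t k = 0; k < m; ++k) {
            const double a = A[i * m + k];
            const double *b_row = B + k * p;
            for (size_t j = 0; j < p; ++j)
                c_row[j] += fabs(a - b_row[j]);
        }
    }
}
```

How close a loop like this gets to BLAS speed depends on the compiler, the 
optimization flags and the matrix sizes, so it would have to be measured. 
Blocking the loops for the cache would be the next step if it is too slow.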
