You need a lot more than just fast loops to match the performance of an
optimized BLAS. See, e.g., this notebook for some comments on the related
case of matrix multiplication:

http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
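
To give a rough idea of what "more than fast loops" can mean, here is a minimal sketch of one such ingredient, cache blocking (tiling), next to the textbook triple loop. It is written in plain Python for readability only, so it says nothing about actual speed. The function names and the block size `bs` are my own assumptions for illustration and are not taken from the notebook or from any BLAS.

```python
def matmul_naive(A, B):
    """Return A*B for matrices stored as lists of rows (textbook triple loop)."""
    m, n, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C


def matmul_blocked(A, B, bs=2):
    """Same product, accumulated tile by tile.

    bs is a hypothetical block size; in a compiled language one would try to
    choose it so that the tiles being worked on fit in cache.
    """
    m, n, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(m)]
    # Outer three loops walk over tiles; inner three loops work within a tile.
    for i0 in range(0, m, bs):
        for k0 in range(0, n, bs):
            for j0 in range(0, p, bs):
                for i in range(i0, min(i0 + bs, m)):
                    for k in range(k0, min(k0 + bs, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + bs, p)):
                            C[i][j] += a * B[k][j]
    return C
```

Both functions do the same arithmetic; only the order in which memory is visited differs. An optimized BLAS typically goes well beyond this (for example register-level tiling and SIMD kernels), which is why tuning the loops alone usually falls short.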
