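You need a lot more than just fast loops to match the performance of an optimized BLAS: most of the gain comes from restructuring the computation around the memory hierarchy (cache blocking) and from carefully tuned SIMD kernels. See e.g. this notebook for some comments on the related case of matrix multiplication: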
http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
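
To give a rough idea of what "more than fast loops" means, here is a minimal sketch of cache blocking (tiling) for matrix multiplication. It is not taken from the notebook; the function names and the block size `bs` are made up for illustration, and pure Python will not show the speedup. The point is only the loop restructuring: the blocked version does the same arithmetic but works on small tiles, which in a compiled language lets each tile stay in cache while it is reused.

```python
def matmul_naive(A, B):
    """Textbook triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i][p] * B[p][j]
            C[i][j] = s
    return C


def matmul_blocked(A, B, bs=2):
    """Same arithmetic, but iterating over bs-by-bs tiles.

    bs is a hypothetical tuning parameter; a real BLAS picks block sizes
    per cache level and adds register tiling and SIMD on top of this.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, bs):              # tile rows of C
        for j0 in range(0, m, bs):          # tile columns of C
            for p0 in range(0, k, bs):      # tile the inner dimension
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, m)):
                        s = C[i][j]
                        for p in range(p0, min(p0 + bs, k)):
                            s += A[i][p] * B[p][j]
                        C[i][j] = s
    return C


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(matmul_naive(A, B))
print(matmul_blocked(A, B))
```

Both calls should print the same matrix. Even with blocking written out like this in a fast language, you would typically still be well short of an optimized BLAS, which is the point of the notebook above.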
