I'm not sure what result you expected here. BLAS is designed to be as fast as possible at matrix multiply. I'd be more concerned if you wrote straightforward loop code and it beat BLAS, since that would mean the BLAS is badly mistuned.
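
For anyone who wants to reproduce this kind of comparison, here is a rough sketch, not the code from this thread. It assumes Python purely for illustration, and the names `naive_matmul` and `compare` are made up for this example. It times a textbook triple loop against a library matrix product, using NumPy only if it happens to be installed (NumPy normally hands `@` to whatever BLAS it was built against). Pure-Python loops are far slower than compiled loops, so the ratio you see says nothing about the numbers reported earlier in the thread; the point is only the shape of the experiment.

```python
import time

try:
    import numpy as np  # optional; assumed to be linked against some BLAS
except ImportError:
    np = None


def naive_matmul(a, b):
    """Textbook triple-loop product of two square matrices (lists of lists)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]  # hoist the row element out of the inner loop
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c


def compare(n=60):
    """Time the loop version and, if NumPy is present, the library version.

    n is kept small by default because pure-Python loops are slow;
    the thread's 1000 x 1000 case would take minutes this way.
    """
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]

    t0 = time.perf_counter()
    c = naive_matmul(a, b)
    t_loops = time.perf_counter() - t0

    t_lib = None
    if np is not None:
        a_np, b_np = np.array(a), np.array(b)
        t0 = time.perf_counter()
        c_np = a_np @ b_np
        t_lib = time.perf_counter() - t0
        # Both versions should agree on the result.
        assert np.allclose(c, c_np)
    return t_loops, t_lib


if __name__ == "__main__":
    t_loops, t_lib = compare()
    print("loops:", t_loops, "library:", t_lib)
```

If the library time ever came out slower than the loops in a compiled language at a size like 1000 x 1000, that would be the surprising result worth investigating.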

On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky <[email protected]> wrote:
> Thanks Steven, I've thought there is something more behind...
>
> I shall note that that I forgot to mention matrix dimensions, which is
> 1000 x 1000.
>
> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>>
>> You need a lot more than just fast loops to match the performance of an
>> optimized BLAS. See e.g. this notebook for some comments on the related
>> case of matrix multiplication:
>>
>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>>
