I found a rather instructive tutorial about the kind of optimisations going 
into matrix multiplication in BLIS (not OpenBLAS but related) here:
http://apfel.mathematik.uni-ulm.de/%7Elehn/sghpc/gemm/index.html

These optimisations are not something one can implement in Julia (yet). 
Hopefully further work on the vectorisation of tuples will help (e.g. issue 
#11899 and related ones).
