I found a rather instructive tutorial on the kinds of optimisations that go into matrix multiplication in BLIS (not OpenBLAS, but related) here: http://apfel.mathematik.uni-ulm.de/%7Elehn/sghpc/gemm/index.html
These optimisations are not something one can implement in Julia (yet). Hopefully further work on vectorisation of tuples will help (e.g. issue #11899 and related issues).
