I didn't finish the previous mail before hitting "send", so here is the postscript...
OK, so I've had a bit of time to look at the actual test case. I missed one very important detail before: This is a vector-matrix operation. For this, we do not have a good library routine (Harald just removed it because of a bug in buffering), and -fexternal-blas does not work because we do not handle calls to anything but *GEMM.
A vector-matrix multiplication would be a call to *GEMV; supporting that is a worthy goal, but out of scope so close to a release.
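To make the operation concrete, here is a minimal sketch in C of a vector-matrix product c = v*A written as a plain loop nest. It is only an illustration, not what the compiler or the library actually emits: the name `vecmat` and its argument order are made up for this example, and the matrix is assumed to be stored column-major, as Fortran stores it. With BLAS, the same result would correspond to *GEMV applied to the transpose of A.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only: computes c = v * A.
 * v has m elements, c has n elements, and a holds an m x n matrix
 * in column-major order, so element A(i,j) lives at a[i + j*m]. */
void vecmat(size_t m, size_t n, const double *v, const double *a, double *c)
{
    for (size_t j = 0; j < n; j++) {
        double sum = 0.0;
        /* Column j of A is contiguous in memory, so this inner
         * loop is a unit-stride dot product of v with that column. */
        for (size_t i = 0; i < m; i++)
            sum += v[i] * a[i + j * m];
        c[j] = sum;
    }
}
```

Each output element is a dot product over a contiguous column, which is the kind of loop an optimizing compiler can usually vectorize for the target architecture.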
The idea is that, for a vector-matrix multiplication, the compiler should have enough information about how to optimize for the relevant architecture, especially if the user compiles with the right flags. So, the current idea is that, if we optimize, we can inline. What would a better heuristic be?
Best regards
Thomas