On Thu, Mar 18, 2021 at 07:24:21PM +0100, Thomas Koenig wrote:
> I didn't finish the previous mail before hitting "send", so here
> is the postscript...
>
> > OK, so I've had a bit of time to look at the actual test case.  I
> > missed one very important detail before: This is a vector-matrix
> > operation.
> >
> > For this, we do not have a good library routine (Harald just
> > removed it because of a bug in buffering), and -fexternal-blas
> > does not work because we do not handle calls to anything but
> > *GEMM.
>
> A vector-matrix multiplication would be a call to *GEMV, a worthy
> goal, but out of scope so close to a release.
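To make the quoted point concrete, here is a minimal sketch (not from
this thread) of the MATMUL-to-*GEMV mapping being discussed, assuming
single precision and a reference BLAS linked with -lblas:

  program vecmat
    implicit none
    integer, parameter :: m = 3, n = 4
    real :: a(m,n), x(m), y1(n), y2(n)
    external sgemv                      ! reference BLAS routine

    call random_number(a)
    call random_number(x)

    ! Vector-matrix product as the user writes it.
    y1 = matmul(x, a)

    ! Equivalent BLAS call: y2 = 1.0 * transpose(a) * x + 0.0 * y2
    call sgemv('T', m, n, 1.0, a, m, x, 1, 0.0, y2, 1)

    print *, maxval(abs(y1 - y2))       ! expect roughly zero
  end program vecmat

Built with something like "gfortran vecmat.f90 -lblas", both results
should agree to rounding.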
Agreed.

> The idea is that, for a vector-matrix multiplication, the
> compiler should have enough information about how to optimize
> for the relevant architecture, especially if the user compiles
> with the right flags.
>
> So, the current idea is that, if we optimize, we can inline.
>
> What would a better heuristic be?

Does _gfortran_matmul_r4 (and friends) work for vector-matrix
products?  I haven't checked.  If so, how about disabling in-lining
of MATMUL for 11.1; then, for 11.2, this can be revisited and a
small N chosen for in-lining.  With -fexternal-blas and *gemm, the
default cross-over is N = 30.

BTW, I came across this on StackOverflow:

https://stackoverflow.com/questions/66682180/why-is-matmul-slower-with-gfortran-compiler-optimization-turned-on

--
Steve
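For reference (an illustration, not part of the thread): when
-fexternal-blas is in effect, the cross-over mentioned above is
controlled by -fblas-matmul-limit, whose default is 30, so an
invocation along these lines hands larger operands to an external
BLAS (foo.f90 and -lblas are placeholders):

  gfortran -O2 -fexternal-blas -fblas-matmul-limit=30 foo.f90 -lblas

MATMUL operands above the limit are then passed to *GEMM rather than
being expanded inline or handled by libgfortran.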