Hi Steve,
> On my old Core 2 CPU, a quick test with N=1000 and an NxN matrix suggests a crossover near N=1000 for REAL(4). This CPU doesn't have any AVX* instructions, so YMMV. Program follows .sig
Looking at your data with AVX (which I think we can mostly count on now):

- The library is always faster for matmul(vector,matrix) for any n >= 100.
- For matmul(matrix,vector) there is no appreciable difference.

So putting in the same inline limits for matmul(vector,matrix) that we have for matmul(matrix,matrix), and leaving matmul(matrix,vector) alone, seems like a reasonable thing to do. I'll work on a patch.

Regards

Thomas
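For anyone who wants to repeat this kind of comparison, here is a minimal sketch, assuming gfortran and REAL(4) data. It is not the test program mentioned above, and the names vecmat_sketch, vecmat_loop and time_vecmat are hypothetical. vecmat_loop is the plain nested loop that an inline expansion of matmul(vector,matrix) roughly corresponds to; time_vecmat times it against the MATMUL intrinsic for an n x n matrix. Note that with front-end optimization enabled, gfortran may inline the intrinsic call itself (see -finline-matmul-limit), so the "library" timing only reflects libgfortran when that inlining does not apply.

```fortran
! Hypothetical sketch: compare the intrinsic matmul(vector,matrix) with the
! explicit loop that an inline expansion roughly corresponds to.
module vecmat_sketch
  implicit none
  private
  public :: vecmat_loop, time_vecmat

contains

  ! y(j) = sum_i v(i) * a(i,j), i.e. the same result as matmul(v, a).
  ! The inner loop runs down column j, which is contiguous in Fortran.
  pure function vecmat_loop(v, a) result(y)
    real(4), intent(in) :: v(:), a(:,:)
    real(4) :: y(size(a, 2))
    integer :: i, j
    do j = 1, size(a, 2)
      y(j) = 0.0
      do i = 1, size(a, 1)
        y(j) = y(j) + v(i) * a(i, j)
      end do
    end do
  end function vecmat_loop

  ! Time `reps` repetitions of both versions for an n x n matrix.
  ! Timings are CPU seconds from cpu_time; they depend on CPU and flags.
  subroutine time_vecmat(n, reps, t_loop, t_lib)
    integer, intent(in) :: n, reps
    real, intent(out) :: t_loop, t_lib
    real(4), allocatable :: v(:), a(:,:), y(:)
    real :: t0, t1
    integer :: k
    allocate(v(n), a(n, n), y(n))
    call random_number(v)
    call random_number(a)

    call cpu_time(t0)
    do k = 1, reps
      y = vecmat_loop(v, a)
      v(1) = v(1) + y(1) * 1.0e-30   ! data dependency so the work is not skipped
    end do
    call cpu_time(t1)
    t_loop = t1 - t0

    call cpu_time(t0)
    do k = 1, reps
      y = matmul(v, a)
      v(1) = v(1) + y(1) * 1.0e-30
    end do
    call cpu_time(t1)
    t_lib = t1 - t0
  end subroutine time_vecmat

end module vecmat_sketch
```

A driver program would call time_vecmat for a range of sizes (say n = 100, 200, ..., 1000) and print both times, which is enough to see where, if anywhere, the crossover lies on a given machine.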