https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600
--- Comment #7 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> ---
(In reply to Dominique d'Humieres from comment #6)
> Note a problem when 16x16 matrices are inlined with -mavx (I'll investigate
> and file a PR for it).

That's a good find!

I ran this locally on Haswell and get the following numbers, including
openblas and libxsmm:

./a.out
 Size   Loops   Matmul  newmatmul   libxsmm  openblas
======================================================
    2  200000    1.562      0.107     0.104     0.139
    4  200000    6.781      0.779     1.012     0.887
    8  200000    7.424      3.360     6.150     4.732
   16  200000    2.954      7.290    14.421    11.527
   32  200000   10.401     10.251    24.396    18.071
   64   30757   12.696     14.196    27.385    24.547
  128    3829    8.646     17.684    31.460    31.530
  256     477    7.834     19.123    37.457    37.471
  512      59    8.064     19.473    40.738    40.755
 1024       7    8.334     19.475    40.931    41.112
 2048       1    3.042     19.157    41.225    41.279

So the 'newmatmul' code gets about 50% of peak. Inlined matmul is good up to
size 8/16, for sizes 16-64 libxsmm wins, and above 64 openblas is better. For
the small sizes this is mostly due to the eliminated call overhead, I think.
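For reference, a minimal sketch of this kind of timing (an illustration only,
not the actual a.out used for the table above, which also covers newmatmul and
libxsmm): it times the MATMUL intrinsic against a plain DGEMM call for one
small fixed size and reports GFLOP/s. The program name and output format are
assumptions; it needs a BLAS at link time, e.g.
gfortran -O2 -mavx bench.f90 -lopenblas.

! Minimal sketch only, not the benchmark attached to this PR.
program matmul_bench
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 16, loops = 200000
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  real(dp) :: t0, t1, flops
  integer :: i
  external :: dgemm

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  flops = 2.0_dp * real(n, dp)**3 * real(loops, dp)

  ! MATMUL intrinsic; gfortran can inline this for small sizes
  ! (see -finline-matmul-limit), avoiding any call overhead.
  call cpu_time(t0)
  do i = 1, loops
     c = matmul(a, b)
  end do
  call cpu_time(t1)
  write(*,'(A,F8.3,A,ES12.5)') 'MATMUL GFLOP/s:', flops/(t1-t0)/1.0e9_dp, &
                               '  checksum', sum(c)

  ! Same work through DGEMM: one library call per iteration.
  call cpu_time(t0)
  do i = 1, loops
     call dgemm('N', 'N', n, n, n, 1.0_dp, a, n, b, n, 0.0_dp, c, n)
  end do
  call cpu_time(t1)
  write(*,'(A,F8.3,A,ES12.5)') 'DGEMM  GFLOP/s:', flops/(t1-t0)/1.0e9_dp, &
                               '  checksum', sum(c)
end program matmul_bench

The checksum prints are there only to keep the compiler from eliminating the
timing loops as dead code.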