[Bug fortran/68600] Inlined MATMUL is too slow.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

Jerry DeLisle changed:

           What       |Removed  |Added
     ----------------------------------
     Status           |WAITING  |RESOLVED
     Resolution       |---      |FIXED

--- Comment #16 from Jerry DeLisle ---
(In reply to Thomas Koenig from comment #15)
> I think that with the current status, where
> we have -finline-matmul-limit=30 by default, we
> can close this bug.
>
> Agreed?

Yes, this can be closed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

Thomas Koenig changed:

           What       |Removed  |Added
     ----------------------------------
     Status           |NEW      |WAITING

--- Comment #15 from Thomas Koenig ---
I think that with the current status, where we have
-finline-matmul-limit=30 by default, we can close this bug.

Agreed?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #14 from Thomas Koenig ---
Question: Would it make sense to add an option so that only matrices whose
size is known at compile time are inlined? Something like
-finline-matmul-size-var=0 (to disable inlining for variable sizes) and
-finline-matmul-size-fixed=5 (to inline for fixed sizes up to 5 only).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #13 from Jerry DeLisle ---
(In reply to Thomas Koenig from comment #12)
> (In reply to Jerry DeLisle from comment #11)
--- snip ---
> May I suggest reading the docs? ;-)
--- snip ---
> The default value for N is the value specified for
> `-fblas-matmul-limit' if this option is specified, or unlimited
> otherwise.

Oh gosh! Sorry about that, Thomas. I just did not see it, and I was even
looking for it because I thought it was there! This is excellent, because I
am working on a modification to the run-time libraries. This will give us
something like:

Size  Loops  Matmul fixed  NewMatmul
             explicit
  16   2000         1.496      1.719
  32   2000         2.427      1.784
  64   2000         1.343      1.967
 128   2000         1.657      2.113
 256    477         2.660      2.185
 512     59         2.027      2.195
1024      7         1.530      2.208
2048      1         1.516      2.210

On this particular machine, the inlining at high levels of optimization has
some sweet spots, for example at a size of 32 x 32, so allowing this tuning
is essential, depending on the user's application.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #12 from Thomas Koenig ---
(In reply to Jerry DeLisle from comment #11)
> I was experimenting some more here a few days ago. I really think that
> inlining should be disabled above some threshold. On larger arrays, the
> runtime library outperforms the inline code, and right now the runtime
> routines are never used by default unless you provide
> -fno-frontend-optimize, which is counterintuitive for the larger arrays.

May I suggest reading the docs? ;-)

`-finline-matmul-limit=N'
     When front-end optimization is active, some calls to the `MATMUL'
     intrinsic function will be inlined. This may result in a code size
     increase if the size of the matrix cannot be determined at compile
     time, as code for both cases is generated. Setting
     `-finline-matmul-limit=0' will disable inlining in all cases.
     Setting this option with a value of N will produce inline code for
     matrices with size up to N. If the matrices involved are not
     square, the size comparison is performed using the geometric mean
     of the dimensions of the argument and result matrices.

> If one compiles with -march=native -mavx -Ofast etc., the inline code can
> do fairly well on the larger sizes; however, when we update the runtime
> routines to something like what is shown in comment #8, it will make even
> more sense not to inline all the time. (Unless, of course, we further
> improve the front-end optimization.)

We can give this option a reasonable default value. The current status is:

     The default value for N is the value specified for
     `-fblas-matmul-limit' if this option is specified, or unlimited
     otherwise.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #11 from Jerry DeLisle ---
(In reply to Jerry DeLisle from comment #8)
> Created attachment 36887 [details]
> A faster version
>
> I took the example code found in
> http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/ where the register-based
> vector computations are explicitly called via the SSE registers and
> converted it to use the builtin gcc vector extensions. I had to experiment
> a little to get some of the equivalent operations of the original code.
>
> With only -O2 and -march=native I am getting good results. I still need to
> roll this into the other test program to confirm the Gflops are being
> computed correctly. The Diff value compares against the reference naive
> results to check that the computation is correct.
>
> MY_MMult = [
> Size:   40, Gflops: 1.828571e+00, Diff: 2.664535e-15
> Size:   80, Gflops: 3.696751e+00, Diff: 7.105427e-15
> Size:  120, Gflops: 4.051583e+00, Diff: 1.065814e-14
> Size:  160, Gflops: 4.015686e+00, Diff: 1.421085e-14
> Size:  200, Gflops: 4.029212e+00, Diff: 2.131628e-14
> Size:  240, Gflops: 3.972414e+00, Diff: 2.486900e-14
> Size:  280, Gflops: 3.881188e+00, Diff: 2.842171e-14
> Size:  320, Gflops: 3.872371e+00, Diff: 3.552714e-14
> Size:  360, Gflops: 3.887676e+00, Diff: 4.973799e-14
> Size:  400, Gflops: 3.862052e+00, Diff: 4.973799e-14
> Size:  440, Gflops: 3.886575e+00, Diff: 4.973799e-14
> Size:  480, Gflops: 3.910124e+00, Diff: 6.039613e-14
> Size:  520, Gflops: 3.863706e+00, Diff: 6.394885e-14
> Size:  560, Gflops: 3.976947e+00, Diff: 6.750156e-14
> Size:  600, Gflops: 4.002631e+00, Diff: 7.460699e-14
> Size:  640, Gflops: 3.992507e+00, Diff: 8.171241e-14
> Size:  680, Gflops: 3.964570e+00, Diff: 9.237056e-14
> Size:  720, Gflops: 3.973661e+00, Diff: 1.101341e-13
> Size:  760, Gflops: 3.982346e+00, Diff: 1.065814e-13
> Size:  800, Gflops: 3.869291e+00, Diff: 8.881784e-14
> Size:  840, Gflops: 3.936271e+00, Diff: 1.065814e-13
> Size:  880, Gflops: 3.931259e+00, Diff: 1.030287e-13
> Size:  920, Gflops: 3.912907e+00, Diff: 1.207923e-13
> Size:  960, Gflops: 3.938391e+00, Diff: 1.278977e-13
> Size: 1000, Gflops: 3.945754e+00, Diff: 1.421085e-13

(In reply to Dominique d'Humieres from comment #10)
> > I think you are seeing the effects of inefficiencies of assumed-shape
> > arrays.
> >
> > If you want to use matmul on very small matrix sizes, it is best to
> > use fixed-size explicit arrays.
>
> Then IMO the matmul inlining should be restricted to fixed-size explicit
> arrays. Could this be done before the release of gcc-6?

I was experimenting some more here a few days ago. I really think that
inlining should be disabled above some threshold. On larger arrays, the
runtime library outperforms the inline code, and right now the runtime
routines are never used by default unless you provide -fno-frontend-optimize,
which is counterintuitive for the larger arrays.

If one compiles with -march=native -mavx -Ofast etc., the inline code can do
fairly well on the larger sizes; however, when we update the runtime routines
to something like what is shown in comment #8, it will make even more sense
not to inline all the time. (Unless, of course, we further improve the
front-end optimization.)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #10 from Dominique d'Humieres ---
> I think you are seeing the effects of inefficiencies of assumed-shape
> arrays.
>
> If you want to use matmul on very small matrix sizes, it is best to
> use fixed-size explicit arrays.

Then IMO the matmul inlining should be restricted to fixed-size explicit
arrays. Could this be done before the release of gcc-6?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #9 from Thomas Koenig ---
> I took the example code found in
> http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/ where the register-based
> vector computations are explicitly called via the SSE registers and
> converted it to use the builtin gcc vector extensions. I had to experiment
> a little to get some of the equivalent operations of the original code.

Nice one. I think we can use this as a basis for a new library function in
7.1. It will probably be necessary to add some zero-padding in places where
the matrix dimensions are not divisible by four.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #8 from Jerry DeLisle ---
Created attachment 36887
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36887&action=edit
A faster version

I took the example code found in
http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/ where the register-based
vector computations are explicitly called via the SSE registers and
converted it to use the builtin gcc vector extensions. I had to experiment a
little to get some of the equivalent operations of the original code.

With only -O2 and -march=native I am getting good results. I still need to
roll this into the other test program to confirm the Gflops are being
computed correctly. The Diff value compares against the reference naive
results to check that the computation is correct.

MY_MMult = [
Size:   40, Gflops: 1.828571e+00, Diff: 2.664535e-15
Size:   80, Gflops: 3.696751e+00, Diff: 7.105427e-15
Size:  120, Gflops: 4.051583e+00, Diff: 1.065814e-14
Size:  160, Gflops: 4.015686e+00, Diff: 1.421085e-14
Size:  200, Gflops: 4.029212e+00, Diff: 2.131628e-14
Size:  240, Gflops: 3.972414e+00, Diff: 2.486900e-14
Size:  280, Gflops: 3.881188e+00, Diff: 2.842171e-14
Size:  320, Gflops: 3.872371e+00, Diff: 3.552714e-14
Size:  360, Gflops: 3.887676e+00, Diff: 4.973799e-14
Size:  400, Gflops: 3.862052e+00, Diff: 4.973799e-14
Size:  440, Gflops: 3.886575e+00, Diff: 4.973799e-14
Size:  480, Gflops: 3.910124e+00, Diff: 6.039613e-14
Size:  520, Gflops: 3.863706e+00, Diff: 6.394885e-14
Size:  560, Gflops: 3.976947e+00, Diff: 6.750156e-14
Size:  600, Gflops: 4.002631e+00, Diff: 7.460699e-14
Size:  640, Gflops: 3.992507e+00, Diff: 8.171241e-14
Size:  680, Gflops: 3.964570e+00, Diff: 9.237056e-14
Size:  720, Gflops: 3.973661e+00, Diff: 1.101341e-13
Size:  760, Gflops: 3.982346e+00, Diff: 1.065814e-13
Size:  800, Gflops: 3.869291e+00, Diff: 8.881784e-14
Size:  840, Gflops: 3.936271e+00, Diff: 1.065814e-13
Size:  880, Gflops: 3.931259e+00, Diff: 1.030287e-13
Size:  920, Gflops: 3.912907e+00, Diff: 1.207923e-13
Size:  960, Gflops: 3.938391e+00, Diff: 1.278977e-13
Size: 1000, Gflops: 3.945754e+00, Diff: 1.421085e-13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

Dominique d'Humieres changed:

           What        |Removed      |Added
     ---------------------------------------
     Status            |UNCONFIRMED  |NEW
     Last reconfirmed  |             |2015-11-30
     Ever confirmed    |0            |1

--- Comment #6 from Dominique d'Humieres ---
> I think you are seeing the effects of inefficiencies of assumed-shape
> arrays.
>
> If you want to use matmul on very small matrix sizes, it is best to
> use fixed-size explicit arrays.

Well, the problem is that MATMUL inlining is the default. IMO it should be
restricted to fixed-size explicit arrays (and small matrices?), at least for
the 6.1 version.

> Created attachment 36869 [details]
> Thomas program with a modified dgemm.
>
> The dgemm in this example is a stripped-out version of an "optimized for
> cache" version from netlib.org. I stripped out a lot of the unused code.

It is probably too late for 6.1, but the results are quite impressive
(~30 Gflops/s peak):

[Book15] f90/bug% gfc -Ofast timing/matmul_sys_8jd.f90
[Book15] f90/bug% a.out

Size  Loops  Matmul fixed  dgemm   Matmul   Matmul variable
             explicit              assumed  explicit
   2     20         0.969   0.104    0.360    0.368
   4     20         5.821   0.774    1.381    1.049
   8     20         5.415   2.970    2.316    2.342
  16     20         6.455   4.917    2.738    3.225
  32     20         7.332   5.964    2.893    4.117
  64  30757         5.565   7.277    2.785    3.830
 128   3829         4.790   7.982    2.981    4.384
 256    477         4.674   8.375    3.077    4.675
 512     59         4.797   8.200    3.156    4.786
1024      7         3.967   8.370    2.896    4.050
2048      1         3.693   8.414    2.804    3.650

[Book15] f90/bug% gfc -Ofast -mavx timing/matmul_sys_8jd.f90
[Book15] f90/bug% a.out

Size  Loops  Matmul fixed  dgemm   Matmul   Matmul variable
             explicit              assumed  explicit
   2     20         0.956   0.106    0.372    0.469
   4     20         7.805   0.715    1.334    1.462
   8     20         7.520   3.222    2.292    3.482
  16     20         3.001   6.406    2.671    4.917
  32     20         8.886   8.530    2.900    6.136
  64  30757        10.203  10.998    2.677    6.770
 128   3829         6.742  13.367    2.831    6.774
 256    477         6.435  13.979    2.906    6.049
 512     59         6.592  15.041    2.991    6.273
1024      7         5.247  14.639    2.775    4.922
2048      1         4.309  13.976    2.739    4.176

Note a problem when 16x16 matrices are inlined with -mavx (I'll investigate
and file a PR for it).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

Joost VandeVondele changed:

           What        |Removed  |Added
     -----------------------------------
     CC                |         |Joost.VandeVondele at mat dot ethz.ch

--- Comment #7 from Joost VandeVondele ---
(In reply to Dominique d'Humieres from comment #6)
> Note a problem when 16x16 matrices are inlined with -mavx (I'll investigate
> and file a PR for it).

That's a good find! I ran locally on Haswell and got these numbers,
including openblas and libxsmm:

./a.out
Size  Loops  Matmul fixed  newmatmul  dgemm-like  dgemm
             explicit      internal   libxsmm     openblas
   2     20        1.562       0.107       0.104     0.139
   4     20        6.781       0.779       1.012     0.887
   8     20        7.424       3.360       6.150     4.732
  16     20        2.954       7.290      14.421    11.527
  32     20       10.401      10.251      24.396    18.071
  64  30757       12.696      14.196      27.385    24.547
 128   3829        8.646      17.684      31.460    31.530
 256    477        7.834      19.123      37.457    37.471
 512     59        8.064      19.473      40.738    40.755
1024      7        8.334      19.475      40.931    41.112
2048      1        3.042      19.157      41.225    41.279

So the 'newmatmul' code gets about 50% of peak. The inlined matmul is good
up to size 8/16, libxsmm wins from 16 to 64, and above 64 openblas is
better. For the small sizes the gain is mostly from eliminated call
overhead, I think.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #3 from Thomas Koenig ---
Created attachment 36868
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36868&action=edit
Modified benchmark (really this time)

Hi Dominique,

I think you are seeing the effects of inefficiencies of assumed-shape
arrays.

If you want to use matmul on very small matrix sizes, it is best to use
fixed-size explicit arrays.

Below are the results of the modified benchmark (with some changes to keep
the optimizer honest, such as a call to a dummy subroutine) on my rather
dated home box:

Size  Loops  Matmul fixed  dgemm   Matmul   Matmul variable
             explicit              assumed  explicit
   2     20        11.948   0.072    0.142    0.411
   4     20         1.711   0.417    0.534    0.861
   8     20         2.314   0.953    0.858    1.076
  16     20         1.745   1.276    0.918    1.000
  32     20         1.459   1.456    1.371    1.436
  64  30757         1.501   1.440    1.360    1.393
 128   3829         1.586   1.544    1.557    1.529
 256    477         1.531   1.519    1.544    1.507
 512     59         1.315   1.290    1.263    1.231
1024      7         1.110   1.081    1.069    1.053
2048      1         1.095   1.086    1.081    1.058
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

Jerry DeLisle changed:

           What        |Removed  |Added
     -----------------------------------
     CC                |         |jvdelisle at gcc dot gnu.org

--- Comment #4 from Jerry DeLisle ---
Created attachment 36869
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36869&action=edit
Thomas program with a modified dgemm.

The dgemm in this example is a stripped-out version of an "optimized for
cache" version from netlib.org. I stripped out a lot of the unused code.

The results show better performance for larger arrays. Maybe we could model
the library routines after this and invoke it for larger arrays.

Size  Loops  Matmul fixed  dgemm   Matmul   Matmul variable
             explicit              assumed  explicit
   2     20         1.752   0.042    0.124    0.295
   4     20         2.172   0.314    0.434    0.704
   8     20         2.293   1.071    0.721    1.127
  16     20         2.826   1.533    0.972    1.468
  32     20         2.707   1.666    1.184    2.154
  64  30757         2.726   1.853    1.192    2.299
 128   3829         2.641   1.965    1.379    2.542
 256    477         2.661   2.001    1.384    2.594
 512     59         1.740   2.011    1.147    1.746
1024      7         1.344   2.024    1.070    1.355
2048      1         1.305   2.026    1.088    1.312
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #5 from Thomas Koenig ---
Another interesting data point.

I deleted the DGEMM implementation from the file and linked against the
serial version of openblas. OK, openblas is based on GotoBLAS, so we have to
expect a hit for large matrices. Figures:

ig25@linux-fd1f:~/Krempel/Bench> gfortran -O2 -funroll-loops bench-3.f90 -lopenblas_serial
ig25@linux-fd1f:~/Krempel/Bench> ./a.out

Size  Loops  Matmul fixed  dgemm   Matmul   Matmul variable
             explicit              assumed  explicit
   2     20        11.944   0.035    0.136    0.412
   4     20         1.712   0.257    0.458    0.738
   8     20         2.080   1.162    0.824    1.077
  16     20         1.697   3.104    0.939    0.995
  32     20         1.450   4.814    1.388    1.426
  64  30757         1.485   5.978    1.351    1.371
 128   3829         1.557   6.857    1.534    1.522
 256    477         1.568   7.017    1.589    1.537

So far so good. It looks as if the crossover point between the inline and
the dgemm version is between 8 and 16, so let us try this:

ig25@linux-fd1f:~/Krempel/Bench> gfortran -O2 -funroll-loops -finline-matmul-limit=12 -fexternal-blas bench-3.f90 -lopenblas_serial
ig25@linux-fd1f:~/Krempel/Bench> ./a.out

Size  Loops  Matmul fixed  dgemm   Matmul   Matmul variable
             explicit              assumed  explicit
   2     20        11.948   0.039    0.156    0.464
   4     20         1.999   0.305    0.542    0.859
   8     20         2.435   1.359    0.962    1.255
  16     20         0.802   3.102    0.798    0.799
  32     20         4.878   4.990    4.906    4.906
  64  30757         6.045   6.062    5.977    5.968

So, if the user really wants us to call an external BLAS, we had better do
so directly and not through our library routines.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

Thomas Koenig changed:

           What        |Removed  |Added
     -----------------------------------
     CC                |         |tkoenig at gcc dot gnu.org

--- Comment #2 from Thomas Koenig ---
Created attachment 36867
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36867&action=edit
Modified benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #1 from Dominique d'Humieres ---
Created attachment 36864
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36864&action=edit
Code used for the timings