[Bug fortran/68600] Inlined MATMUL is too slow.

2017-05-08 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

Jerry DeLisle  changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution|--- |FIXED

--- Comment #16 from Jerry DeLisle  ---
(In reply to Thomas Koenig from comment #15)
> I think that with the current status, where
> we have -finline-matmul-limit=30 by default, we
> can close this bug.
> 
> Agreed?

Yes, this can be closed.

[Bug fortran/68600] Inlined MATMUL is too slow.

2017-05-07 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

Thomas Koenig  changed:

   What|Removed |Added

 Status|NEW |WAITING

--- Comment #15 from Thomas Koenig  ---
I think that with the current status, where
we have -finline-matmul-limit=30 by default, we
can close this bug.

Agreed?

[Bug fortran/68600] Inlined MATMUL is too slow.

2016-11-04 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #14 from Thomas Koenig  ---
Question: Would it make sense to add an option so that only
matrices with size known at compile-time are inlined?

Something like

-finline-matmul-size-var=0 (to disable), -finline-matmul-size-fixed=5
(to inline for fixed size up to 5 only).

[Bug fortran/68600] Inlined MATMUL is too slow.

2016-04-10 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #13 from Jerry DeLisle  ---
(In reply to Thomas Koenig from comment #12)
> (In reply to Jerry DeLisle from comment #11)

--- snip ---
> 
> May I suggest reading the docs? ;-)
> 
--- snip ---

>  The default value for N is the value specified for
>  `-fblas-matmul-limit' if this option is specified, or unlimited
>  otherwise.

Oh gosh! Sorry about that, Thomas. I just did not see it, and I was even
looking for it because I thought it was there! This is excellent because I am
working on a modification to the run-time libraries. This will give us
something like:

 
                Matmul
                fixed
 Size  Loops  explicit  NewMatmul
 =================================
   16   2000    1.496      1.719
   32   2000    2.427      1.784
   64   2000    1.343      1.967
  128   2000    1.657      2.113
  256    477    2.660      2.185
  512     59    2.027      2.195
 1024      7    1.530      2.208
 2048      1    1.516      2.210

On this particular machine, the inlining at high optimization levels has
some sweet spots, for example at a size of 32x32, so allowing this tuning is
essential depending on the user's application.

[Bug fortran/68600] Inlined MATMUL is too slow.

2016-04-10 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #12 from Thomas Koenig  ---
(In reply to Jerry DeLisle from comment #11)

> I was experimenting some more here a few days ago.  I really think that
> inlining should be disabled above some threshold.  On larger arrays, the
> runtime library outperforms the inline code, and right now by default the
> runtime routines are never used unless you provide -fno-frontend-optimize,
> which is counterintuitive for the larger arrays.

May I suggest reading the docs? ;-)

`-finline-matmul-limit=N'
 When front-end optimization is active, some calls to the `MATMUL'
 intrinsic function will be inlined.  This may result in code size
 increase if the size of the matrix cannot be determined at compile
 time, as code for both cases is generated.  Setting
 `-finline-matmul-limit=0' will disable inlining in all cases.
 Setting this option with a value of N will produce inline code for
 matrices with size up to N. If the matrices involved are not
 square, the size comparison is performed using the geometric mean
 of the dimensions of the argument and result matrices.

> If one compiles with -march=native -mavx -Ofast etc., the inline code can
> do fairly well on the larger sizes; however, when we update the runtime
> routines to something like what is shown in comment #8, it will make even
> more sense not to inline all the time (unless, of course, we further
> improve the front-end optimization to do better).

We can give this option a reasonable default value.  The current
status is

 The default value for N is the value specified for
 `-fblas-matmul-limit' if this option is specified, or unlimited
 otherwise.

[Bug fortran/68600] Inlined MATMUL is too slow.

2016-04-09 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #11 from Jerry DeLisle  ---
(In reply to Jerry DeLisle from comment #8)
> Created attachment 36887 [details]
> A faster version
> 
> I took the example code found in
> http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/ where the register-based
> vector computations are explicitly coded via SSE intrinsics, and converted
> it to use the built-in GCC vector extensions.  I had to experiment a little
> to find equivalents for some of the operations in the original code.
> 
> With only -O2 and -march=native I am getting good results.  I still need to
> roll this into the other test program to confirm the gflops are being
> computed correctly.  The Diff value compares against the reference naive
> results to check that the computation is correct.
> 
> MY_MMult = [
> Size: 40, Gflops: 1.828571e+00, Diff: 2.664535e-15 
> Size: 80, Gflops: 3.696751e+00, Diff: 7.105427e-15 
> Size: 120, Gflops: 4.051583e+00, Diff: 1.065814e-14 
> Size: 160, Gflops: 4.015686e+00, Diff: 1.421085e-14 
> Size: 200, Gflops: 4.029212e+00, Diff: 2.131628e-14 
> Size: 240, Gflops: 3.972414e+00, Diff: 2.486900e-14 
> Size: 280, Gflops: 3.881188e+00, Diff: 2.842171e-14 
> Size: 320, Gflops: 3.872371e+00, Diff: 3.552714e-14 
> Size: 360, Gflops: 3.887676e+00, Diff: 4.973799e-14 
> Size: 400, Gflops: 3.862052e+00, Diff: 4.973799e-14 
> Size: 440, Gflops: 3.886575e+00, Diff: 4.973799e-14 
> Size: 480, Gflops: 3.910124e+00, Diff: 6.039613e-14 
> Size: 520, Gflops: 3.863706e+00, Diff: 6.394885e-14 
> Size: 560, Gflops: 3.976947e+00, Diff: 6.750156e-14 
> Size: 600, Gflops: 4.002631e+00, Diff: 7.460699e-14 
> Size: 640, Gflops: 3.992507e+00, Diff: 8.171241e-14 
> Size: 680, Gflops: 3.964570e+00, Diff: 9.237056e-14 
> Size: 720, Gflops: 3.973661e+00, Diff: 1.101341e-13 
> Size: 760, Gflops: 3.982346e+00, Diff: 1.065814e-13 
> Size: 800, Gflops: 3.869291e+00, Diff: 8.881784e-14 
> Size: 840, Gflops: 3.936271e+00, Diff: 1.065814e-13 
> Size: 880, Gflops: 3.931259e+00, Diff: 1.030287e-13 
> Size: 920, Gflops: 3.912907e+00, Diff: 1.207923e-13 
> Size: 960, Gflops: 3.938391e+00, Diff: 1.278977e-13 
> Size: 1000, Gflops: 3.945754e+00, Diff: 1.421085e-13

(In reply to Dominique d'Humieres from comment #10)
> > I think you are seeing the effects of inefficiencies of assumed-shape 
> > arrays.
> >
> > If you want to use matmul on very small matrix sizes, it is best to
> > use fixed-size explicit arrays.
> 
> Then IMO the matmul inlining should be restricted to fixed-size explicit
> arrays. Could this be done before the release of gcc-6?

I was experimenting some more here a few days ago.  I really think that
inlining should be disabled above some threshold.  On larger arrays, the
runtime library outperforms the inline code, and right now by default the
runtime routines are never used unless you provide -fno-frontend-optimize,
which is counterintuitive for the larger arrays.

If one compiles with -march=native -mavx -Ofast etc., the inline code can do
fairly well on the larger sizes; however, when we update the runtime routines
to something like what is shown in comment #8, it will make even more sense
not to inline all the time (unless, of course, we further improve the
front-end optimization to do better).

[Bug fortran/68600] Inlined MATMUL is too slow.

2016-04-09 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #10 from Dominique d'Humieres  ---
> I think you are seeing the effects of inefficiencies of assumed-shape arrays.
>
> If you want to use matmul on very small matrix sizes, it is best to
> use fixed-size explicit arrays.

Then IMO the matmul inlining should be restricted to fixed-size explicit
arrays. Could this be done before the release of gcc-6?

[Bug fortran/68600] Inlined MATMUL is too slow.

2015-12-02 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #9 from Thomas Koenig  ---

> I took the example code found in
> http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/ where the register-based
> vector computations are explicitly coded via SSE intrinsics, and converted
> it to use the built-in GCC vector extensions.  I had to experiment a little
> to find equivalents for some of the operations in the original code.

Nice one, I think we can use this as a basis for a new library
function in 7.1.

It will probably be necessary to add some zero-padding in places
where the matrix dimensions are not divisible by four.

[Bug fortran/68600] Inlined MATMUL is too slow.

2015-12-01 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #8 from Jerry DeLisle  ---
Created attachment 36887
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36887&action=edit
A faster version

I took the example code found in
http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/ where the register-based
vector computations are explicitly coded via SSE intrinsics, and converted it
to use the built-in GCC vector extensions.  I had to experiment a little to
find equivalents for some of the operations in the original code.

With only -O2 and -march=native I am getting good results.  I still need to
roll this into the other test program to confirm the gflops are being computed
correctly.  The Diff value compares against the reference naive results to
check that the computation is correct.

MY_MMult = [
Size: 40, Gflops: 1.828571e+00, Diff: 2.664535e-15 
Size: 80, Gflops: 3.696751e+00, Diff: 7.105427e-15 
Size: 120, Gflops: 4.051583e+00, Diff: 1.065814e-14 
Size: 160, Gflops: 4.015686e+00, Diff: 1.421085e-14 
Size: 200, Gflops: 4.029212e+00, Diff: 2.131628e-14 
Size: 240, Gflops: 3.972414e+00, Diff: 2.486900e-14 
Size: 280, Gflops: 3.881188e+00, Diff: 2.842171e-14 
Size: 320, Gflops: 3.872371e+00, Diff: 3.552714e-14 
Size: 360, Gflops: 3.887676e+00, Diff: 4.973799e-14 
Size: 400, Gflops: 3.862052e+00, Diff: 4.973799e-14 
Size: 440, Gflops: 3.886575e+00, Diff: 4.973799e-14 
Size: 480, Gflops: 3.910124e+00, Diff: 6.039613e-14 
Size: 520, Gflops: 3.863706e+00, Diff: 6.394885e-14 
Size: 560, Gflops: 3.976947e+00, Diff: 6.750156e-14 
Size: 600, Gflops: 4.002631e+00, Diff: 7.460699e-14 
Size: 640, Gflops: 3.992507e+00, Diff: 8.171241e-14 
Size: 680, Gflops: 3.964570e+00, Diff: 9.237056e-14 
Size: 720, Gflops: 3.973661e+00, Diff: 1.101341e-13 
Size: 760, Gflops: 3.982346e+00, Diff: 1.065814e-13 
Size: 800, Gflops: 3.869291e+00, Diff: 8.881784e-14 
Size: 840, Gflops: 3.936271e+00, Diff: 1.065814e-13 
Size: 880, Gflops: 3.931259e+00, Diff: 1.030287e-13 
Size: 920, Gflops: 3.912907e+00, Diff: 1.207923e-13 
Size: 960, Gflops: 3.938391e+00, Diff: 1.278977e-13 
Size: 1000, Gflops: 3.945754e+00, Diff: 1.421085e-13

[Bug fortran/68600] Inlined MATMUL is too slow.

2015-11-30 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

Dominique d'Humieres  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2015-11-30
 Ever confirmed|0   |1

--- Comment #6 from Dominique d'Humieres  ---
> I think you are seeing the effects of inefficiencies of assumed-shape arrays.
>
> If you want to use matmul on very small matrix sizes, it is best to
> use fixed-size explicit arrays.

Well, the problem is that MATMUL inlining is the default. IMO it should be
restricted to fixed-size explicit arrays (and small matrices?), at least for
the 6.1 version.

> Created attachment 36869 [details]
> Thomas program with a modified dgemm.
>
> The dgemm in this example is a stripped-out version of an "optimized for
> cache" version from netlib.org.  I stripped out a lot of the unused code.

It is probably too late for 6.1, but the results are quite impressive
(~30Gflops/s peak):

[Book15] f90/bug% gfc -Ofast timing/matmul_sys_8jd.f90
[Book15] f90/bug% a.out
 Size  Loops   Matmul      dgemm    Matmul    Matmul
               fixed     explicit   assumed   variable
                                              explicit
 =====================================================
    2     20    0.969      0.104     0.360     0.368
    4     20    5.821      0.774     1.381     1.049
    8     20    5.415      2.970     2.316     2.342
   16     20    6.455      4.917     2.738     3.225
   32     20    7.332      5.964     2.893     4.117
   64  30757    5.565      7.277     2.785     3.830
  128   3829    4.790      7.982     2.981     4.384
  256    477    4.674      8.375     3.077     4.675
  512     59    4.797      8.200     3.156     4.786
 1024      7    3.967      8.370     2.896     4.050
 2048      1    3.693      8.414     2.804     3.650
[Book15] f90/bug% gfc -Ofast -mavx timing/matmul_sys_8jd.f90
[Book15] f90/bug% a.out
 Size  Loops   Matmul      dgemm    Matmul    Matmul
               fixed     explicit   assumed   variable
                                              explicit
 =====================================================
    2     20    0.956      0.106     0.372     0.469
    4     20    7.805      0.715     1.334     1.462
    8     20    7.520      3.222     2.292     3.482
   16     20    3.001      6.406     2.671     4.917
   32     20    8.886      8.530     2.900     6.136
   64  30757   10.203     10.998     2.677     6.770
  128   3829    6.742     13.367     2.831     6.774
  256    477    6.435     13.979     2.906     6.049
  512     59    6.592     15.041     2.991     6.273
 1024      7    5.247     14.639     2.775     4.922
 2048      1    4.309     13.976     2.739     4.176

Note a problem when 16x16 matrices are inlined with -mavx (I'll investigate and
file a PR for it).

[Bug fortran/68600] Inlined MATMUL is too slow.

2015-11-30 Thread Joost.VandeVondele at mat dot ethz.ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

Joost VandeVondele  changed:

   What|Removed |Added

 CC||Joost.VandeVondele at mat dot ethz.ch

--- Comment #7 from Joost VandeVondele  ---
(In reply to Dominique d'Humieres from comment #6)
> Note a problem when 16x16 matrices are inlined with -mavx (I'll investigate
> and file a PR for it).

that's a good find!

I ran locally on haswell, and find these numbers, including openblas, and
libxsmm. 

./a.out
 Size  Loops   Matmul    newmatmul  dgemm-like   dgemm
               fixed     internal   libxsmm      openblas
               explicit
 ========================================================
    2     20    1.562      0.107      0.104      0.139
    4     20    6.781      0.779      1.012      0.887
    8     20    7.424      3.360      6.150      4.732
   16     20    2.954      7.290     14.421     11.527
   32     20   10.401     10.251     24.396     18.071
   64  30757   12.696     14.196     27.385     24.547
  128   3829    8.646     17.684     31.460     31.530
  256    477    7.834     19.123     37.457     37.471
  512     59    8.064     19.473     40.738     40.755
 1024      7    8.334     19.475     40.931     41.112
 2048      1    3.042     19.157     41.225     41.279


So the 'newmatmul' code gets about 50% of peak.  The inlined MATMUL is good up
to size 8-16, libxsmm wins from 16 to 64, and above 64 openblas is better.  For
the small sizes the gain mostly comes from eliminating call overhead, I think.

[Bug fortran/68600] Inlined MATMUL is too slow.

2015-11-29 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #3 from Thomas Koenig  ---
Created attachment 36868
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36868&action=edit
Modified benchmark (really this time)

Hi Dominique,

I think you are seeing the effects of inefficiencies of assumed-shape arrays.

If you want to use matmul on very small matrix sizes, it is best to
use fixed-size explicit arrays.

Below are the results of the modified benchmark (some changes to keep
the optimizer honest, such as a call to a dummy subroutine) on my
rather dated home box:

 Size  Loops   Matmul      dgemm    Matmul    Matmul
               fixed     explicit   assumed   variable
                                              explicit
 =====================================================
    2     20   11.948      0.072     0.142     0.411
    4     20    1.711      0.417     0.534     0.861
    8     20    2.314      0.953     0.858     1.076
   16     20    1.745      1.276     0.918     1.000
   32     20    1.459      1.456     1.371     1.436
   64  30757    1.501      1.440     1.360     1.393
  128   3829    1.586      1.544     1.557     1.529
  256    477    1.531      1.519     1.544     1.507
  512     59    1.315      1.290     1.263     1.231
 1024      7    1.110      1.081     1.069     1.053
 2048      1    1.095      1.086     1.081     1.058

[Bug fortran/68600] Inlined MATMUL is too slow.

2015-11-29 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

Jerry DeLisle  changed:

   What|Removed |Added

 CC||jvdelisle at gcc dot gnu.org

--- Comment #4 from Jerry DeLisle  ---
Created attachment 36869
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36869&action=edit
Thomas program with a modified dgemm.

The dgemm in this example is a stripped-out version of an "optimized for
cache" version from netlib.org.  I stripped out a lot of the unused code.

Results show better performance for larger arrays.  Maybe we could model the
library routines after this and invoke them for larger arrays.

 Size  Loops   Matmul      dgemm    Matmul    Matmul
               fixed     explicit   assumed   variable
                                              explicit
 =====================================================
    2     20    1.752      0.042     0.124     0.295
    4     20    2.172      0.314     0.434     0.704
    8     20    2.293      1.071     0.721     1.127
   16     20    2.826      1.533     0.972     1.468
   32     20    2.707      1.666     1.184     2.154
   64  30757    2.726      1.853     1.192     2.299
  128   3829    2.641      1.965     1.379     2.542
  256    477    2.661      2.001     1.384     2.594
  512     59    1.740      2.011     1.147     1.746
 1024      7    1.344      2.024     1.070     1.355
 2048      1    1.305      2.026     1.088     1.312

[Bug fortran/68600] Inlined MATMUL is too slow.

2015-11-29 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #5 from Thomas Koenig  ---
Another interesting data point.  I deleted the DGEMM implementation from
the file and linked against the serial version of openblas.  OK,
openblas is based on GotoBLAS, so we have to expect it to beat the
inline version for large matrices.

Figures:

ig25@linux-fd1f:~/Krempel/Bench> gfortran -O2 -funroll-loops  bench-3.f90
-lopenblas_serial
ig25@linux-fd1f:~/Krempel/Bench> ./a.out
 Size  Loops   Matmul      dgemm    Matmul    Matmul
               fixed     explicit   assumed   variable
                                              explicit
 =====================================================
    2     20   11.944      0.035     0.136     0.412
    4     20    1.712      0.257     0.458     0.738
    8     20    2.080      1.162     0.824     1.077
   16     20    1.697      3.104     0.939     0.995
   32     20    1.450      4.814     1.388     1.426
   64  30757    1.485      5.978     1.351     1.371
  128   3829    1.557      6.857     1.534     1.522
  256    477    1.568      7.017     1.589     1.537

So far so good.  Looks as if the crossover point for the inline and the dgemm 
version is between 8 and 16, so let us try this:

ig25@linux-fd1f:~/Krempel/Bench> gfortran -O2 -funroll-loops
-finline-matmul-limit=12 -fexternal-blas bench-3.f90 -lopenblas_serial
ig25@linux-fd1f:~/Krempel/Bench> ./a.out
 Size  Loops   Matmul      dgemm    Matmul    Matmul
               fixed     explicit   assumed   variable
                                              explicit
 =====================================================
    2     20   11.948      0.039     0.156     0.464
    4     20    1.999      0.305     0.542     0.859
    8     20    2.435      1.359     0.962     1.255
   16     20    0.802      3.102     0.798     0.799
   32     20    4.878      4.990     4.906     4.906
   64  30757    6.045      6.062     5.977     5.968

So, if the user really wants us to call an external BLAS, we had better
do so directly and not through our library routines.

[Bug fortran/68600] Inlined MATMUL is too slow.

2015-11-29 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

Thomas Koenig  changed:

   What|Removed |Added

 CC||tkoenig at gcc dot gnu.org

--- Comment #2 from Thomas Koenig  ---
Created attachment 36867
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36867&action=edit
Modified benchmark

[Bug fortran/68600] Inlined MATMUL is too slow.

2015-11-28 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #1 from Dominique d'Humieres  ---
Created attachment 36864
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36864&action=edit
Code used for the timings