[Bug libfortran/51119] MATMUL slow for large matrices

2017-05-29 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
Bug 51119 depends on bug 37131, which changed state.

Bug 37131 Summary: inline matmul for small matrix sizes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug libfortran/51119] MATMUL slow for large matrices

2017-05-08 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
Bug 51119 depends on bug 68600, which changed state.

Bug 68600 Summary: Inlined MATMUL is too slow.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution|--- |FIXED

[Bug libfortran/51119] MATMUL slow for large matrices

2017-02-26 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #49 from Thomas Koenig  ---
Author: tkoenig
Date: Sun Feb 26 13:22:43 2017
New Revision: 245745

URL: https://gcc.gnu.org/viewcvs?rev=245745&root=gcc&view=rev
Log:
2017-02-26  Thomas Koenig  

PR fortran/51119
* options.c (gfc_post_options): Set default limit for matmul
inlining to 30.
* invoke.texi: Document change.

2017-02-26  Thomas Koenig  

PR fortran/51119
* gfortran.dg/inline_matmul_1.f90: Scan optimized dump instead
of original.
* gfortran.dg/inline_matmul_11.f90: Likewise.
* gfortran.dg/inline_matmul_9.f90: Likewise.
* gfortran.dg/matmul_13.f90: New test.
* gfortran.dg/matmul_14.f90: New test.


Added:
trunk/gcc/testsuite/gfortran.dg/matmul_13.f90
trunk/gcc/testsuite/gfortran.dg/matmul_14.f90
Modified:
trunk/gcc/fortran/ChangeLog
trunk/gcc/fortran/invoke.texi
trunk/gcc/fortran/options.c
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gfortran.dg/inline_matmul_1.f90
trunk/gcc/testsuite/gfortran.dg/inline_matmul_11.f90
trunk/gcc/testsuite/gfortran.dg/inline_matmul_9.f90

[Bug libfortran/51119] MATMUL slow for large matrices

2016-12-22 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
Bug 51119 depends on bug 66189, which changed state.

Bug 66189 Summary: Block loops for inline matmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66189

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution|--- |WONTFIX

[Bug libfortran/51119] MATMUL slow for large matrices

2016-12-03 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Jerry DeLisle  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #48 from Jerry DeLisle  ---
I think we can close this now.

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-16 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #47 from Jerry DeLisle  ---
Author: jvdelisle
Date: Wed Nov 16 21:54:25 2016
New Revision: 242518

URL: https://gcc.gnu.org/viewcvs?rev=242518&root=gcc&view=rev
Log:
2016-11-16  Jerry DeLisle  

PR libgfortran/51119
* Makefile.am: Remove -fno-protect-parens -fstack-arrays.
* Makefile.in: Regenerate.

Modified:
trunk/libgfortran/ChangeLog
trunk/libgfortran/Makefile.am
trunk/libgfortran/Makefile.in

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-16 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #46 from Thomas Koenig  ---
(In reply to Jerry DeLisle from comment #44)

> Yes I am aware of these. I was willing to live with them, but if it is a
> problem, we can remove those options easy enough.

I think it is no big deal, but on the whole I would prefer
not to have the warnings.

So, please go ahead and remove these options. The patch to do so
is either pre-approved or obvious and simple; it is your choice :-)

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-16 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #45 from Dominique d'Humieres  ---
I have some tests coming from pr37131 which now fail due to overly stringent
comparisons between REALs. This is illustrated by the following test:

program main
  implicit none
  integer, parameter :: factor=100
  integer, parameter :: n = 2*factor, m = 3*factor, count = 4*factor
  real :: a(m, count), b(count,n)
  real, dimension(m,n) :: c1, c2, c3, c4, c5
  real :: at(count,m), bt(n,count)
  real :: c_t(n)
  integer :: i,j,k
  call random_number(a)
  call random_number(b)
  at = transpose(a)
  bt = transpose(b)

  c1 = matmul(a,b)

  c3 = 0
  do i=1,m
     do j=1,n
        do k=1, count
           c3(i,j) = c3(i,j) + at(k,i)*b(k,j)
        end do
     end do
  end do
  print *, maxval(abs(c3/c1-1)/epsilon(c1))
  if (any(abs(c3/c1-1) > sqrt(real(n))*epsilon(c1))) call abort

end program main

With -ffrontend-optimize, maxval(abs(c3/c1-1)/epsilon(c1)) fluctuates around
11, and with -fno-frontend-optimize around 3. The original test was
abs(c3-c1) > 1e-5 with c1 ~ 100.

Although I don't think there is anything wrong with these results, it may be
worth some further investigation.

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-16 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #44 from Jerry DeLisle  ---
(In reply to Janne Blomqvist from comment #43)
> Compile warnings caused by this patch:
> 
> cc1: warning: command line option ‘-fno-protect-parens’ is valid for Fortran
> but not for C
> cc1: warning: command line option ‘-fstack-arrays’ is valid for Fortran but
> not for C

Yes I am aware of these. I was willing to live with them, but if it is a
problem, we can remove those options easy enough.

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-16 Thread jb at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #43 from Janne Blomqvist  ---
Compile warnings caused by this patch:

cc1: warning: command line option ‘-fno-protect-parens’ is valid for Fortran
but not for C
cc1: warning: command line option ‘-fstack-arrays’ is valid for Fortran but not
for C

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-15 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #42 from Jerry DeLisle  ---
Author: jvdelisle
Date: Tue Nov 15 23:03:00 2016
New Revision: 242462

URL: https://gcc.gnu.org/viewcvs?rev=242462&root=gcc&view=rev
Log:
2016-11-15  Jerry DeLisle  
Thomas Koenig  

PR libgfortran/51119
* Makefile.am: Add new optimization flags matmul.
* Makefile.in: Regenerate.
* m4/matmul.m4: For the case of all strides = 1, implement a
fast blocked matrix multiply. Fix some whitespace.
* generated/matmul_c10.c: Regenerate.
* generated/matmul_c16.c: Regenerate.
* generated/matmul_c4.c: Regenerate.
* generated/matmul_c8.c: Regenerate.
* generated/matmul_i1.c: Regenerate.
* generated/matmul_i16.c: Regenerate.
* generated/matmul_i2.c: Regenerate.
* generated/matmul_i4.c: Regenerate.
* generated/matmul_i8.c: Regenerate.
* generated/matmul_r10.c: Regenerate.
* generated/matmul_r16.c: Regenerate.
* generated/matmul_r4.c: Regenerate.
* generated/matmul_r8.c: Regenerate.

2016-11-15  Thomas Koenig  

PR libgfortran/51119
* gfortran.dg/matmul_12.f90: New test case.

Added:
trunk/gcc/testsuite/gfortran.dg/matmul_12.f90
Modified:
trunk/gcc/testsuite/ChangeLog
trunk/libgfortran/ChangeLog
trunk/libgfortran/Makefile.am
trunk/libgfortran/Makefile.in
trunk/libgfortran/generated/matmul_c10.c
trunk/libgfortran/generated/matmul_c16.c
trunk/libgfortran/generated/matmul_c4.c
trunk/libgfortran/generated/matmul_c8.c
trunk/libgfortran/generated/matmul_i1.c
trunk/libgfortran/generated/matmul_i16.c
trunk/libgfortran/generated/matmul_i2.c
trunk/libgfortran/generated/matmul_i4.c
trunk/libgfortran/generated/matmul_i8.c
trunk/libgfortran/generated/matmul_r10.c
trunk/libgfortran/generated/matmul_r16.c
trunk/libgfortran/generated/matmul_r4.c
trunk/libgfortran/generated/matmul_r8.c
trunk/libgfortran/m4/matmul.m4

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-14 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Jerry DeLisle  changed:

   What|Removed |Added

   Assignee|jb at gcc dot gnu.org  |jvdelisle at gcc dot gnu.org

--- Comment #41 from Jerry DeLisle  ---
Created attachment 40039
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40039&action=edit
A test program I am using for timings

Timing program used for earlier posts.

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-08 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #40 from Jerry DeLisle  ---
(In reply to Joost VandeVondele from comment #37)
> (In reply to Joost VandeVondele from comment #36)
> > #pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
> > -funroll-loops" )
>
Using the following (I found it necessary to split the options into separate
lines):
#pragma GCC optimize ( "-Ofast" )
#pragma GCC optimize ( "-funroll-loops" )
#pragma GCC optimize ( "-fvariable-expansion-in-unroller" )

$ gfc -static -Ofast -finline-matmul-limit=0 compare.f90 
[jerry@quasar pr51119]$ ./a.out 
 ==========================================================
                     MEASURED GIGAFLOPS
 ==========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
  Size  Loops    explicit   refMatmul  assumed    explicit
 ==========================================================
     2   2000      0.055      0.048      0.042      0.055
     4   2000      0.366      0.236      0.299      0.368
     8   2000      0.628      0.673      1.610      1.833
    16   2000      2.876      2.765      2.821      2.930
    32   2000      4.681      3.382      4.812      4.763
    64   2000      6.742      2.817      6.760      6.764
   128   2000      8.532      3.194      7.852      8.539
   256    477      9.420      3.319      9.053      9.420
   512     59      8.435      2.358      8.319      8.390
  1024      7      8.493      1.368      8.379      8.444
  2048      1      8.499      1.666      8.385      8.448

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-08 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #39 from Jerry DeLisle  ---
(In reply to Thomas Koenig from comment #38)
> 
> Jerry, what Netlib code were you basing your code on?

http://www.netlib.org/blas/index.html#_level_3_blas_tuned_for_single_processors_with_caches

I used the dgemm version from the downloaded zip as a starting point, stripped
it of unneeded code, converted it to C with f2c, and made further edits to
clean it up.

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-08 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #38 from Thomas Koenig  ---
(In reply to Joost VandeVondele from comment #37)
> (In reply to Joost VandeVondele from comment #36)
> > #pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
> > -funroll-loops" )
> 
> and really beneficial for larger matrices would be 
> 
> -floop-nest-optimize
> 
> in particular the blocking (it would be an additional motivation for PR14741
> and work on graphite in general), don't know if one can give the parameter
> for the blocking. In principle the loop-nest-optimization, together with the
> -Ofast (and ideally -march=native, which we can't have in libgfortran, I
> assume) would yield near peak performance.

The algorithm that Jerry implemented already has a very nice unrolling/
blocking algorithm.  I doubt that the gcc algorithms can add to that.

Regarding -march=native, that could really be an improvement,
especially with -mavx.  I wonder if it is possible to have
architecture-specific versions of library functions?  We could
select the right routine depending on the -march flag.  Worth
a question on the gcc list, probably (but definitely _not_ a
prerequisite for this going into gcc 7).

Of course, we _could_ also try to bring blocking to the inline
version (PR 66189), risking insanity for the implementer :-)

Jerry, what Netlib code were you basing your code on?

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-08 Thread Joost.VandeVondele at mat dot ethz.ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #37 from Joost VandeVondele  ---
(In reply to Joost VandeVondele from comment #36)
> #pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
> -funroll-loops" )

and really beneficial for larger matrices would be 

-floop-nest-optimize

in particular the blocking (it would be an additional motivation for PR14741
and work on graphite in general), don't know if one can give the parameter for
the blocking. In principle the loop-nest-optimization, together with the -Ofast
(and ideally -march=native, which we can't have in libgfortran, I assume) would
yield near peak performance.

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-08 Thread Joost.VandeVondele at mat dot ethz.ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #36 from Joost VandeVondele  ---
(In reply to Jerry DeLisle from comment #34)
> -Ofast does reorder execution.. 
> Opinions welcome.

That is absolutely OK for a matmul, and all techniques to get near peak
performance require that (e.g. use of fma, blocking, etc.). 

I didn't realize that one can easily put pragmas for single routines, so you
could experiment with something like 

#pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller -funroll-loops"
)

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-08 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #35 from Thomas Koenig  ---
(In reply to Jerry DeLisle from comment #34)

> -Ofast does reorder execution.. 

So does a block algorithm.

> Opinions welcome.

I'd say go for -Ofast, or at least its subset that enables
reordering of expressions and thus vectorization.

It might be interesting to check the GOTO BLAS for accuracy, as well.

We could also ask on c.l.f. what people's expectations are, but I
shudder to think about the lectures on PL/I that would ensue :-)

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-07 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #34 from Jerry DeLisle  ---
Created attachment 39987
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39987&action=edit
A test program

Just ran some tests comparing reference results and results using -Ofast.

-Ofast does reorder execution.

Using kind=8

DELTA = (reference - result) and looking at the ranges of minval and maxval
with several random matrices one sees (4 cases):

Using some smaller matrices (see attachment for test program):

 delta minval  delta maxval

-2.2204460492503131E-016   2.2204460492503131E-016

-4.4408920985006262E-016   4.4408920985006262E-016

-2.2204460492503131E-016   2.2204460492503131E-016

-4.4408920985006262E-016   2.2204460492503131E-016

So the wiggling one gets in the least significant bits may matter to some.

With larger arrays (error amplification):

1000 x 1000   -1.0231815394945443E-012    8.8107299234252423E-013

2000 x 2000   -2.7284841053187847E-012    2.6716406864579767E-012

Compiling the test program with -Ofast so that the reference method has the
same optimization as MATMUL, one gets:

2000 x 2000    0.    0.

Opinions welcome.

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-07 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #33 from Jerry DeLisle  ---
With #pragma GCC optimize ( "-O3" )

$ gfc -static -O2 -finline-matmul-limit=0 compare.f90 
$ ./a.out 
 ==========================================================
                     MEASURED GIGAFLOPS
 ==========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
  Size  Loops    explicit   refMatmul  assumed    explicit
 ==========================================================
     2   2000      0.055      0.051      0.038      0.056
     4   2000      0.408      0.274      0.318      0.408
     8   2000      0.644      0.711      1.287      1.831
    16   2000      2.507      2.591      2.521      2.579
    32   2000      3.573      2.300      3.506      3.573
    64   2000      4.628      2.196      4.462      4.629
   128   2000      5.030      2.393      5.304      5.054
   256    477      4.802      2.367      5.573      4.854
   512     59      3.907      1.856      5.234      4.035
  1024      7      3.891      1.178      5.222      4.022
  2048      1      3.901      1.500      5.238      4.033

and with no #pragma it is better than the -O3 version

$ gfc -static -O2 -finline-matmul-limit=0 compare.f90 
$ ./a.out 
 ==========================================================
                     MEASURED GIGAFLOPS
 ==========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
  Size  Loops    explicit   refMatmul  assumed    explicit
 ==========================================================
     2   2000      0.054      0.052      0.043      0.057
     4   2000      0.397      0.281      0.316      0.414
     8   2000      0.691      0.773      1.831      1.995
    16   2000      2.493      2.691      2.521      2.512
    32   2000      3.629      2.301      3.623      3.572
    64   2000      4.557      2.072      4.568      4.468
   128   2000      5.282      2.387      5.291      5.284
   256    477      5.629      2.369      5.620      5.605
   512     59      5.215      1.874      5.240      5.216
  1024      7      5.212      1.174      5.217      5.217
  2048      1      5.230      1.499      5.234      5.229

Still a good improvement over gfortran6 on the larger matrices.

[Bug libfortran/51119] MATMUL slow for large matrices

2016-11-07 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #32 from Jerry DeLisle  ---
Created attachment 39985
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39985&action=edit
Proposed patch to get testing going

This patch works pretty well for me. My results are as follows:

gfortran version 6:

$ gfc6 -static -O2 -finline-matmul-limit=0 compare.f90 
[jerry@quasar pr51119]$ ./a.out 
 ==========================================================
                     MEASURED GIGAFLOPS
 ==========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
  Size  Loops    explicit   refMatmul  assumed    explicit
 ==========================================================
     2   2000      0.086      0.054      0.060      0.098
     4   2000      0.288      0.302      0.256      0.315
     8   2000      0.799      0.830      2.094      2.246
    16   2000      4.045      2.539      4.198      4.266
    32   2000      5.358      2.301      5.340      5.335
    64   2000      5.411      2.207      5.391      5.395
   128   2000      5.918      2.416      5.919      5.915
   256    477      5.871      2.393      5.870      5.869
   512     59      2.927      1.891      2.927      2.928
  1024      7      1.668      1.182      1.667      1.668
  2048      1      1.763      1.526      1.763      1.763

gfortran version 7:

$ gfc -static -O2 -finline-matmul-limit=0 compare.f90 
[jerry@quasar pr51119]$ ./a.out 
 ==========================================================
                     MEASURED GIGAFLOPS
 ==========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
  Size  Loops    explicit   refMatmul  assumed    explicit
 ==========================================================
     2   2000      0.053      0.052      0.043      0.054
     4   2000      0.310      0.304      0.277      0.377
     8   2000      0.704      0.858      1.711      1.758
    16   2000      2.805      2.529      2.798      2.780
    32   2000      4.693      2.210      4.700      4.821
    64   2000      6.768      2.038      6.732      6.782
   128   2000      8.550      2.419      8.647      8.595
   256    477      9.442      2.378      9.425      9.446
   512     59      8.565      1.960      8.641      8.568
  1024      7      8.537      1.178      8.610      8.530
  2048      1      8.576      1.512      8.652      8.582

A portion of the speed up is from using:

#pragma GCC optimize ( "-Ofast" )

which I just discovered. I am thinking addition and subtraction are fairly
safe with this option; however, I do not know if it is acceptable for release,
since it may conflict somewhere on some platform or even with a gcc policy.
But hey, it worked for me.

Much testing needed. There is a nice sweet spot at 256. This is on a single
thread on a 3.8 GHz core.

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-28 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #31 from Dominique d'Humieres  ---
From comment 27

> > I agree that inline should be faster, if the compiler is reasonably smart,
> > if the matrix dimensions are known at compile time (i.e. should be able to
> > generate the same kernel). I haven't checked yet.
>
> If the compiler turns out not to be reasonably smart, file a bug report :-)

PR68600.

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-24 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #30 from Jerry DeLisle  ---
(In reply to Joost VandeVondele from comment #29)

> These slides show how to reach 90% of peak:
> http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/
> the code actually is not too ugly, and I think there is no need for the
> explicit vector intrinsics with gcc.

The 90% of peak is achieved using SSE registers.  I went ahead and built the
example, and on my laptop (the slow machine) I get about 4.8 gflops with a
single core.  So we could use this example and back off from the SSE
optimizations to get an internal MATMUL that is not architecture dependent,
and perhaps leave the rest to external optimized BLAS.

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-24 Thread Joost.VandeVondele at mat dot ethz.ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #29 from Joost VandeVondele  ---
(In reply to Thomas Koenig from comment #27)
> (In reply to Joost VandeVondele from comment #22)
> If the compiler turns out not to be reasonably smart, file a bug report :-)

What is needed for large matrices (in my opinion) is some simple loop tiling,
as can, in principle, be achieved with graphite: this is my PR14741.

Good vectorization, which gcc already does well, just requires the proper
compiler options for the matmul implementation, i.e. '-O3 -march=native
-ffast-math'. However, this would require the Fortran runtime to be compiled
with such options, or at least a way to provide specialized (avx2 etc)
routines.

There is, however, the related PR (inner loop of matmul): PR25621, where an
unusual flag combo helps (-fvariable-expansion-in-unroller -funroll-loops).

I think external blas and inlining of small matmuls are good things, but I
would expect the default library implementation to reach at least 50% of peak
(for e.g. a 4000x4000 matrix), which is not all that hard. Actually, would be
worth an experiment, but a Fortran loop nest which implements a matmul compiled
with ifort would presumably reach that or higher :-).

These slides show how to reach 90% of peak:
http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/
the code actually is not too ugly, and I think there is no need for the
explicit vector intrinsics with gcc.

I believe I once had a bug report open for small matrices, but this might have
been somewhat fixed in the meantime.

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-24 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #28 from Jerry DeLisle  ---
(In reply to Janne Blomqvist from comment #25)

> 
> But, that is not particularly impressive, is it? I don't know about current
> low end graphics adapters, but at least the high end GPU cards (Tesla) are
> capable of several Tflops. Of course, there is a non-trivial threshold size
> to amortize the data movement to/from the GPU.

Not even a graphics card, just the on-system chip on a low end laptop. Not
trying to impress, just pointing out that hardware acceleration is fairly
ubiquitous these days, so why not just use it?  Maybe not important for serious
computing where users already have things like your 20 core machine.
> 
> With the test program from #12, with OpenBLAS (which BTW should be available
> in Fedora 22 as well) I get 337 Gflops/s, or 25 Gflops/s if I restrict it to
> a single core with the OMP_NUM_THREADS=1 environment variable. This on a
> machine with 20 2.8 GHz Ivy bridge cores.
> 
> I'm not per se against using GPU's, but I think there's a lot of low hanging
> fruit to be had just by making it easier for users to use a high performance
> BLAS implementation.

I agree; if an available external BLAS does what is needed, very good. What I
am exploring is one of those external BLAS libraries that uses the GPU.  Maybe
the answer to this PR is "use an external BLAS" and close this PR.

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-24 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #27 from Thomas Koenig  ---
(In reply to Joost VandeVondele from comment #22)

> I agree that inline should be faster, if the compiler is reasonably smart,
> if the matrix dimensions are known at compile time (i.e. should be able to
> generate the same kernel). I haven't checked yet.

If the compiler turns out not to be reasonably smart, file a bug report :-)

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-24 Thread jb at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #25 from Janne Blomqvist  ---
(In reply to Jerry DeLisle from comment #24)
> (In reply to Jerry DeLisle from comment #16)
> > For what its worth:
> > 
> > $ gfc pr51119.f90 -lblas -fno-external-blas -Ofast -march=native 
> > $ ./a.out 
> >  Time, MATMUL:21.2483196   21.25444964601 1.5055670945599979
> > 
> >  Time, dgemm:33.2441711   33.24308728902  .96260614189671445
> > 
> 
> Running a sample matrix multiply program on this same platform using the
> default OpenCL (Mesa on Fedora 22) the machine is achieving:
> 
> 64 x 64  2.76 Gflops
> 1000 x 1000  14.10
> 2000 x 2000  24.4

But, that is not particularly impressive, is it? I don't know about current low
end graphics adapters, but at least the high end GPU cards (Tesla) are capable
of several Tflops. Of course, there is a non-trivial threshold size to amortize
the data movement to/from the GPU.

With the test program from #12, with OpenBLAS (which BTW should be available in
Fedora 22 as well) I get 337 Gflops/s, or 25 Gflops/s if I restrict it to a
single core with the OMP_NUM_THREADS=1 environment variable. This on a machine
with 20 2.8 GHz Ivy bridge cores.

I'm not per se against using GPU's, but I think there's a lot of low hanging
fruit to be had just by making it easier for users to use a high performance
BLAS implementation.

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-24 Thread jb at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #26 from Janne Blomqvist  ---
(In reply to Thomas Koenig from comment #15)
> Another issue:  What should we do if the user supplies an external
> subroutine DGEMM which does something unrelated?
> 
> I suppose we should then make DGEMM (and SGEMM) an intrinsic subroutine.

Indeed, this is a potential problem. 

Another related problem is that, apparently, framework Accelerate on OSX uses
the f2c ABI. This at least can be worked around by instead using the cblas API
(and hence the C ABI), which most BLAS implementations provide anyway.

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-23 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #24 from Jerry DeLisle  ---
(In reply to Jerry DeLisle from comment #16)
> For what its worth:
> 
> $ gfc pr51119.f90 -lblas -fno-external-blas -Ofast -march=native 
> $ ./a.out 
>  Time, MATMUL:21.2483196   21.25444964601 1.5055670945599979
> 
>  Time, dgemm:33.2441711   33.24308728902  .96260614189671445
> 

Running a sample matrix multiply program on this same platform using the
default OpenCL (Mesa on Fedora 22) the machine is achieving:

64 x 64  2.76 Gflops
1000 x 1000  14.10
2000 x 2000  24.4

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-23 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #21 from Thomas Koenig  ---

> Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs
> seems even a tad more tricky. We have a paper on GPU (small) matrix
> multiplication, http://dbcsr.cp2k.org/_media/gpu_book_chapter_submitted.pdf

Quite interesting what can be done with GPUs...

> . BTW, another interesting project is the libxsmm library more aimed at
> small (<128) matrices see : https://github.com/hfp/libxsmm . Not sure if
> this info is useful in this context, but it might provide inspiration.

I assume that for  small matrices bordering on the silly
(say, a matrix multiplication with dimensions of (1,2) and (2,1))
the inline code will be faster if the code is compiled with the
right options, due to function call overhead.  I also assume that
libxsmm will become faster quite soon for bigger sizes.

Do you have an idea where the crossover is?

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-23 Thread Joost.VandeVondele at mat dot ethz.ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #22 from Joost VandeVondele  ---
(In reply to Thomas Koenig from comment #21)
> I assume that for  small matrices bordering on the silly
> (say, a matrix multiplication with dimensions of (1,2) and (2,1))
> the inline code will be faster if the code is compiled with the
> right options, due to function call overhead.  I also assume that
> libxsmm will become faster quite soon for bigger sizes.
> 
> Do you have an idea where the crossover is?

I agree that inline should be faster, if the compiler is reasonably smart, if
the matrix dimensions are known at compile time (i.e. should be able to
generate the same kernel). I haven't checked yet.

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-23 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #23 from Jerry DeLisle  ---
(In reply to Thomas Koenig from comment #21)
> > Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs
> > seems even a tad more tricky. We have a paper on GPU (small) matrix
> > multiplication, http://dbcsr.cp2k.org/_media/gpu_book_chapter_submitted.pdf
> 
> Quite interesting what can be done with GPUs...
> 

Run of the mill graphics processing units have many floating point compute
cores; 128 cores is not unusual, and usually there are a lot more. These cores
perform basic operations like a + b * c on scalars, among other useful
functions. Software like OpenCL will compile compute kernels which run
efficiently in parallel on these GPU architectures. clBLAS is a runtime
library which encapsulates this capability with a BLAS compatible API.
Conceptually, you initialize for particular matrices and hand off the work to
the GPU.

My low end laptop (300 dollar variety) is running an nbody 3D model with
several thousand masses without even pressing the CPU, as an example. MATMUL
should be doable.

The main GPU competitors are Nvidia, AMD, and Intel. OpenCL is supported on
all three.

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-22 Thread Joost.VandeVondele at mat dot ethz.ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #20 from Joost VandeVondele  ---
(In reply to Jerry DeLisle from comment #19)
> If I can get something working I am thinking something like
> -fexternal-blas-n, if -n not given then default to current libblas
> behaviour. This way users have some control. With GPUs, it is not unusual to
> have hundreds of cores.  We can also, at run time, see if the opencl is
> already initialized which may mean used elsewhere so don't mess with it.

Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs
seems even a tad more tricky. We have a paper on GPU (small) matrix
multiplication, http://dbcsr.cp2k.org/_media/gpu_book_chapter_submitted.pdf .
BTW, another interesting project is the libxsmm library more aimed at small
(<128) matrices see : https://github.com/hfp/libxsmm . Not sure if this info is
useful in this context, but it might provide inspiration.

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-22 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #19 from Jerry DeLisle  ---
If I can get something working, I am thinking of something like
-fexternal-blas-n; if -n is not given, default to the current libblas
behaviour. This way users have some control. With GPUs, it is not unusual to
have hundreds of cores.  We can also, at run time, see if OpenCL is already
initialized, which may mean it is in use elsewhere, so we don't mess with it.

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-22 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #17 from Jerry DeLisle  ---
I have done some experimenting.  Since gcc supports OMP and I think to some
extent ACC why not come up with a MATMUL that exploits these if present?  On
the darwin platform discussed in comment #12, the performance is excellent. 
Does darwin implementation provided exploit OpenCL?  What is it using?  Why not
enable that on other platforms if present.

I am going to explore OpenCL and clBLAS to see if I can get it to work.  If I
am successful, I would like to hide it behind MATMUL if possible.  Any other
opinions?

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-22 Thread Joost.VandeVondele at mat dot ethz.ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #18 from Joost VandeVondele  
---
(In reply to Jerry DeLisle from comment #17)
> I have done some experimenting.  Since gcc supports OMP and I think to some
> extent ACC why not come up with a MATMUL that exploits these if present?  On
> the darwin platform discussed in comment #12, the performance is excellent. 
> Does the implementation darwin provides exploit OpenCL?  What is it using?
> Why not enable that on other platforms if present?
> 
> I am going to explore OpenCL and clBLAS to see if I can get it to work.  If
> I am successful, I would like to hide it behind MATMUL if possible.  Any
> other opinions?

yes, this is tricky. In a multithreaded code executing matmul, what is the
strategy (nested parallelism, serial, ...)? We usually link in a serial BLAS
because threading in the library is usually not good for the performance of the
code overall, i.e. nested parallelism tends to perform badly. Also, how many
threads would you use by default (depending on matrix size, machine load)?
Users on an N-core machine might run N jobs in parallel, and not expect those
to start several threads each.

Maybe this could be part of the auto-parallelize (or similar) option that gcc
has?

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-08 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Jerry DeLisle  changed:

   What|Removed |Added

 CC||jvdelisle at gcc dot gnu.org

--- Comment #16 from Jerry DeLisle  ---
For what its worth:

$ gfc pr51119.f90 -lblas -fno-external-blas -Ofast -march=native 
$ ./a.out 
 Time, MATMUL:  21.2483196   21.25444964601   1.5055670945599979
 Time, dgemm:   33.2441711   33.24308728902   0.96260614189671445

This is on a laptop not taking any advantage of a tuned BLAS.  If I replace
-Ofast with -O2 I get:

$ ./a.out 
 Time, MATMUL:  43.6199570   43.62535802297   0.73351833543988521
 Time, dgemm:   33.2262650   33.22696145301   0.96307331759072967

-O3 brings performance back to match with -Ofast. It seems odd to me that -O2
does not do well.

Regardless, the internal MATMUL is doing better than BLAS on this platform, but
1.5 gflops is pretty lame either way.

[Bug libfortran/51119] MATMUL slow for large matrices

2015-11-01 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #15 from Thomas Koenig  ---
Another issue:  What should we do if the user supplies an external subroutine
DGEMM which does something unrelated?

I suppose we should then make DGEMM (and SGEMM) an intrinsic subroutine.

[Bug libfortran/51119] MATMUL slow for large matrices

2015-10-31 Thread jb at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #14 from Janne Blomqvist  ---
(In reply to Dominique d'Humieres from comment #12)
> I suppose most modern OS provide such optimized BLAS and, if not, one can
> install libraries such as atlas. So I wonder if it would not be more
> effective to be able to configure with something such as --with-blas="magic
> incantation" and use -fexternal-blas as the default rather than reinventing
> the wheel.

This matches my current thinking on this subject. 

To get good performance one really needs arch-specific parameters (block sizes
to fit into cache etc.), as well as using arch-specific code to make maximum
use of the vector ISA. Add in threading, which is useful for larger matrices,
and there's a lot more work than what the current GFortran development team is
able to commit to.

So my idea of what ought to be done:

- Check for the presence of BLAS at compile time. Alternatively, use weak
references so we can always use BLAS if it's available, without the user having
to specify -fexternal-blas (which I guess most users don't).

  - A problem here is what if the system has multiple BLAS libraries, which one
do we choose? And different systems have different ways of linking to BLAS
(e.g. -framework Accelerate on OSX).

  - And what about BLAS64, i.e. BLAS compiled with 64-bit integers? It seems
these libraries have the same API as the "normal" BLAS, so how do we figure out
at build time which kind of BLAS library we are using?

- Currently with -fexternal-blas we only use BLAS for stride-1 arrays, falling
back to the current code for stride /= 1. It's probably more efficient to pack
stride /= 1 arrays and then call BLAS. Heck, high performance BLAS libraries
repack blocks to get better cache behavior anyway.


> 
> More than three years ago Janne Blomqvist (comment 7) wrote
> > IIRC I reached about 30-40 % of peak flops which was a bit disappointing.
> 
> Would it be possible to have the patch to play with?

My GCC dev box where I think this stuff might reside is packed down in a box as
I have recently moved. But I'll keep this in mind, and see if I can find the
patch once I get around to unpacking...

As an aside, contrary to when I implemented my patch based on reading the
papers by Goto et al., nowadays there's a nice step-by-step description at

http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html

[Bug libfortran/51119] MATMUL slow for large matrices

2015-10-31 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #12 from Dominique d'Humieres  ---
Some new numbers for a four cores Corei7 2.8Ghz, turboboost 3.8Ghz, 1.6Ghz DDR3
on x86_64-apple-darwin14.5 for the following test

program t2
  implicit none
  REAL time_begin, time_end
  integer, parameter :: n = 2000
  integer(8) :: ts, te, rate8, cmax8
  real(8) :: elapsed
  REAL(8) :: a(n,n), b(n,n), c(n,n)
  integer, parameter :: m = 100
  integer :: i
  call RANDOM_NUMBER(a)
  call RANDOM_NUMBER(b)
  call cpu_time(time_begin)
  call SYSTEM_CLOCK (ts, rate8, cmax8)
  do i = 1,m
    a(1,1) = a(1,1) + 0.1
    c = MATMUL(a,b)
  enddo
  call SYSTEM_CLOCK (te, rate8, cmax8)
  call cpu_time(time_end)
  elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
  ! GFlop/s for m repetitions of an n x n matmul: 2*m*n**3 / (1e9*elapsed)
  PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
           2*m*real(n, kind=8)**3/(10**9*elapsed)
  call cpu_time(time_begin)
  call SYSTEM_CLOCK (ts, rate8, cmax8)
  do i = 1,m
    a(1,1) = a(1,1) + 0.1
    call dgemm('n','n', n, n, n, dble(1.0), a, n, b, n, dble(0.0), c, n)
  enddo
  call SYSTEM_CLOCK (te, rate8, cmax8)
  call cpu_time(time_end)
  elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
  ! note: the label still says MATMUL, but this line times the dgemm call
  PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
           2*m*real(n, kind=8)**3/(10**9*elapsed)
end program

borrowed from
http://groups.google.com/group/comp.lang.fortran/browse_thread/thread/1cba8e6ce5080197

[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
-fno-frontend-optimize
[Book15] f90/bug% time a.out
 Time, MATMUL:  374.027161   374.0288990002   4.2777443247774283
 Time, MATMUL:  172.823853   23.0730340   69.345019818373260
546.427u 0.542s 6:37.24 137.6%  0+0k 1+0io 41pf+0w
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate 
[Book15] f90/bug% time a.out
 Time, MATMUL:  391.495880   391.494035   4.0869077353886123
 Time, MATMUL:  169.313202   22.7810991   70.233661685944114
560.384u 0.544s 6:54.39 135.3%  0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
-march=native
[Book15] f90/bug% time a.out
 Time, MATMUL:  367.570374   367.568805   4.3529265221514102
 Time, MATMUL:  170.150818   22.8375441   70.060073009602078
537.306u 0.534s 6:30.53 137.7%  0+0k 0+0io 0pf+0w

where the last column is the speed in GFlops. These numbers show that the
library MATMUL is slightly faster than the inlined version unless -march=native
is used (AVX should be twice as fast unless limited by memory bandwidth).

[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90
-framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:  159.000992   21.4508510   74.589115368896088
 Time, MATMUL:  172.616943   23.0294870   69.476145951492541
331.281u 0.453s 0:44.60 743.7%  0+0k 0+0io 3pf+0w
... repeated several times in order to warm up the CPU
[Book15] f90/bug% time a.out
 Time, MATMUL:  179.624268   23.9357089   66.845732457726655
 Time, MATMUL:  178.685364   23.8986681   66.949337929628541
357.978u 0.447s 0:47.95 747.4%  0+0k 0+0io 0pf+0w

Thus the BLAS provided by darwin gets ~67 GFlops out of the ~90 GFlops peak
(AVX * 4 cores), while the inlined MATMUL gets ~4 GFlops out of the ~15 GFlops
peak (no AVX, one core with turboboost), with little gain when using AVX
(~30 GFlops peak).

I suppose most modern OS provide such optimized BLAS and, if not, one can
install libraries such as atlas. So I wonder if it would not be more effective
to be able to configure with something such as --with-blas="magic incantation"
and use -fexternal-blas as the default rather than reinventing the wheel.

More than three years ago Janne Blomqvist (comment 7) wrote
> IIRC I reached about 30-40 % of peak flops which was a bit disappointing.

Would it be possible to have the patch to play with?


[Bug libfortran/51119] MATMUL slow for large matrices

2015-10-31 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #13 from Thomas Koenig  ---
(In reply to Dominique d'Humieres from comment #12)

> I suppose most modern OS provide such optimized BLAS and, if not, one can
> install libraries such as atlas. So I wonder if it would not be more
> effective to be able to configure with something such as --with-blas="magic
> incantation" and use -fexternal-blas as the default rather than reinventing
> the wheel.

If -fexternal-blas is supplied, the current implementation defaults to
-fblas-matmul-limit=30, which in turn sets -finline-matmul-limit=30
(which is fairly reasonable for the point where an
external, optimized BLAS and inlining are equally fast).

I would be interested to see where threading moves this intersection.

[Bug libfortran/51119] MATMUL slow for large matrices

2013-04-01 Thread tkoenig at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Thomas Koenig tkoenig at gcc dot gnu.org changed:

   What|Removed |Added

 Depends on||37131

--- Comment #11 from Thomas Koenig tkoenig at gcc dot gnu.org 2013-04-01 15:58:52 UTC ---
A bit like PR 37131 (but I don't want to lose either audit trail).


[Bug libfortran/51119] MATMUL slow for large matrices

2013-03-29 Thread Joost.VandeVondele at mat dot ethz.ch


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Joost VandeVondele Joost.VandeVondele at mat dot ethz.ch changed:

   What|Removed |Added

   Last reconfirmed|2011-11-14 00:00:00 |2013-03-29

--- Comment #10 from Joost VandeVondele Joost.VandeVondele at mat dot ethz.ch 2013-03-29 08:47:39 UTC ---
What about compiling the Fortran runtime library with vectorization, and all
the fancy options that come with graphite (loop blocking in particular)? If
they don't work for a matrix multiplication pattern, what's their use?
Further naivety would be to provide an LTO'ed runtime, allowing matrix
multiplication to be inlined for known small bounds ... kind of the ultimate
dogfooding?


[Bug libfortran/51119] MATMUL slow for large matrices

2012-06-29 Thread Joost.VandeVondele at mat dot ethz.ch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Joost VandeVondele Joost.VandeVondele at mat dot ethz.ch changed:

   What|Removed |Added

 CC||Joost.VandeVondele at mat
   ||dot ethz.ch

--- Comment #8 from Joost VandeVondele Joost.VandeVondele at mat dot ethz.ch 
2012-06-29 07:19:03 UTC ---
(In reply to comment #7)
> (In reply to comment #6)
> > Janne, have you had a chance to look at this? For larger matrices MATMUL is
> > really slow. Anything that includes even the most basic blocking scheme
> > should be faster. I think this would be a valuable improvement.
>
> I implemented a block-panel multiplication algorithm similar to GOTO BLAS and
> Eigen, but I got side-tracked by other things and never found the time to fix
> the corner-case bugs and tune performance. IIRC I reached about 30-40 % of
> peak flops which was a bit disappointing.

I think 30% of peak is a good improvement over the current version (which
reaches 7% of peak (92% for MKL) for a double precision 8000x8000 matrix
multiplication) on a sandy bridge.

In addition to blocking, is the Fortran runtime being compiled with a set of
compile options that enables vectorization ? In the ideal world, gcc would
recognize the loop pattern in the runtime library code, and do blocking,
vectorization etc. automagically.


[Bug libfortran/51119] MATMUL slow for large matrices

2012-06-29 Thread steven at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Steven Bosscher steven at gcc dot gnu.org changed:

   What|Removed |Added

 CC||steven at gcc dot gnu.org

--- Comment #9 from Steven Bosscher steven at gcc dot gnu.org 2012-06-29 
10:55:48 UTC ---
(In reply to comment #7)
> IIRC I reached about 30-40 % of peak
> flops which was a bit disappointing.

This sounds quite impressive to me, actually.

It would be interesting to investigate using the IFUNC mechanism to provide
optimized (e.g. vectorized) versions of some of the library functions.


[Bug libfortran/51119] MATMUL slow for large matrices

2012-06-28 Thread Joost.VandeVondele at mat dot ethz.ch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #6 from Joost VandeVondele Joost.VandeVondele at mat dot ethz.ch 
2012-06-28 11:58:20 UTC ---
Janne, have you had a chance to look at this? For larger matrices MATMUL is
really slow. Anything that includes even the most basic blocking scheme should
be faster. I think this would be a valuable improvement.


[Bug libfortran/51119] MATMUL slow for large matrices

2012-06-28 Thread jb at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #7 from Janne Blomqvist jb at gcc dot gnu.org 2012-06-28 12:15:05 
UTC ---
(In reply to comment #6)
> Janne, have you had a chance to look at this? For larger matrices MATMUL is
> really slow. Anything that includes even the most basic blocking scheme should
> be faster. I think this would be a valuable improvement.

I implemented a block-panel multiplication algorithm similar to GOTO BLAS and
Eigen, but I got side-tracked by other things and never found the time to fix
the corner-case bugs and tune performance. IIRC I reached about 30-40 % of peak
flops which was a bit disappointing.


[Bug libfortran/51119] MATMUL slow for large matrices

2011-11-15 Thread Joost.VandeVondele at mat dot ethz.ch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #3 from Joost VandeVondele Joost.VandeVondele at mat dot ethz.ch 
2011-11-15 12:19:59 UTC ---
(In reply to comment #1)
> I have a cunning plan.

It is doable to come within a factor of 2 of highly efficient implementations
using a cache-oblivious matrix multiply, which is relatively easy to code. I'm
not sure this is worth the effort.

I believe it would be more important to have actually highly efficient
(inlined) implementations for very small matrices. These would outperform
general libraries by a large factor. For CP2K I have written a specialized
small matrix multiply library generator which generates code that outperforms
e.g. MKL by a large factor for small matrices (32x32). The generation time
and library size do not make it a general purpose tool. It also contains an
implementation of the recursive multiply of some sort (see
http://cvs.berlios.de/cgi-bin/viewvc.cgi/cp2k/cp2k/tools/build_libsmm/)


[Bug libfortran/51119] MATMUL slow for large matrices

2011-11-15 Thread Joost.VandeVondele at mat dot ethz.ch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #4 from Joost VandeVondele Joost.VandeVondele at mat dot ethz.ch 
2011-11-15 12:31:10 UTC ---
Created attachment 25826
  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=25826
comparison in performance for small matrix multiplies (libsmm vs mkl)

added some data showing the speedup of specialized matrix multiply code (small
matrices, known bounds, in cache) against general dgemm (mkl).


[Bug libfortran/51119] MATMUL slow for large matrices

2011-11-15 Thread jb at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #5 from Janne Blomqvist jb at gcc dot gnu.org 2011-11-15 15:47:54 
UTC ---
(In reply to comment #3)
> I believe it would be more important to have actually highly efficient
> (inlined) implementations for very small matrices.

There's already PR 37131 for that.


[Bug libfortran/51119] MATMUL slow for large matrices

2011-11-14 Thread burnus at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Tobias Burnus burnus at gcc dot gnu.org changed:

   What|Removed |Added

 CC||burnus at gcc dot gnu.org

--- Comment #2 from Tobias Burnus burnus at gcc dot gnu.org 2011-11-14 
13:08:49 UTC ---
(In reply to comment #0)
> Compared to ATLAS BLAS on an AMD 10h processor, MATMUL on square matrices with
> n > 256 is around a factor of 8 slower.

Side note: You can use -fexternal-blas -fblas-matmul-limit=... and link ATLAS
BLAS.

> Assigning to myself.
> I have a cunning plan.

I am looking forward to cunning ideas - at least if they are not too
convoluted, work on all targets and are middle-end friendly.


[Bug libfortran/51119] MATMUL slow for large matrices

2011-11-13 Thread jb at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Janne Blomqvist jb at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2011-11-14
 AssignedTo|unassigned at gcc dot   |jb at gcc dot gnu.org
   |gnu.org |
 Ever Confirmed|0   |1

--- Comment #1 from Janne Blomqvist jb at gcc dot gnu.org 2011-11-14 06:49:11 
UTC ---
Assigning to myself.

I have a cunning plan.