[Bug fortran/48636] Enable more inlining with -O2 and higher

2019-01-30 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

Dominique d'Humieres  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #45 from Dominique d'Humieres  ---
The last comment is almost six year old, closing as FIXED. If I missed some
issues they should be tested against a current release(trunk is now 9.0) and a
new PR should be filed.

[Bug fortran/48636] Enable more inlining with -O2 and higher

2013-03-04 Thread wschmidt at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



William J. Schmidt wschmidt at gcc dot gnu.org changed:



   What|Removed |Added



 CC||bergner at gcc dot gnu.org



--- Comment #44 from William J. Schmidt wschmidt at gcc dot gnu.org 
2013-03-04 17:53:17 UTC ---

Compiling mgrid.f on powerpc64-unknown-linux-gnu as follows:



$ gfortran -S -m32 -O3 -mcpu=power7 -fpeel-loops -funroll-loops -ffast-math

-fvect-cost-model mgrid.f



I examined the assembly generated for revisions 193330, 193331 (this issue),

and 196171 (PR55334).  What I'm seeing is that for both 193331 and 196171, the

inliner is much more aggressive, and in particular is inlining several copies

of some pretty large functions.



For -m32, I am not seeing any specialization of resid_, so although the change

in 196171 helped a little, it appears that this was by reducing overall code

size.  There weren't any changes in inlining decisions.  Of course there is a

lot of distance between 193331 and 196171, so it is not a perfect comparison,

though it appears 196171 is where -m32 received a slight boost.



Anyway, the non-inlined call tree for 193330 is:



 main

  MAIN__

   resid_ (x4)

comm3_

   psinv_ (x3)

comm3_

   norm2u3_ (x2)

   interp_ (x2)

   setup_

   rprj3_ (x4)

   zran3_



The non-inlined call tree for 193331 is:



 main

  MAIN__

   comm3_ (x5)

   resid_

comm3_

   norm2u3_ (x2)

   setup_

   zran3_



So with 193331 we have the following additional inlines:



  3 inlines of resid_,  size = 1068, total size = 3204

  3 inlines of psinv_,  size = 1046, total size = 3138

  2 inlines of interp_, size = 1544, total size = 3088

  4 inlines of rprj3_,  size = 220,  total size = 880



Here size is the number of lines of assembly code of the called procedure,

including labels, so it's just a rough measure.  The number of static call

sites of comm3_ was also reduced by one, but I don't know whether it was

inlined or specialized away.



These are pretty large procedures to be duplicating, particularly to be

duplicating more than once.  Looking at resid_, it already generates spill code

on its own, so putting 3 copies of this in its caller isn't likely to be very

helpful.  Of these, I think only rprj3_ looks like a reasonable inline

candidate.



Total lines of the assembly files are:



  8660 r193330/mgrid.s

 16398 r193331/mgrid.s

 14592 r196171/mgrid.s



Inlining creates unreachable code, so removing the unreachable procedures

gives:



  7765 r193330/mgrid.s

 12591 r193331/mgrid.s

 10795 r196171/mgrid.s



With r196171 the reachable code is still about 40% larger than r193330 (where

some reasonable inlining was already being done).  This is better than the 60%

bloat with r193331 but still seems too high.  Again, these are rough measures

but I think they are indicative.



Without knowing anything about the inliner, I think the inlining heuristics

probably need to take more account of code size than they seem to do at the

moment, particularly when making more than one copy of a procedure and thus

reducing spatial locality.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2013-03-01 Thread wschmidt at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



William J. Schmidt wschmidt at gcc dot gnu.org changed:



   What|Removed |Added



 CC||wschmidt at gcc dot gnu.org



--- Comment #43 from William J. Schmidt wschmidt at gcc dot gnu.org 
2013-03-01 17:48:51 UTC ---

(In reply to comment #38)

 Looks like for x86 r193331 led to significant regression on 172.mgrid for -m32

 -O3 -funroll-loops



The same degradation was seen on powerpc64-unknown-linux-gnu with r193331.  The

fix by Martin Jambor for PR55334 did not help for -m32.  It did give a slight

bump to -m64, but did not return the performance to pre-r193331 levels.  So

there still seems to be a problem with 172.mgrid related to this change.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-11-16 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #42 from Dominique d'Humieres dominiq at lps dot ens.fr 
2012-11-16 14:42:33 UTC ---

 Fatigue now gets all inlining with -O3 -fwhole-program, with -O3 it gets only

 half of inlining because jump functions are not able to track array 
 descriptors

 in both calls to generalized_hookes_law.  



The same applies to the tests in comments #4 and #10.

What is the status of assumed-shape arrays with respect to array_index

hint?



 What are the other testcases to look at?



ac.f90 and aermod.f90 give a 5-10% faster runtime when compiled with '--param

max-inline-insns-auto=150' (and -Ofast -fwhole-program).


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-11-14 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #39 from Jan Hubicka hubicka at gcc dot gnu.org 2012-11-14 
23:22:40 UTC ---

Hmm, indeed. Good catch. I will look into it.


Re: [Bug fortran/48636] Enable more inlining with -O2 and higher

2012-11-14 Thread Jan Hubicka
mgrid do not seem to be sensitive to --param min-inline-speedup, so it seems 
independent regression of this change.
No idea what goes wrong.

Honza


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-11-14 Thread hubicka at ucw dot cz


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #40 from Jan Hubicka hubicka at ucw dot cz 2012-11-14 23:54:44 
UTC ---

mgrid do not seem to be sensitive to --param min-inline-speedup, so it seems

independent regression of this change.

No idea what goes wrong.



Honza


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-11-14 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #41 from Jan Hubicka hubicka at gcc dot gnu.org 2012-11-15 
02:28:26 UTC ---

mgrid regression is now PR55334


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-11-12 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #37 from Jan Hubicka hubicka at gcc dot gnu.org 2012-11-12 
12:16:18 UTC ---

Fatigue now gets all inlining with -O3 -fwhole-program, with -O3 it gets only

half of inlining because jump functions are not able to track array descriptors

in both calls to generalized_hookes_law.  



What are the other testcases to look at?


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-11-12 Thread izamyatin at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



Igor Zamyatin izamyatin at gmail dot com changed:



   What|Removed |Added



 CC||izamyatin at gmail dot com



--- Comment #38 from Igor Zamyatin izamyatin at gmail dot com 2012-11-12 
12:44:43 UTC ---

Looks like for x86 r193331 led to significant regression on 172.mgrid for -m32

-O3 -funroll-loops


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-11-11 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #36 from Jan Hubicka hubicka at gcc dot gnu.org 2012-11-11 
18:14:40 UTC ---

Author: hubicka

Date: Sun Nov 11 18:14:35 2012

New Revision: 193406



URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=193406

Log:



PR middle-end/48636

* ipa-inline.c (want_inline_small_function_p): Take aray index hint.

(edge_badness): Likewise.

* ipa-inline.h (inline_hints_vals): Add array_index and comments.

(inline_summary_: Add ARRAY_INDEX.

* ipa-inline-analysis.c (dump_inline_hints): Dump array_index hint.

(reset_inline_summary): Handle array_index hint.

(inline_node_duplication_hook): Likewise.

(dump_inline_summary): Likewise.

(array_index_predicate): New function.

(estimate_function_body_sizes): Use it.

(estimate_node_size_and_time): Use array_index hint.

(inline_merge_summary, inline_read_section): Likewise.



Modified:

trunk/gcc/ChangeLog

trunk/gcc/ipa-inline-analysis.c

trunk/gcc/ipa-inline.c

trunk/gcc/ipa-inline.h


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-11-08 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #35 from Jan Hubicka hubicka at gcc dot gnu.org 2012-11-08 
16:46:28 UTC ---

Author: hubicka

Date: Thu Nov  8 16:46:18 2012

New Revision: 193331



URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=193331

Log:

PR middle-end/48636

* ipa-inline.c (big_speedup_p): New function.

(want_inline_small_function_p): Use it.

(edge_badness): Dump it.

* params.def (inline-min-speedup): New parameter.

* doc/invoke.texi (inline-min-speedup): Document.



Modified:

trunk/gcc/ChangeLog

trunk/gcc/doc/invoke.texi

trunk/gcc/ipa-inline.c

trunk/gcc/params.def

trunk/gcc/testsuite/ChangeLog

trunk/gcc/testsuite/gcc.dg/winline-3.c


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-11-07 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #33 from Jan Hubicka hubicka at gcc dot gnu.org 2012-11-07 
09:34:25 UTC ---

Created attachment 28628

  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=28628

Final patch (I hope)



This is version of path I will commit today or tomorrow (depending on when

autotesters gets Martin's aggregate changes). Of course it no longer helps

fatigue - we now inline the loop and do not hit the loop based heuristics.  I

made separate patch for this.  It however helps several other testcases and

should be win for Fortrain in general.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-11-07 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #34 from Jan Hubicka hubicka at gcc dot gnu.org 2012-11-07 
11:17:28 UTC ---

Created attachment 28629

  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=28629

Array index hint



This patch should help to inline when array descriptors become known, such as

in fatigue


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-10-28 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #30 from Jan Hubicka hubicka at gcc dot gnu.org 2012-10-28 
10:08:23 UTC ---

Created attachment 28543

  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=28543

Updated patch



Hi,

this is updated patch I am testing. It fixes the big speedup test and also

changes badness metric completely to be based on expected speedup of the

caller. It seems to work pretty well in my test with the catch that I still

can't beat 4.6 on tramp3d.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-10-28 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #31 from Jan Hubicka hubicka at gcc dot gnu.org 2012-10-28 
10:11:13 UTC ---

Concerning vincenzo's request about 4.7 version, it won't work - it depends on

improvements of inline metric and ipa-prop we made for 4.8


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-10-28 Thread vincenzo.innocente at cern dot ch

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #32 from vincenzo Innocente vincenzo.innocente at cern dot ch 
2012-10-28 11:27:22 UTC ---
In a small test (that I will eventually publish here) the new patch at -O2
looks superior to 4.7.2 at O3.
I would like to build a test with multiple source files where lto matters
though.
We will also try to build our whole software stack with 4.8 (we have a
production version with 4.7.2 at this point, so we can move to experimenting
builds with 4.8…)


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-10-20 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #27 from Jan Hubicka hubicka at gcc dot gnu.org 2012-10-20 
10:34:58 UTC ---

Thank you for testing. It seems that the patch works well for small benchmarks,

I will look into lapack/test_fpu slowdown.

There is problem that it really causes inacceptable growth on SPEC2k6 and 2k in

non-LTO mode.  I will need to analyze some of these testcases and see why we

predict so much of speedup when there are no benefits in runtime.



Jakub: the plan is to make ipa-cp to handle propagation across aggregates in

general (jump functions are already in place), that will handle the array

descriptors, too.



The fatigue is however different case - the values happens to be loop invariant

of the outer loop the function is called from. So inlining enables a lot of

invariant code motion.  This is similar to cray. Both these cases are now

understood by ipa-inline-analysis but the fact is not really used w/o this

patch.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-10-20 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #28 from Dominique d'Humieres dominiq at lps dot ens.fr 
2012-10-20 11:22:16 UTC ---

If I understand correctly the patch, the default value for

max-inline-min-speedup is 20. This could be over-agressive: for fatigue.f90 the

threshold is between 94 (fast) and 95 (slow). I see a similar threshold for

test_fpu.f90 from 100 (fast) to ~97 (slow), unfortunately above 94.



Also the choice of true or false for big_speedup_p is based on a test

't-t1a*t' which is equivalent to 't1=b*t' with b=1-a. Any reason for the

choice?



Last point, AFAICT the behavior of the different param tuning the inlining is

often non monotonic (I am trying to investigate that in more detail).


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-10-20 Thread tkoenig at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #29 from Thomas Koenig tkoenig at gcc dot gnu.org 2012-10-20 
12:10:49 UTC ---

Another approach (not for the benchmarks) would be to

make inlining tunable by the user, e.g. support



!GCC$ ATTRIBUTES always_inline :: procedure_name



See PR 41209.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-10-19 Thread vincenzo.innocente at cern dot ch


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #26 from vincenzo Innocente vincenzo.innocente at cern dot ch 
2012-10-19 08:45:03 UTC ---

I'm interested to test the patch on our large application currently compiled

with 4.7.2.

would it be possible to get the same patch against gcc-4_7-branch?

thanks


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-10-17 Thread jakub at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



Jakub Jelinek jakub at gcc dot gnu.org changed:



   What|Removed |Added



 CC||jakub at gcc dot gnu.org



--- Comment #23 from Jakub Jelinek jakub at gcc dot gnu.org 2012-10-17 
12:20:05 UTC ---

If the middle-end was hinted something about the array descriptors (what fields

in the struct are important and what not), perhaps IPA-CP could handle also

arguments pointing to array descriptors if all callers fill the descriptor in

certain important way (where important would be constant strides and/or

constant start/end in some of the dimension(s)).  I guess many Fortran routines

are just too large for inlining, but optimizing constant strides etc. might

still be useful.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-10-17 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #24 from Dominique d'Humieres dominiq at lps dot ens.fr 
2012-10-17 13:13:24 UTC ---

Summary for the polyhedron tests (pb05):



(a) revision 192449 unpatched

(b) revision 192516 with patch in comment #20

options: -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear \

 -fomit-frame-pointer -fwhole-program -flto -fno-tree-loop-if-convert



Benchmark CompileExecutable  Ave Run   

 Name  (secs)   (bytes)   (secs)  

  (a)   (b)(a) (b)(a)   (b) 

-   ---   ---  --  --   ---   ---

   ac  6.30  9.66   54904   54856  9.12  8.49

   aermod129.56178.13 1158904 1527680 17.69 17.28

  air 25.64 29.95  102752  106712  7.15  7.16

 capacita  6.34 18.99   69096  130536 42.00 40.57

  channel  2.99  3.51   34736   34736  3.03  3.02

doduc 19.51 23.70  163528  196016 27.54 27.65

  fatigue  7.03  7.54   69312   69248  4.92  2.94

  gas_dyn 14.65 14.91   99320   99280  3.72  3.88

   induct 19.50 21.02  149160  161552 13.18 13.14

linpk  1.84  2.91   22008   17832 21.65 21.69

 mdbx  6.59 10.14   68800   80968 12.85 12.78

   nf  6.34 24.25   59544  104544 18.95 18.88

  protein 22.27 46.34  119056  156048 35.65 35.41

   rnflow 17.75 31.23  114600  147280 23.56 23.54

 test_fpu 10.69 20.66   76640  117512 10.89 10.95

 tfft  1.62  2.93   22424   22384  3.34  3.33



Geometric Mean Execution Time =  (a) 11.88 seconds (b) 11.43 seconds



I also see many failures for the gcc.dg/tree-ssa/slsr-* tests: slsr-2.c to

slsr-11.c, slsr-14.c to slsr-20.c, slsr-24.c, and slsr-25.c, and for

gfortran.dg/vect/vect-8.f90: the loop



do while(ii   1)

ipnt= ipntp

ipntp= ipntp+ii

ii= ishft(ii,-1)

i= ipntp+1

!dir$ vector always

   x(ipntp+2:ipntp+ii+1)=x(ipnt+2:ipntp:2)-v(ipnt+2:ipntp:2) 

 *x(ipnt+1:ipntp-1:2)-v(ipnt+3:ipntp+1:2)*x(ipnt+3:ipntp+1:2)

END DO



is not vectorized because



154: not vectorized: complicated access pattern.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-10-17 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #25 from Dominique d'Humieres dominiq at lps dot ens.fr 
2012-10-17 14:05:51 UTC ---

 I also see many failures for the gcc.dg/tree-ssa/slsr-* tests: slsr-2.c to

 slsr-11.c, slsr-14.c to slsr-20.c, slsr-24.c, and slsr-25.c, and for

 gfortran.dg/vect/vect-8.f90: the loop ...



This seems to be a glitch which appeared between r192440 and r192516 and

disappeared before or at r192531.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-10-16 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #20 from Jan Hubicka hubicka at gcc dot gnu.org 2012-10-16 
16:38:27 UTC ---

Created attachment 28456

  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=28456

Path I am considering



Hi,

I am considering to enable inlining when inline-analysis says that the inline

function will get significantly fater regardless the

inline-insns-single/inline-sinsns-auto limits. The patch is attached.



In general it may make more sense to gradually set the limits based on expected

speedup but I am affraid this will become hard to understand and maintain. So

for now lets have simple boolean decisions.



It would be nice to know where it helps (i.e. for fatigue and cray) and where

it doesn't.  It also causes quite considerable code size growth on some of

SPEC2000 for relatively little benefit, so I guess it will need more evaulation

and reduction of inline-insns-auto limits. It also may be problem in

unrealistic estimates in ipa-inline-analysis, this is first time we take them

really seriously.



Comments/ideas are welcome.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-10-16 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #21 from Dominique d'Humieres dominiq at lps dot ens.fr 
2012-10-16 17:57:52 UTC ---

Before the patch in comment #20, I get



-rwxr-xr-x 1 dominiq staff 73336 Oct 16 19:19 a.out*

[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops

-ftree-loop-linear -fomit-frame-pointer --param max-inline-insns-auto=150

-fwhole-program -flto -fno-tree-loop-if-convert fatigue.f90

8.485u 0.205s 0:08.73 99.4%0+0k 0+29io 0pf+0w

[macbook] lin/test% ll a.out

-rwxr-xr-x 1 dominiq staff 73336 Oct 16 19:19 a.out*

[[macbook] lin/test% time a.out  /dev/null

2.916u 0.003s 0:02.92 99.6%0+0k 0+1io 0pf+0w



[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops

-ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto

-fno-tree-loop-if-convert fatigue.f90

6.822u 0.193s 0:07.06 99.2%0+0k 0+30io 0pf+0w

[macbook] lin/test% ll a.out   

 

-rwxr-xr-x 1 dominiq staff 69312 Oct 16 19:21 a.out*

[macbook] lin/test% time a.out  /dev/null

4.851u 0.004s 0:04.86 99.7%0+0k 0+1io 0pf+0w



After the patch I get



[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops

-ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto

-fno-tree-loop-if-convert fatigue.f90

7.277u 0.217s 0:07.52 99.4%0+0k 0+28io 0pf+0w

[macbook] lin/test% ll a.out-rwxr-xr-x 1 dominiq staff 69248 Oct 16 19:46

a.out*

[macbook] lin/test% time a.out  /dev/null

2.912u 0.003s 0:02.91 100.0%0+0k 0+2io 0pf+0w



So for this particular test with the same options, after the patch the

compilation time is ~6% slower, the size is about the same (actually smaller;-)

and the run time ~40% faster. Without the patch and with --param

max-inline-insns-auto=150 compared to with the patch without this option, the

compilation time is ~20% slower, the size is ~6% larger, and the runtime is the

same.



Further testing coming, thanks for the patch.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-10-16 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636



--- Comment #22 from Dominique d'Humieres dominiq at lps dot ens.fr 
2012-10-16 20:58:58 UTC ---

With the patch I see a ~10% slowdown in the Test4 - Lapack 2 (1001x1001) of

test_fpu.f90 compared to revision 192449



[macbook] lin/test% time /opt/gcc/gcc4.8c/bin/gfortran -fprotect-parens -Ofast

-funroll-loops test_lap.f90

6.742u 0.097s 0:06.87 99.4%0+0k 0+20io 0pf+0w

[macbook] lin/test% a.out

  Benchmark running, hopefully as only ACTIVE task

Test4 - Lapack 2 (1001x1001) inverts  2.6 sec  Err= 0.250

 total =  2.6 sec



[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-all-loops

test_lap.f90

9.489u 0.116s 0:09.62 99.6%0+0k 0+16io 0pf+0w

[macbook] lin/test% a.out

  Benchmark running, hopefully as only ACTIVE task

Test4 - Lapack 2 (1001x1001) inverts  2.8 sec  Err= 0.250

 total =  2.8 sec



This looks similar to what I saw in comment #5. However now dgetri is never

inlined while dgetrf is inlined with the patch. Also dtrmv and dscal are

inlined with the patch (respectively 20 and 21 occurrences without the patch).

The last difference I see is 35 occurrences of dswap with the patch compared to

32 without.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-09-12 Thread hubicka at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #19 from Jan Hubicka hubicka at gcc dot gnu.org 2012-09-12 
21:51:21 UTC ---
Author: hubicka
Date: Wed Sep 12 21:51:14 2012
New Revision: 191232

URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=191232
Log:

PR fortran/48636
* gcc.dg/ipa/inlinehint-2.c: New testcase.
* ipa-inline-analysis.c (dump_inline_hints): Dump loop stride.
(set_hint_predicate): New function.
(reset_inline_summary): Reset loop stride.
(remap_predicate_after_duplication): New function.
(remap_hint_predicate_after_duplication): New function.
(inline_node_duplication_hook): Update.
(dump_inline_summary): Dump stride summaries.
(estimate_function_body_sizes): Compute strides.
(remap_hint_predicate): New function.
(inline_merge_summary): Use it.
(inline_read_section): Read stride.
(inline_write_summary): Write stride.
* ipa-inline.c (want_inline_small_function_p): Handle strides.
(edge_badness): Likewise.
* ipa-inline.h (inline_hints_vals): Add stride hint.
(inline_summary): Update stride.

Added:
trunk/gcc/testsuite/gcc.dg/ipa/inlinehint-2.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-inline-analysis.c
trunk/gcc/ipa-inline.c
trunk/gcc/ipa-inline.h
trunk/gcc/testsuite/ChangeLog


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-08-21 Thread hubicka at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #17 from Jan Hubicka hubicka at gcc dot gnu.org 2012-08-21 
06:54:09 UTC ---
Author: hubicka
Date: Tue Aug 21 06:54:01 2012
New Revision: 190556

URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190556
Log:
PR fortran/48636
* ipa-inline.c (want_inline_small_function_p): Take loop_iterations hint.
(edge_badness): Likewise.
* ipa-inline.h (inline_hints_vals): Add INLINE_HINT_loop_iterations.
(inline_summary): Add loop_iterations.
* ipa-inline-analysis.c: Include tree-scalar-evolution.h.
(dump_inline_hints): Dump loop_iterations.
(reset_inline_summary): Free loop_iterations.
(inline_node_duplication_hook): Update loop_iterations.
(dump_inline_summary): Dump loop_iterations.
(will_be_nonconstant_expr_predicate): New function.
(estimate_function_body_sizes): Analyze loops.
(estimate_node_size_and_time): Set hint loop_iterations.
(inline_merge_summary): Merge loop iterations.
(inline_read_section): Stream in loop_iterations.
(inline_write_summary): Stream out loop_iterations.

Added:
trunk/gcc/testsuite/gcc.dg/ipa/inlinehint-1.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-inline-analysis.c
trunk/gcc/ipa-inline.c
trunk/gcc/ipa-inline.h
trunk/gcc/testsuite/ChangeLog


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-08-21 Thread hubicka at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #18 from Jan Hubicka hubicka at gcc dot gnu.org 2012-08-21 
08:14:33 UTC ---
With loop_iterations hint, we should now hint the bar function of testcase in
comment #4, but we don't because the value is used conditionally:

  # iftmp.11_3 = PHI _12(3), 1(2)
  a.0_16 = a_11(D)-data;
  _17 = a_11(D)-dim[0].ubound;
  _18 = a_11(D)-dim[0].lbound;
  _19 = _17 - _18;
  ubound.0_20 = _19 + 1;
  stride.3_21 = a_11(D)-dim[1].stride;
  _22 = a_11(D)-dim[1].ubound;
  _23 = a_11(D)-dim[1].lbound;
  _24 = _22 - _23;
  ubound.2_25 = _24 + 1;
  _27 = -iftmp.11_3;
  offset.4_28 = _27 - stride.3_21;
  *x_35(D) = 0.0;
  _53 = stride.3_21 = 0;
  _56 = ubound.2_25  0;
  _57 = _53  _56;
  _59 = stride.3_21  0;
  _60 = _57 | _59;
  _66 = _59 | _60;
  if (_66 != 0)
goto bb 5;
  else
goto bb 6;

  bb 5:
  iftmp.13_68 = (integer(kind=4)) ubound.2_25;

  bb 6:
  # iftmp.13_4 = PHI iftmp.13_68(5), 0(4)
  if (iftmp.13_4  0)
goto bb 7;
  else
goto bb 14;

Martin, does this work with your PHI patch? (i.e. do you get loop iterations
in -fdump-ipa-inline?)

Next step will be to teach inliner to inline into functions called once when
callee is important and some propagation happens.

Honza


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-08-11 Thread jamborm at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #16 from Martin Jambor jamborm at gcc dot gnu.org 2012-08-11 
10:50:29 UTC ---
Author: jamborm
Date: Sat Aug 11 10:50:24 2012
New Revision: 190313

URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190313
Log:
2012-08-11  Martin Jambor  mjam...@suse.cz

PR fortran/48636
* ipa-inline.h (condition): New fields offset, agg_contents and by_ref.
* ipa-inline-analysis.c (agg_position_info): New type.
(add_condition): New parameter aggpos, also store agg_contents, by_ref
and offset.
(dump_condition): Also dump aggregate conditions.
(evaluate_conditions_for_known_args): Also handle aggregate
conditions.  New parameter known_aggs.
(evaluate_properties_for_edge): Gather known aggregate contents.
(inline_node_duplication_hook): Pass NULL known_aggs to
evaluate_conditions_for_known_args.
(unmodified_parm): Split into unmodified_parm and unmodified_parm_1.
(unmodified_parm_or_parm_agg_item): New function.
(set_cond_stmt_execution_predicate): Handle values passed in
aggregates.
(set_switch_stmt_execution_predicate): Likewise.
(will_be_nonconstant_predicate): Likewise.
(estimate_edge_devirt_benefit): Pass new parameter known_aggs to
ipa_get_indirect_edge_target.
(estimate_calls_size_and_time): New parameter known_aggs, pass it
recrsively to itself and to estimate_edge_devirt_benefit.
(estimate_node_size_and_time): New vector known_aggs, pass it o
functions which need it.
(remap_predicate): New parameter offset_map, use it to remap aggregate
conditions.
(remap_edge_summaries): New parameter offset_map, pass it recursively
to itself and to remap_predicate.
(inline_merge_summary): Also create and populate vector offset_map.
(do_estimate_edge_time): New vector of known aggregate contents,
passed to functions which need it.
(inline_read_section): Stream new fields of condition.
(inline_write_summary): Likewise.
* ipa-cp.c (ipa_get_indirect_edge_target): Also examine the aggregate
contents.  Let all local callers pass NULL for known_aggs.

* testsuite/gfortran.dg/pr48636.f90: New test.


Added:
trunk/gcc/testsuite/gfortran.dg/pr48636.f90
Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-cp.c
trunk/gcc/ipa-inline-analysis.c
trunk/gcc/ipa-inline.h
trunk/gcc/ipa-prop.h
trunk/gcc/testsuite/ChangeLog


[Bug fortran/48636] Enable more inlining with -O2 and higher

2012-07-03 Thread jamborm at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #15 from Martin Jambor jamborm at gcc dot gnu.org 2012-07-03 
17:43:35 UTC ---
Hi,

(In reply to comment #12)
 Hi,
 I discussed some of the issues today with Martin.  For the array descriptor
 testcase, we really want ipa-cp to be propagate the constant array bounds
 instead of making Inliner to blindly inline enough in all cases.
 For that we need
 
   1) Make ipa-prop to work on aggregates.  For aggregates passed by value we
 can have jump functions that define known constants at known offsets

I have posted a patch implementing this yesterday:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00039.html

   2) Make ipa-inline-analysis to produce predicates on constantness of
 aggregate fields in the same format

I have posted a patch implementing this yesterday too:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00041.html

With this patch, the test case from comment #4 is inlined at -O3
without any parameter tweaking.  It is in fact a testcase in that
patch.  However, functions in real benchmarks are bigger.  We will see
to what extent IPA-CP can help with them.

   3) Array descriptors are passed by reference, rather than by value.  This
 need further work, since need to be sure that the value passed does not change
 by aliasing.  IPA-SRA would help here if it was really SRA and we had
 -fwhole-program, but that is weak. We would need IPA-PTA to solve this in
 general. Perhaps frontend could help us here since the descriptors are 
 probably
 constant after they are initialized, or is there way to change existing
 descriptor?

At the moment I'm relying on a slightly sophisticated intra-PTA and
TBAA.  I'll try to investigate where this does not work in Fortran and
perhaps will have some suggestions afterwards.

   4) Make ipa-inline-analysis to understand that determining loop bounds is
 very cool to do.

Yep, knowing what constants should get a profitability boost would
be indeed very beneficial.

Meanwhile, the following patch also helps ipa-inline-analysis.c with
cases like fatigue2:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00052.html


[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-06-04 Thread hubicka at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #14 from Jan Hubicka hubicka at gcc dot gnu.org 2011-06-04 
18:06:01 UTC ---
Yeah, the fnspec issue is something we ought to solve. ipa-cp should be
effective on fortran so it should not disable itself there ;)


[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-05-04 Thread hubicka at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #12 from Jan Hubicka hubicka at gcc dot gnu.org 2011-05-04 
16:09:19 UTC ---
Hi,
I discussed some of the issues today with Martin.  For the array descriptor
testcase, we really want ipa-cp to be propagate the constant array bounds
instead of making Inliner to blindly inline enough in all cases.
For that we need

  1) Make ipa-prop to work on aggregates.  For aggregates passed by value we
can have jump functions that define known constants at known offsets
  2) Make ipa-inline-analysis to produce predicates on constantness of
aggregate fields in the same format
  3) Array descriptors are passed by reference, rather than by value.  This
need further work, since need to be sure that the value passed does not change
by aliasing.  IPA-SRA would help here if it was really SRA and we had
-fwhole-program, but that is weak. We would need IPA-PTA to solve this in
general. Perhaps frontend could help us here since the descriptors are probably
constant after they are initialized, or is there way to change existing
descriptor?
  4) Make ipa-inline-analysis to understand that determining loop bounds is
very cool to do.

I also looked into the dumps of fatigue. One obvious problem is that we
overestimate stack sizes by about factor of 10.  This seems to be mostly due to
I/O routines structures that gets packed later. 
We used to take results of stack frame packing, but Steven reverted this
behaviour and now we estimate stack sizes by simply summing up the size of
local arrays. I wonder, perhaps we want to revert to original way at least when
optimizing and when generating summary for late inliner (early inliner probably
does not care and Steven's main concern was that this is computed 3 times,
twice for early inliner and once for real inliner).

Honza


[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-05-04 Thread burnus at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #13 from Tobias Burnus burnus at gcc dot gnu.org 2011-05-04 
17:30:31 UTC ---
(In reply to comment #12)
 Perhaps frontend could help us here since the descriptors are probably
 constant after they are initialized, or is there way to change existing
 descriptor?

Only if the dummy/formal argument is a pointer or an allocatable.

Regarding ipa-cp, wasn't there a problem with fn spec (cf. PR 45579)? And
many Fortran procedures have this attribute.

  This seems to be mostly due to I/O routines structures that gets packed 
 later.

We really need to start reusing them ... (Cf. PR 48419, PR 34705, and
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg01762.html )


[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-04-20 Thread jb at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

Janne Blomqvist jb at gcc dot gnu.org changed:

   What|Removed |Added

 CC||jb at gcc dot gnu.org

--- Comment #6 from Janne Blomqvist jb at gcc dot gnu.org 2011-04-20 11:21:15 
UTC ---
Note that some of these issues might change with the new array descriptor that
we must introduce at some point (the hope is that it'll get in for 4.7, but it
remains to be seen if there is enough time). See

http://gcc.gnu.org/wiki/ArrayDescriptorUpdate

For instance (comments inline):

(In reply to comment #4)
 (In reply to comment #3)
 
  The second item is interesting - it would be cool if backend was able to 
  work
  out that the code is supposed to simplify after inlining. Either by itself 
  or
  by frontend hint.
  Can you provide me very simple testcase for that I can look into how it 
  looks
  like in backend?  Perhaps some kind of frontend hinting would work well 
  here.
 
 Here is some sample code (extreme, I admit) which profits a lot from
 inlining:
 
 - Strides are known to be one when inlining (a common case, but you can
   never be sure if the user doesn't call a(1:5:2))

Not strictly related to inlining, but in the new descriptor we'll have a field
specifying whether the array is simply contiguous, so it might make sense to
generate two loops for each loop over the array in the source, one for the
contiguous case where it can be vectorized etc. and another loop for the
general case.  This might reduce the profitability of inlining.

 - Expensive setting up of, and reading from the array descriptor

As we're planning to use the TR 29113 descriptor as the native one, this has
some implications for the procedure call interface as well. See

http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html

This will reduce the procedure call overhead substantially, at the cost of some
extra work in the caller in the case of non-default lower bounds.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-04-20 Thread burnus at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #7 from Tobias Burnus burnus at gcc dot gnu.org 2011-04-20 
12:29:02 UTC ---
(In reply to comment #6)
  Here is some sample code (extreme, I admit) which profits a lot from
  inlining:
  
  - Strides are known to be one when inlining (a common case, but you can
never be sure if the user doesn't call a(1:5:2))

First, you do not have any issue with strides if the dummy argument is either
allocatable, has the contiguous attribute, or is an explicit or assumed-sized
array.

For inlining, I see only one place where information loss happens: If a
simply-contiguous array is passed as actual argument to a assumed-shape dummy.
Then the Fortran front-end knows that the stride of the actual argument is 1,
but the callee needs to assume an arbitrary stride. The middle-end will
continue to do so as the simply contiguous information is lost - even though
it would be profitable for inlining.


 Not strictly related to inlining, but in the new descriptor we'll have a field
 specifying whether the array is simply contiguous

I am not sure we will indeed have one; initially I thought one should, but I am
no longer convinced that it is the right approach. My impression is now that
setting and updating the flag all the time is more expensive then doing once a
is_contiguous() check. The TR descriptor also does not such an flag - thus one
needs to handle such arrays - if they come from C - with extra care. (Unless
one requires the C side to call a function, which could set this flag. I think
one does not need to do so.)


By the way, the latest version of the TR draft is linked at
http://j3-fortran.org/pipermail/interop-tr/2011-April/000582.html


 so it might make sense to
 generate two loops for each loop over the array in the source, one for the
 contiguous case where it can be vectorized etc. and another loop for the
 general case.

Maybe. Definitely not for -Os. Best would be if the middle end would be able to
generate automatically a stride-free version when it thinks that it is
profitable. The FE could also do it, if one had a way to tell the ME that it
might drop the stride-free version, if it thinks that it is more profitable.


 As we're planning to use the TR 29113 descriptor as the native one, this has
 some implications for the procedure call interface as well. See
 http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html

Regarding:
For a descriptor of an assumed-shape array, the value of the
lower-bound member of each element of the dim member of the descriptor
shall be zero.

That's actually also not that different from the current situation: In Fortran,
the lower bound of assumed-shape arrays is also always the same: It is 1. Which
makes sense as on can then do the following w/o worrying about the lbound:
  subroutine bar(a)
real :: a(:)
do i = 1, ubound(a, dim=1)
  a(i) = ...

For explicit-shape/assumed-size arrays one does not have a descriptor and for
deferred-shape arrays (allocatables, pointers) the TR keeps the lbound - which
is the same as currently in Fortran.

 This will reduce the procedure call overhead substantially, at the cost
 of some extra work in the caller in the case of non-default lower bounds.

Which is actually nothing new ... That's the reason that one often creates a
new descriptor for procedure calls.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-04-20 Thread jb at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #8 from Janne Blomqvist jb at gcc dot gnu.org 2011-04-20 13:09:51 
UTC ---
(In reply to comment #7)
 (In reply to comment #6)
   Here is some sample code (extreme, I admit) which profits a lot from
   inlining:
   
   - Strides are known to be one when inlining (a common case, but you can
 never be sure if the user doesn't call a(1:5:2))
 
 First, you do not have any issue with strides if the dummy argument is either
 allocatable, has the contiguous attribute, or is an explicit or assumed-sized
 array.
 
 For inlining, I see only one place where information loss happens: If a
 simply-contiguous array is passed as actual argument to a assumed-shape dummy.
 Then the Fortran front-end knows that the stride of the actual argument is 1,
 but the callee needs to assume an arbitrary stride. The middle-end will
 continue to do so as the simply contiguous information is lost - even though
 it would be profitable for inlining.

Passing simply contiguous arrays to assumed-shape dummies is a fairly common
case in modern Fortran, so it would be nice if we could generate fast code
for this.

  Not strictly related to inlining, but in the new descriptor we'll have a 
  field
  specifying whether the array is simply contiguous
 
 I am not sure we will indeed have one; initially I thought one should, but I 
 am
 no longer convinced that it is the right approach. My impression is now that
 setting and updating the flag all the time is more expensive then doing once a
 is_contiguous() check.

Hmm, maybe. Shouldn't it be necessary to update the contiguous flag only when
passing slices to procedures with explicit interfaces? But OTOH, calculating
whether an array is simply contiguous at procedure entry is just a few
arithmetic operations anyway. But, in any case I don't have any profiling data
to argue which approach would be better. 

 The TR descriptor also does not such an flag - thus one
 needs to handle such arrays - if they come from C - with extra care. (Unless
 one requires the C side to call a function, which could set this flag. I think
 one does not need to do so.)

I suppose one cannot require the C side to set such a flag, as the TR doesn't
require its presence? Thus we'd need to calculate whether the array is simply
contiguous anyway if it's possible the array comes from C. Do such procedures
have to be marked with BIND(C) in some way or how does this work? 

In any case, maybe this is what tips it in favor of always calculating the
contiguousness instead of having a flag in the descriptor - it would be one
single way of handling it, reducing the possibility of bugs. Also, if the
contigousness isn't used for anything in the procedure, the dead code
elimination should delete it anyway.

  As we're planning to use the TR 29113 descriptor as the native one, this has
  some implications for the procedure call interface as well. See
  http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html
 
 Regarding:
 For a descriptor of an assumed-shape array, the value of the
 lower-bound member of each element of the dim member of the descriptor
 shall be zero.
 
 That's actually also not that different from the current situation: In 
 Fortran,
 the lower bound of assumed-shape arrays is also always the same: It is 1.

Yes. But what is different from the current situation is that the above is what
the standard requires semantically, and the implementation is free to implement
it as it sees fit. In the TR, OTOH, we have the explicit requirement that on
procedure entry the lower bounds in the descriptor should be 0. This of course
applies only to inter-operable procedures, for pure Fortran we're still free
to do as we please. But again, it might make sense to do it the same way in
both cases in order to reduce the implementation and maintenance burden.

 For explicit-shape/assumed-size arrays one does not have a descriptor and for
 deferred-shape arrays (allocatables, pointers) the TR keeps the lbound - which
 is the same as currently in Fortran.

Yes.

  This will reduce the procedure call overhead substantially, at the cost
  of some extra work in the caller in the case of non-default lower bounds.
 
 Which is actually nothing new ... That's the reason that one often creates a
 new descriptor for procedure calls.

But do we actually do this? I did some tests a while ago, and IIRC for assumed
shape dummy arguments the procedure always calculates new bounds such that they
start from 1. That is, the procedure assumes that the actual argument
descriptor may have lower bounds != 1. 

So my argument is basically that with the new descriptor it might make sense to
switch the responsibility around such that it's the caller who makes sure that
all lower bounds are 0 (as we must have the capability to do this anyway in
order to call inter-operable procedures, no?) instead of the callee.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-04-20 Thread burnus at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #9 from Tobias Burnus burnus at gcc dot gnu.org 2011-04-20 
15:39:47 UTC ---
 But do we actually do this? I did some tests a while ago, and IIRC for assumed
 shape dummy arguments the procedure always calculates new bounds such that 
 they
 start from 1. That is, the procedure assumes that the actual argument
 descriptor may have lower bounds != 1. 
 So my argument is basically that with the new descriptor it might make sense 
 to
 switch the responsibility around such that it's the caller who makes sure that
 all lower bounds are 0 (as we must have the capability to do this anyway in
 order to call inter-operable procedures, no?) instead of the callee.

No, the conversion is already done in the caller:

subroutine bar(B)
  interface;   subroutine foo(a); integer :: a(:); end subroutine foo
  end interface
integer :: B(:)
call foo(B)
end subroutine bar

Shows:
parm.4.dim[0].lbound = 1;
[...]
foo (parm.4);
For assumed-shape actual arguments, creating a new descriptor is actually not
needed - only for deferred shape ones - or if one does not have a full array
ref.
Cf. gfc_conv_array_parameter, which is called by gfc_conv_procedure_call.

However, some additional calculation is also done in the the callee to
determine the stride and offset; e.g.
  ubound.0 = (b-dim[0].ubound - b-dim[0].lbound) + 1;
again, if the dummy argument is not deferred-shaped (allocatable or pointer),
one actually knows that b-dim[0].lbound == 1. I think we have some
redundancy here - missed optimization.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-04-20 Thread tkoenig at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #10 from Thomas Koenig tkoenig at gcc dot gnu.org 2011-04-20 
16:40:46 UTC ---
(In reply to comment #6)

 Not strictly related to inlining, but in the new descriptor we'll have a field
 specifying whether the array is simply contiguous, so it might make sense to
 generate two loops for each loop over the array in the source, one for the
 contiguous case where it can be vectorized etc. and another loop for the
 general case.  This might reduce the profitability of inlining.

Consider the following, hand-crafted matmul:

Here, we have three nested loops. The most interesting one is
the innermost loop of the matmul, which we vectorize by inlining if we omit
the call to my_matmul with non-unity stride for a when compiling with
-fwhole-program -O3.

How many versions of the loop should we generate?  Two or eight, depending
on what the caller may do? ;-)

module foo
  implicit none
contains
  subroutine my_matmul(a,b,c)
implicit none
integer :: count, m, n
real, dimension(:,:), intent(in) :: a,b
real, dimension(:,:), intent(out) :: c
integer :: i,j,k

m = ubound(a,1)
n = ubound(b,2)
count = ubound(a,2)
c = 0
do j=1,n
   do k=1, count
  do i=1,m
 c(i,j) = c(i,j) + a(i,k) * b(k,j)
  end do
   end do
end do
  end subroutine my_matmul
end module foo

program main
  use foo
  implicit none
  integer, parameter :: factor=100
  integer, parameter :: n = 2*factor, m = 3*factor, count = 4*factor
  real, dimension(m, count) :: a
  real, dimension(count, n) :: b
  real, dimension(m,n) :: c1, c2
  real, dimension(m/2, n) :: ch_1, ch_2

  call random_number(a)
  call random_number(b)
  call my_matmul(a,b,c1)
  c2 = matmul(a,b)
  if (any(abs(c1 - c2)  1e-5)) call abort
  call my_matmul(a(1:m:2,:),b,ch_1)
  ch_2 = matmul(a(1:m:2,:),b)
  if (any(abs(ch_1 - ch_2)  1e-5)) call abort
end program main


[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-04-20 Thread jb at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #11 from Janne Blomqvist jb at gcc dot gnu.org 2011-04-20 
18:14:20 UTC ---
(In reply to comment #9)
  But do we actually do this? I did some tests a while ago, and IIRC for 
  assumed
  shape dummy arguments the procedure always calculates new bounds such that 
  they
  start from 1. That is, the procedure assumes that the actual argument
  descriptor may have lower bounds != 1. 
  So my argument is basically that with the new descriptor it might make 
  sense to
  switch the responsibility around such that it's the caller who makes sure 
  that
  all lower bounds are 0 (as we must have the capability to do this anyway in
  order to call inter-operable procedures, no?) instead of the callee.
 
 No, the conversion is already done in the caller:
 
 subroutine bar(B)
   interface;   subroutine foo(a); integer :: a(:); end subroutine foo
   end interface
 integer :: B(:)
 call foo(B)
 end subroutine bar
 
 Shows:
 parm.4.dim[0].lbound = 1;
 [...]
 foo (parm.4);
 For assumed-shape actual arguments, creating a new descriptor is actually not
 needed - only for deferred shape ones - or if one does not have a full array
 ref.
 Cf. gfc_conv_array_parameter, which is called by gfc_conv_procedure_call.
 
 However, some additional calculation is also done in the the callee to
 determine the stride and offset; e.g.
   ubound.0 = (b-dim[0].ubound - b-dim[0].lbound) + 1;
 again, if the dummy argument is not deferred-shaped (allocatable or pointer),
 one actually knows that b-dim[0].lbound == 1. I think we have some
 redundancy here - missed optimization.

Yes, there seems to be some redundancy indeed in that case. I dug up my
testcase:

module asstest
  implicit none
contains
  subroutine assub(a, r)
real, intent(in) :: a(:,:)
real, intent(out) :: r

r = a(42,43)
  end subroutine assub

  subroutine assub2(a, r)
real, intent(in), allocatable :: a(:,:)
real, intent(out) :: r

r = a(42,43)
  end subroutine assub2
end module asstest

The -fdump-tree-original tree for this module is:

assub2 (struct array2_real(kind=4)  a, real(kind=4)  r)
{
  *r = (*(real(kind=4)[0:] *) a-data)[(a-dim[0].stride * 42 +
a-dim[1].stride * 43) + a-offset];
}


assub (struct array2_real(kind=4)  a, real(kind=4)  r)
{
  integer(kind=8) ubound.0;
  integer(kind=8) stride.1;
  integer(kind=8) ubound.2;
  integer(kind=8) stride.3;
  integer(kind=8) offset.4;
  integer(kind=8) size.5;
  real(kind=4)[0:D.1567] * a.0;
  integer(kind=8) D.1567;
  bit_size_type D.1568;
  unnamed-unsigned:64 D.1569;

  {
integer(kind=8) D.1566;

D.1566 = a-dim[0].stride;
stride.1 = D.1566 != 0 ? D.1566 : 1;
a.0 = (real(kind=4)[0:D.1567] *) a-data;
ubound.0 = (a-dim[0].ubound - a-dim[0].lbound) + 1;
stride.3 = a-dim[1].stride;
ubound.2 = (a-dim[1].ubound - a-dim[1].lbound) + 1;
size.5 = stride.3 * NON_LVALUE_EXPR ubound.2;
offset.4 = -stride.1 - NON_LVALUE_EXPR stride.3;
D.1567 = size.5 + -1;
D.1568 = (bit_size_type) size.5 * 32;
D.1569 = (unnamed-unsigned:64) size.5 * 4;
  }
  *r = (*a.0)[(stride.1 * 42 + stride.3 * 43) + offset.4];
}

So if we make sure that the caller fixes up the descriptor so that bounds are
correct for assumed-shape parameters (as the TR requires for inter-operable
procedures), then assub could be as simple as assub2.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-04-17 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #2 from Dominique d'Humieres dominiq at lps dot ens.fr 2011-04-17 
10:23:03 UTC ---
As shown by the following results it seems that --param max-inline-insns-auto=*
is the way to go.

Date  Time : 17 Apr 2011 11:22:05
Test Name   : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear
-fomit-frame-pointer --param max-inline-insns-auto=400 -fwhole-program -flto
-fstack-arrays -o %n
Benchmarks  : ac aermod air capacita channel doduc fatigue gas_dyn induct
linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :  300.0
Target Error %  :  0.200
Minimum Repeats : 2
Maximum Repeats : 5

   Benchmark   Compile  Executable   Ave Run  Number   Estim
Name(secs) (bytes)(secs) Repeats   Err %
   -   ---  --   --- ---  --
  ac  8.07   54576  8.11   2  0.0062
  aermod175.22 1472624 18.83   2  0.1647
 air 25.65   89992  6.78   5  0.1871
capacita 14.02  109536 40.36   2  0.0483
 channel  3.11   34448  2.94   5  0.6012
   doduc 29.46  224584 27.44   2  0.0437
 fatigue  9.85   77032  2.74   2  0.0365
 gas_dyn 26.09  144112  4.68   5  0.6928
  induct 24.32  189696 14.24   2  0.1193
   linpk  3.13   21536 21.69   2  0.0254
mdbx  9.18   84776 12.55   2  0.0678
  nf 34.14  124640 18.38   2  0.1034
 protein 28.14  155624 35.48   2  0.0789
  rnflow 43.93  204176 26.70   2  0.0262
test_fpu 21.90  141696 11.18   2  0.0045
tfft  1.71   22072  3.29   5  0.1369

Geometric Mean Execution Time =  11.60 seconds



Date  Time : 17 Apr 2011 11:50:20
Test Name   : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear
-fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -fstack-arrays -o
%n
Benchmarks  : ac aermod air capacita channel doduc fatigue gas_dyn induct
linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :  300.0
Target Error %  :  0.200
Minimum Repeats : 2
Maximum Repeats : 5

   Benchmark   Compile  Executable   Ave Run  Number   Estim
Name(secs) (bytes)(secs) Repeats   Err %
   -   ---  --   --- ---  --
  ac  8.06   54576  8.11   2  0.0062
  aermod175.54 1480632 18.92   2  0.0106
 air 25.36   89992  6.76   2  0.0740
capacita 13.95  109536 40.32   2  0.0161
 channel  3.13   34448  2.95   5  0.1703
   doduc 27.31  212280 27.18   2  0.0331
 fatigue  9.82   77032  2.74   2  0.0182
 gas_dyn 24.86  144112  4.67   5  0.3052
  induct 24.25  189696 14.21   2  0.0035
   linpk  2.55   21536 21.69   2  0.0023
mdbx  9.17   84776 12.53   2  0.0239
  nf 34.21  124640 18.41   4  0.1634
 protein 28.01  155624 35.46   2  0.0310
  rnflow 38.11  183696 26.74   2  0.0037
test_fpu 19.63  141720 10.84   2  0.0323
tfft  1.69   22072  3.29   2  0.0152

Geometric Mean Execution Time =  11.57 seconds




[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-04-17 Thread hubicka at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #3 from Jan Hubicka hubicka at gcc dot gnu.org 2011-04-17 
10:44:23 UTC ---
I am slowly starting to look into fortran issues now.  For years it was
non-issue since we had the non-one-decl-per-function problem. This is finally
solved

One additional problem is that we often hit large-stack frame limits because
the fortran i/o drops large datastructure on stack.  Consequently any functions
that do i/o (for debug purposes, for example) are not inlined into functions
that doesn't.  We will need to relax this.

- Consider heuristics to mark functions as inline in the front end:

- If it has many arguments (argument processing has a lot of effect)

- If it uses assumed-shape arrays (setting up that array descriptor
  may take a lot of time

- Mark everything as inline

I briefly discussed option of marking everything as inline on IRC for 4.6.x
series but it did not go well with Richard. Observation is that dominating
coding style in fortran is to not care that much about code size if perfomrance
improve and inlining do help here.

In longer term it would be cool if inliner was able to work out as much as
possible himself w/o frontend help.
The first item you mention is something backend can do at its own (and it
already knows how to benefit many arguments, but it proably does not do it
enough to make difference for fortran). I am just about commit patch that makes
backend by hair more sensitive on this.

The second item is interesting - it would be cool if backend was able to work
out that the code is supposed to simplify after inlining. Either by itself or
by frontend hint.
Can you provide me very simple testcase for that I can look into how it looks
like in backend?  Perhaps some kind of frontend hinting would work well here.

Honza


[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-04-17 Thread tkoenig at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #4 from Thomas Koenig tkoenig at gcc dot gnu.org 2011-04-17 
13:32:11 UTC ---
(In reply to comment #3)

 The second item is interesting - it would be cool if backend was able to work
 out that the code is supposed to simplify after inlining. Either by itself or
 by frontend hint.
 Can you provide me very simple testcase for that I can look into how it looks
 like in backend?  Perhaps some kind of frontend hinting would work well here.

Here is some sample code (extreme, I admit) which profits a lot from
inlining:

- Strides are known to be one when inlining (a common case, but you can
  never be sure if the user doesn't call a(1:5:2))

- Expensive setting up of, and reading from the array descriptor

- Loops can be completely unrolled

module foo
  implicit none
contains
  subroutine bar(a,x)
real, dimension(:,:), intent(in) :: a
real, intent(out) :: x
integer :: i,j

x = 0
do j=1,ubound(a,2)
   do i=1,ubound(a,1)
  x = x + a(i,j)**2
   end do
end do
  end subroutine bar
end module foo

program main
  use foo
  implicit none
  real, dimension(2,3) :: a
  real :: x

  data a /1.0, 2.0, 3.0, -1.0, -2.0, -3.0/

  call bar(a,x)
  print *,x
end program main


[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-04-17 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #5 from Dominique d'Humieres dominiq at lps dot ens.fr 2011-04-17 
14:12:30 UTC ---
I have investigated why test_fpu is slower with --param
max-inline-insns-auto=400 (11.18s) compared to -finline-limit=600 (10.84s) in
the timings of comment #2. This is due to the inlining of dgemm in the fourth
test Lapack 2:

[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays --param
max-inline-insns-auto=385 test_lap.f90
[macbook] lin/test% time a.out
  Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts  2.6 sec  Err= 0.250
 total =  2.6 sec

2.824u 0.081s 0:02.90 100.0%0+0k 0+0io 0pf+0w
[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays --param
max-inline-insns-auto=386 test_lap.f90
[macbook] lin/test% time a.out
  Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts  3.0 sec  Err= 0.250
 total =  3.0 sec

3.214u 0.082s 0:03.29 100.0%0+0k 0+0io 0pf+0w

Looking at the assembly, I see 'call_dgemm_' three times for 385 and none
for 386 (note there are only two calls in the code one in dgetri always inlined
and one in dgetrf not inlined). It would be interesting to understand why
inlining dgemm slows down the code.


[Bug fortran/48636] Enable more inlining with -O2 and higher

2011-04-16 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2011.04.16 09:59:01
 Ever Confirmed|0   |1

--- Comment #1 from Richard Guenther rguenth at gcc dot gnu.org 2011-04-16 
09:59:01 UTC ---
Changing params from the Frontend won't work for LTO.  Declaring some or all
functions inline would work.

See also http://gcc.gnu.org/ml/gcc-patches/2011-02/msg00973.html and
discussion (the patch probably doesn't apply anymore).  Thus, I'd like to
address this in the middle-end, but if there are suitable heuristics on
when to set DECL_DECLARED_INLINE for the frontend by all means go ahead.