[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto

2019-09-05 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578

--- Comment #10 from Richard Biener  ---
Author: rguenth
Date: Thu Sep  5 12:11:52 2019
New Revision: 275405

URL: https://gcc.gnu.org/viewcvs?rev=275405=gcc=rev
Log:
2019-09-05  Richard Biener  

* lto-streamer.h (LTO_minor_version): Bump.

Backport from mainline
2019-05-06  Richard Biener  

PR tree-optimization/90328
* tree-data-ref.h (dr_may_alias_p): Pass in the actual loop nest.
* tree-data-ref.c (dr_may_alias_p): Check whether the clique
is valid in the loop nest before using it.
(initialize_data_dependence_relation): Adjust.
* graphite-scop-detection.c (build_alias_set): Pass the SCOP enclosing
loop as loop-nest to dr_may_alias_p.

* gcc.dg/torture/pr90328.c: New testcase.

2019-03-08  Richard Biener  

PR middle-end/89578
* cfgloop.h (struct loop): Add owned_clique field.
* cfgloopmanip.c (copy_loop_info): Copy it.
* tree-cfg.c (gimple_duplicate_bb): Do not remap owned_clique
cliques.
* tree-inline.c (copy_loops): Remap owned_clique.
* lto-streamer-in.c (input_cfg): Stream owned_clique.
* lto-streamer-out.c (output_cfg): Likewise.

2019-02-22  Richard Biener  

PR tree-optimization/87609
* tree-cfg.c (gimple_duplicate_bb): Only remap inlined cliques.

2019-02-22  Richard Biener  

PR middle-end/87609
* cfghooks.h (dependence_hash): New typedef.
(struct copy_bb_data): New type.
(cfg_hooks::duplicate_block): Adjust to take a copy_bb_data argument.
(duplicate_block): Likewise.
* cfghooks.c (duplicate_block): Pass down copy_bb_data.
(copy_bbs): Create and pass down copy_bb_data.
* cfgrtl.c (cfg_layout_duplicate_bb): Adjust.
(rtl_duplicate_bb): Likewise.
* tree-cfg.c (gimple_duplicate_bb): If the copy_bb_data arg is not NULL
remap dependence info.

* gcc.dg/torture/restrict-7.c: New testcase.

2019-02-22  Richard Biener  

PR tree-optimization/87609
* tree-core.h (tree_base): Document special clique values.
* tree-inline.c (remap_dependence_clique): Do not use the
special clique value of one.
(maybe_set_dependence_info): Use clique one.
(clear_dependence_clique): New callback.
(compute_dependence_clique): Clear clique one from all refs
before assigning it (again).

Added:
branches/gcc-7-branch/gcc/testsuite/gcc.dg/torture/pr90328.c
branches/gcc-7-branch/gcc/testsuite/gcc.dg/torture/restrict-7.c
Modified:
branches/gcc-7-branch/gcc/ChangeLog
branches/gcc-7-branch/gcc/cfghooks.c
branches/gcc-7-branch/gcc/cfghooks.h
branches/gcc-7-branch/gcc/cfgloop.h
branches/gcc-7-branch/gcc/cfgloopmanip.c
branches/gcc-7-branch/gcc/cfgrtl.c
branches/gcc-7-branch/gcc/graphite-scop-detection.c
branches/gcc-7-branch/gcc/lto-streamer-in.c
branches/gcc-7-branch/gcc/lto-streamer-out.c
branches/gcc-7-branch/gcc/lto-streamer.h
branches/gcc-7-branch/gcc/testsuite/ChangeLog
branches/gcc-7-branch/gcc/tree-cfg.c
branches/gcc-7-branch/gcc/tree-core.h
branches/gcc-7-branch/gcc/tree-data-ref.c
branches/gcc-7-branch/gcc/tree-data-ref.h
branches/gcc-7-branch/gcc/tree-inline.c
branches/gcc-7-branch/gcc/tree-ssa-structalias.c

[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto

2019-08-30 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578

--- Comment #9 from Richard Biener  ---
Author: rguenth
Date: Fri Aug 30 11:11:01 2019
New Revision: 275069

URL: https://gcc.gnu.org/viewcvs?rev=275069=gcc=rev
Log:
2019-08-30  Richard Biener  

* lto-streamer.h (LTO_minor_version): Bump.

Backport from mainline
2019-05-06  Richard Biener  

PR tree-optimization/90328
* tree-data-ref.h (dr_may_alias_p): Pass in the actual loop nest.
* tree-data-ref.c (dr_may_alias_p): Check whether the clique
is valid in the loop nest before using it.
(initialize_data_dependence_relation): Adjust.
* graphite-scop-detection.c (build_alias_set): Pass the SCOP enclosing
loop as loop-nest to dr_may_alias_p.

* gcc.dg/torture/pr90328.c: New testcase.

2019-03-08  Richard Biener  

PR middle-end/89578
* cfgloop.h (struct loop): Add owned_clique field.
* cfgloopmanip.c (copy_loop_info): Copy it.
* tree-cfg.c (gimple_duplicate_bb): Do not remap owned_clique
cliques.
* tree-inline.c (copy_loops): Remap owned_clique.
* lto-streamer-in.c (input_cfg): Stream owned_clique.
* lto-streamer-out.c (output_cfg): Likewise.

2019-02-22  Richard Biener  

PR tree-optimization/87609
* tree-cfg.c (gimple_duplicate_bb): Only remap inlined cliques.

2019-02-22  Richard Biener  

PR middle-end/87609
* cfghooks.h (dependence_hash): New typedef.
(struct copy_bb_data): New type.
(cfg_hooks::duplicate_block): Adjust to take a copy_bb_data argument.
(duplicate_block): Likewise.
* cfghooks.c (duplicate_block): Pass down copy_bb_data.
(copy_bbs): Create and pass down copy_bb_data.
* cfgrtl.c (cfg_layout_duplicate_bb): Adjust.
(rtl_duplicate_bb): Likewise.
* tree-cfg.c (gimple_duplicate_bb): If the copy_bb_data arg is not NULL
remap dependence info.

* gcc.dg/torture/restrict-7.c: New testcase.

2019-02-22  Richard Biener  

PR tree-optimization/87609
* tree-core.h (tree_base): Document special clique values.
* tree-inline.c (remap_dependence_clique): Do not use the
special clique value of one.
(maybe_set_dependence_info): Use clique one.
(clear_dependence_clique): New callback.
(compute_dependence_clique): Clear clique one from all refs
before assigning it (again).

Added:
branches/gcc-8-branch/gcc/testsuite/gcc.dg/torture/pr90328.c
branches/gcc-8-branch/gcc/testsuite/gcc.dg/torture/restrict-7.c
Modified:
branches/gcc-8-branch/gcc/ChangeLog
branches/gcc-8-branch/gcc/cfghooks.c
branches/gcc-8-branch/gcc/cfghooks.h
branches/gcc-8-branch/gcc/cfgloop.h
branches/gcc-8-branch/gcc/cfgloopmanip.c
branches/gcc-8-branch/gcc/cfgrtl.c
branches/gcc-8-branch/gcc/graphite-scop-detection.c
branches/gcc-8-branch/gcc/lto-streamer-in.c
branches/gcc-8-branch/gcc/lto-streamer-out.c
branches/gcc-8-branch/gcc/lto-streamer.h
branches/gcc-8-branch/gcc/testsuite/ChangeLog
branches/gcc-8-branch/gcc/tree-cfg.c
branches/gcc-8-branch/gcc/tree-core.h
branches/gcc-8-branch/gcc/tree-data-ref.c
branches/gcc-8-branch/gcc/tree-data-ref.h
branches/gcc-8-branch/gcc/tree-inline.c
branches/gcc-8-branch/gcc/tree-ssa-structalias.c

[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto

2019-03-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578

--- Comment #8 from Richard Biener  ---
Author: rguenth
Date: Fri Mar  8 10:20:12 2019
New Revision: 269484

URL: https://gcc.gnu.org/viewcvs?rev=269484=gcc=rev
Log:
2019-03-08  Richard Biener  

PR middle-end/89578
* cfgloop.h (struct loop): Add owned_clique field.
* cfgloopmanip.c (copy_loop_info): Copy it.
* tree-cfg.c (gimple_duplicate_bb): Do not remap owned_clique
cliques.
* tree-inline.c (copy_loops): Remap owned_clique.
* lto-streamer-in.c (input_cfg): Stream owned_clique.
* lto-streamer-out.c (output_cfg): Likewise.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/cfgloop.h
trunk/gcc/cfgloopmanip.c
trunk/gcc/lto-streamer-in.c
trunk/gcc/lto-streamer-out.c
trunk/gcc/tree-cfg.c
trunk/gcc/tree-inline.c

[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto

2019-03-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Richard Biener  ---
Fixed.

[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto

2019-03-07 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578

Richard Biener  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #6 from Richard Biener  ---
One difference that is clearly visible is missed vectorization of

 module_small_step_em.fppized.f90:1399:14: note:  LOOP VECTORIZED
 module_small_step_em.fppized.f90:1376:14: note:  LOOP VECTORIZED
-module_small_step_em.fppized.f90:1376:14: note:  LOOP VECTORIZED
 module_small_step_em.fppized.f90:1354:14: note:  LOOP VECTORIZED

This is the following inner loop which we vectorize before the change but not
after.  Before we have

create runtime check for data references MEM[(float[0:D.14326] *)_891][_13070]
and MEM[(float[0:D.14326] *)_891][_4424]
module_small_step_em.fppized.f90:1376:14: note:  created 1 versioning for alias
checks.
module_small_step_em.fppized.f90:1376:14: optimized:  loop versioned for
vectorization because of possible aliasing

where afterwards

module_small_step_em.fppized.f90:1376:14: missed:   number of versioning for
alias run-time tests exceeds 10 (--param vect-max-version-for-alias-checks)

SUBROUTINE advance_w( w, rw_tend, ww, u, v,   &
...
! Jammed 3 doubly nested loops over k/i into 1 for slight improvement
! in efficiency.  No change in results (bit-for-bit). JM 20040514
! (left a blank line where the other two k/i-loops were)
!
DO k=2,k_end
  DO i=i_start, i_end
w(i,k,j)=w(i,k,j)+dts*rw_tend(i,k,j)   &
 + msft_inv(i)*cqw(i,k,j)*(&
+.5*dts*g*mut_inv(i)*rdn(k)*   &
 (c2a(i,k  ,j)*rdnw(k  )   &
*((1.+epssm)*(rhs(i,k+1  )-rhs(i,k))   &
 +(1.-epssm)*(ph(i,k+1,j)-ph(i,k  ,j)))&
 -c2a(i,k-1,j)*rdnw(k-1)   &
*((1.+epssm)*(rhs(i,k)-rhs(i,k-1  ))   &
 +(1.-epssm)*(ph(i,k  ,j)-ph(i,k-1,j)  &

+dts*g*msft_inv(i)*(rdn(k)*&
 (c2a(i,k  ,j)*alt(i,k  ,j)*t_2ave(i,k  ,j)&
 -c2a(i,k-1,j)*alt(i,k-1,j)*t_2ave(i,k-1,j))   &
   +(rdn(k)*(c2a(i,k  ,j)*alb(i,k  ,j) &
-c2a(i,k-1,j)*alb(i,k-1,j))*mut_inv(i)-1.) &
 *muave(i,j))
  ENDDO
ENDDO


There is almost no difference in the IL after if-conversion but changes
like

-  _14697 = MEM[(float[0:D.14440] *)_959 clique 40 base 47][_15604];
+  _14697 = MEM[(float[0:D.14440] *)_959 clique 71 base 47][_15604];
   _9355 = _14697 * 4.905000209808349609375e+0;
-  _9371 = MEM[(float[0:D.14436] *)_957 clique 40 base 28][_15604];
+  _9371 = MEM[(float[0:D.14436] *)_957 clique 71 base 28][_15604];

which is introduced by unroll-and-jam by means of unrolling the outer loop
once.  The restrict clique remapping is strictly speaking required here.

There are of course cases where remapping can be elided but they are
difficult to reconstruct.

Basically each clique > 1 belongs to a specific inline block and whether
we need to remap it when copying a stmt depends on whether we duplicate
"within" that block or we duplicate the block itself.

A good "hint" would be (for example for unroll-and-jam) whether cliques
mentioned in the to be copied loop are also mentioned in stmts belonging
exclusively to the outer loop that is unrolled (but then they could have
been moved there by invariant motion for example).

I'm not sure if we want to look at the BLOCK tree.  Or whether we want to
build a similar thing into the loop structure, noting inline contexts
(the single clique that needs not remapping when copying the loop).

Note I haven't been able to verify whether the above loop is executed
at all but -fno-loop-unroll-and-jam makes performance equal (and faster
than ever...).

[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto

2019-03-06 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578

--- Comment #5 from Richard Biener  ---
So it looks like the change from r269097 to r269098 is only 2.5% (I've included
the followup fix ontop of r269098):

481.wrf 11170245   45.7 *   11170251   44.5 S
481.wrf 11170246   45.4 S   11170251   44.5 *

Checking r269096 reveals (BASE = r269096, PEAK = r269098 + followup):

481.wrf 11170244   45.7 S   11170254   44.0 S
481.wrf 11170244   45.8 *   11170252   44.3 *

Note that fortran guarantees on parameter aliasing are stronger than
those of restrict qualified pointers so using restrict isn't perfect
and it might suffer from the correctness fix unnecessarily.  In a
similar fashion performing inlining might make PTA do more conservative
choices than when looking at the fnspec guarantees.

For r269096 there's the possibility of simply never recomputing restrict
(gate it with !cfun->after_inlining).

Overall I see

   9.90%201223  wrf_peak.amd64-  wrf_peak.amd64-m64-gcc42-nn  [.]
solve_interface_
   8.50%172668  wrf_base.amd64-  wrf_base.amd64-m64-gcc42-nn  [.]
solve_interface_

which explains why we only see this with -flto (this function does nothing
but call other functions...).  It doesn't really explain why r269096 makes
such a big difference though (I guess it must be the PRE PTA recompute
triggering this).

But the function is a huge mess and the profile quite flat and similar...

[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto

2019-03-06 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot 
gnu.org

--- Comment #4 from Richard Biener  ---
Investigating, but I guess there's nothing to fix :/

[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto

2019-03-04 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578

Martin Liška  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2019-03-04
 Ever confirmed|0   |1

--- Comment #3 from Martin Liška  ---
(In reply to Richard Biener from comment #2)
> Suspicious changes include the fix for PR87609 and the new
> pass_remove_partial_avx_dependency.  I see no other relevant changes.

Yes, started with r269098.

[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto

2019-03-04 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578

Richard Biener  changed:

   What|Removed |Added

   Keywords|needs-bisection |
 Status|ASSIGNED|UNCONFIRMED
   Last reconfirmed|2019-03-04 00:00:00 |
 CC||hjl.tools at gmail dot com
   Assignee|marxin at gcc dot gnu.org  |unassigned at gcc dot 
gnu.org
   Target Milestone|--- |9.0
 Ever confirmed|1   |0

--- Comment #2 from Richard Biener  ---
Suspicious changes include the fix for PR87609 and the new
pass_remove_partial_avx_dependency.  I see no other relevant changes.

[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto

2019-03-04 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578

Martin Liška  changed:

   What|Removed |Added

   Keywords||needs-bisection
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2019-03-04
 CC||marxin at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |marxin at gcc dot 
gnu.org
 Ever confirmed|0   |1

--- Comment #1 from Martin Liška  ---
Also seen here:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=2.270.0=7.270.0=4.270.0=8.270.0

I'll bisect that.