[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578 --- Comment #10 from Richard Biener --- Author: rguenth Date: Thu Sep 5 12:11:52 2019 New Revision: 275405 URL: https://gcc.gnu.org/viewcvs?rev=275405=gcc=rev Log: 2019-09-05 Richard Biener * lto-streamer.h (LTO_minor_version): Bump. Backport from mainline 2019-05-06 Richard Biener PR tree-optimization/90328 * tree-data-ref.h (dr_may_alias_p): Pass in the actual loop nest. * tree-data-ref.c (dr_may_alias_p): Check whether the clique is valid in the loop nest before using it. (initialize_data_dependence_relation): Adjust. * graphite-scop-detection.c (build_alias_set): Pass the SCOP enclosing loop as loop-nest to dr_may_alias_p. * gcc.dg/torture/pr90328.c: New testcase. 2019-03-08 Richard Biener PR middle-end/89578 * cfgloop.h (struct loop): Add owned_clique field. * cfgloopmanip.c (copy_loop_info): Copy it. * tree-cfg.c (gimple_duplicate_bb): Do not remap owned_clique cliques. * tree-inline.c (copy_loops): Remap owned_clique. * lto-streamer-in.c (input_cfg): Stream owned_clique. * lto-streamer-out.c (output_cfg): Likewise. 2019-02-22 Richard Biener PR tree-optimization/87609 * tree-cfg.c (gimple_duplicate_bb): Only remap inlined cliques. 2019-02-22 Richard Biener PR middle-end/87609 * cfghooks.h (dependence_hash): New typedef. (struct copy_bb_data): New type. (cfg_hooks::duplicate_block): Adjust to take a copy_bb_data argument. (duplicate_block): Likewise. * cfghooks.c (duplicate_block): Pass down copy_bb_data. (copy_bbs): Create and pass down copy_bb_data. * cfgrtl.c (cfg_layout_duplicate_bb): Adjust. (rtl_duplicate_bb): Likewise. * tree-cfg.c (gimple_duplicate_bb): If the copy_bb_data arg is not NULL remap dependence info. * gcc.dg/torture/restrict-7.c: New testcase. 2019-02-22 Richard Biener PR tree-optimization/87609 * tree-core.h (tree_base): Document special clique values. * tree-inline.c (remap_dependence_clique): Do not use the special clique value of one. (maybe_set_dependence_info): Use clique one. (clear_dependence_clique): New callback. (compute_dependence_clique): Clear clique one from all refs before assigning it (again). Added: branches/gcc-7-branch/gcc/testsuite/gcc.dg/torture/pr90328.c branches/gcc-7-branch/gcc/testsuite/gcc.dg/torture/restrict-7.c Modified: branches/gcc-7-branch/gcc/ChangeLog branches/gcc-7-branch/gcc/cfghooks.c branches/gcc-7-branch/gcc/cfghooks.h branches/gcc-7-branch/gcc/cfgloop.h branches/gcc-7-branch/gcc/cfgloopmanip.c branches/gcc-7-branch/gcc/cfgrtl.c branches/gcc-7-branch/gcc/graphite-scop-detection.c branches/gcc-7-branch/gcc/lto-streamer-in.c branches/gcc-7-branch/gcc/lto-streamer-out.c branches/gcc-7-branch/gcc/lto-streamer.h branches/gcc-7-branch/gcc/testsuite/ChangeLog branches/gcc-7-branch/gcc/tree-cfg.c branches/gcc-7-branch/gcc/tree-core.h branches/gcc-7-branch/gcc/tree-data-ref.c branches/gcc-7-branch/gcc/tree-data-ref.h branches/gcc-7-branch/gcc/tree-inline.c branches/gcc-7-branch/gcc/tree-ssa-structalias.c
[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578 --- Comment #9 from Richard Biener --- Author: rguenth Date: Fri Aug 30 11:11:01 2019 New Revision: 275069 URL: https://gcc.gnu.org/viewcvs?rev=275069=gcc=rev Log: 2019-08-30 Richard Biener * lto-streamer.h (LTO_minor_version): Bump. Backport from mainline 2019-05-06 Richard Biener PR tree-optimization/90328 * tree-data-ref.h (dr_may_alias_p): Pass in the actual loop nest. * tree-data-ref.c (dr_may_alias_p): Check whether the clique is valid in the loop nest before using it. (initialize_data_dependence_relation): Adjust. * graphite-scop-detection.c (build_alias_set): Pass the SCOP enclosing loop as loop-nest to dr_may_alias_p. * gcc.dg/torture/pr90328.c: New testcase. 2019-03-08 Richard Biener PR middle-end/89578 * cfgloop.h (struct loop): Add owned_clique field. * cfgloopmanip.c (copy_loop_info): Copy it. * tree-cfg.c (gimple_duplicate_bb): Do not remap owned_clique cliques. * tree-inline.c (copy_loops): Remap owned_clique. * lto-streamer-in.c (input_cfg): Stream owned_clique. * lto-streamer-out.c (output_cfg): Likewise. 2019-02-22 Richard Biener PR tree-optimization/87609 * tree-cfg.c (gimple_duplicate_bb): Only remap inlined cliques. 2019-02-22 Richard Biener PR middle-end/87609 * cfghooks.h (dependence_hash): New typedef. (struct copy_bb_data): New type. (cfg_hooks::duplicate_block): Adjust to take a copy_bb_data argument. (duplicate_block): Likewise. * cfghooks.c (duplicate_block): Pass down copy_bb_data. (copy_bbs): Create and pass down copy_bb_data. * cfgrtl.c (cfg_layout_duplicate_bb): Adjust. (rtl_duplicate_bb): Likewise. * tree-cfg.c (gimple_duplicate_bb): If the copy_bb_data arg is not NULL remap dependence info. * gcc.dg/torture/restrict-7.c: New testcase. 2019-02-22 Richard Biener PR tree-optimization/87609 * tree-core.h (tree_base): Document special clique values. * tree-inline.c (remap_dependence_clique): Do not use the special clique value of one. (maybe_set_dependence_info): Use clique one. (clear_dependence_clique): New callback. (compute_dependence_clique): Clear clique one from all refs before assigning it (again). Added: branches/gcc-8-branch/gcc/testsuite/gcc.dg/torture/pr90328.c branches/gcc-8-branch/gcc/testsuite/gcc.dg/torture/restrict-7.c Modified: branches/gcc-8-branch/gcc/ChangeLog branches/gcc-8-branch/gcc/cfghooks.c branches/gcc-8-branch/gcc/cfghooks.h branches/gcc-8-branch/gcc/cfgloop.h branches/gcc-8-branch/gcc/cfgloopmanip.c branches/gcc-8-branch/gcc/cfgrtl.c branches/gcc-8-branch/gcc/graphite-scop-detection.c branches/gcc-8-branch/gcc/lto-streamer-in.c branches/gcc-8-branch/gcc/lto-streamer-out.c branches/gcc-8-branch/gcc/lto-streamer.h branches/gcc-8-branch/gcc/testsuite/ChangeLog branches/gcc-8-branch/gcc/tree-cfg.c branches/gcc-8-branch/gcc/tree-core.h branches/gcc-8-branch/gcc/tree-data-ref.c branches/gcc-8-branch/gcc/tree-data-ref.h branches/gcc-8-branch/gcc/tree-inline.c branches/gcc-8-branch/gcc/tree-ssa-structalias.c
[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578 --- Comment #8 from Richard Biener --- Author: rguenth Date: Fri Mar 8 10:20:12 2019 New Revision: 269484 URL: https://gcc.gnu.org/viewcvs?rev=269484=gcc=rev Log: 2019-03-08 Richard Biener PR middle-end/89578 * cfgloop.h (struct loop): Add owned_clique field. * cfgloopmanip.c (copy_loop_info): Copy it. * tree-cfg.c (gimple_duplicate_bb): Do not remap owned_clique cliques. * tree-inline.c (copy_loops): Remap owned_clique. * lto-streamer-in.c (input_cfg): Stream owned_clique. * lto-streamer-out.c (output_cfg): Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/cfgloop.h trunk/gcc/cfgloopmanip.c trunk/gcc/lto-streamer-in.c trunk/gcc/lto-streamer-out.c trunk/gcc/tree-cfg.c trunk/gcc/tree-inline.c
[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578 Richard Biener changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #7 from Richard Biener --- Fixed.
[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578 Richard Biener changed: What|Removed |Added CC||matz at gcc dot gnu.org --- Comment #6 from Richard Biener --- One difference that is clearly visible is missed vectorization of module_small_step_em.fppized.f90:1399:14: note: LOOP VECTORIZED module_small_step_em.fppized.f90:1376:14: note: LOOP VECTORIZED -module_small_step_em.fppized.f90:1376:14: note: LOOP VECTORIZED module_small_step_em.fppized.f90:1354:14: note: LOOP VECTORIZED This is the following inner loop which we vectorize before the change but not after. Before we have create runtime check for data references MEM[(float[0:D.14326] *)_891][_13070] and MEM[(float[0:D.14326] *)_891][_4424] module_small_step_em.fppized.f90:1376:14: note: created 1 versioning for alias checks. module_small_step_em.fppized.f90:1376:14: optimized: loop versioned for vectorization because of possible aliasing where afterwards module_small_step_em.fppized.f90:1376:14: missed: number of versioning for alias run-time tests exceeds 10 (--param vect-max-version-for-alias-checks) SUBROUTINE advance_w( w, rw_tend, ww, u, v, & ... ! Jammed 3 doubly nested loops over k/i into 1 for slight improvement ! in efficiency. No change in results (bit-for-bit). JM 20040514 ! (left a blank line where the other two k/i-loops were) ! DO k=2,k_end DO i=i_start, i_end w(i,k,j)=w(i,k,j)+dts*rw_tend(i,k,j) & + msft_inv(i)*cqw(i,k,j)*(& +.5*dts*g*mut_inv(i)*rdn(k)* & (c2a(i,k ,j)*rdnw(k ) & *((1.+epssm)*(rhs(i,k+1 )-rhs(i,k)) & +(1.-epssm)*(ph(i,k+1,j)-ph(i,k ,j)))& -c2a(i,k-1,j)*rdnw(k-1) & *((1.+epssm)*(rhs(i,k)-rhs(i,k-1 )) & +(1.-epssm)*(ph(i,k ,j)-ph(i,k-1,j) & +dts*g*msft_inv(i)*(rdn(k)*& (c2a(i,k ,j)*alt(i,k ,j)*t_2ave(i,k ,j)& -c2a(i,k-1,j)*alt(i,k-1,j)*t_2ave(i,k-1,j)) & +(rdn(k)*(c2a(i,k ,j)*alb(i,k ,j) & -c2a(i,k-1,j)*alb(i,k-1,j))*mut_inv(i)-1.) & *muave(i,j)) ENDDO ENDDO There is almost no difference in the IL after if-conversion but changes like - _14697 = MEM[(float[0:D.14440] *)_959 clique 40 base 47][_15604]; + _14697 = MEM[(float[0:D.14440] *)_959 clique 71 base 47][_15604]; _9355 = _14697 * 4.905000209808349609375e+0; - _9371 = MEM[(float[0:D.14436] *)_957 clique 40 base 28][_15604]; + _9371 = MEM[(float[0:D.14436] *)_957 clique 71 base 28][_15604]; which is introduced by unroll-and-jam by means of unrolling the outer loop once. The restrict clique remapping is strictly speaking required here. There are of course cases where remapping can be elided but they are difficult to reconstruct. Basically each clique > 1 belongs to a specific inline block and whether we need to remap it when copying a stmt depends on whether we duplicate "within" that block or we duplicate the block itself. A good "hint" would be (for example for unroll-and-jam) whether cliques mentioned in the to be copied loop are also mentioned in stmts belonging exclusively to the outer loop that is unrolled (but then they could have been moved there by invariant motion for example). I'm not sure if we want to look at the BLOCK tree. Or whether we want to build a similar thing into the loop structure, noting inline contexts (the single clique that needs not remapping when copying the loop). Note I haven't been able to verify whether the above loop is executed at all but -fno-loop-unroll-and-jam makes performance equal (and faster than ever...).
[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578 --- Comment #5 from Richard Biener --- So it looks like the change from r269097 to r269098 is only 2.5% (I've included the followup fix ontop of r269098): 481.wrf 11170245 45.7 * 11170251 44.5 S 481.wrf 11170246 45.4 S 11170251 44.5 * Checking r269096 reveals (BASE = r269096, PEAK = r269098 + followup): 481.wrf 11170244 45.7 S 11170254 44.0 S 481.wrf 11170244 45.8 * 11170252 44.3 * Note that fortran guarantees on parameter aliasing are stronger than those of restrict qualified pointers so using restrict isn't perfect and it might suffer from the correctness fix unnecessarily. In a similar fashion performing inlining might make PTA do more conservative choices than when looking at the fnspec guarantees. For r269096 there's the possibility of simply never recomputing restrict (gate it with !cfun->after_inlining). Overall I see 9.90%201223 wrf_peak.amd64- wrf_peak.amd64-m64-gcc42-nn [.] solve_interface_ 8.50%172668 wrf_base.amd64- wrf_base.amd64-m64-gcc42-nn [.] solve_interface_ which explains why we only see this with -flto (this function does nothing but call other functions...). It doesn't really explain why r269096 makes such a big difference though (I guess it must be the PRE PTA recompute triggering this). But the function is a huge mess and the profile quite flat and similar...
[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578 Richard Biener changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org --- Comment #4 from Richard Biener --- Investigating, but I guess there's nothing to fix :/
[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578 Martin Liška changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2019-03-04 Ever confirmed|0 |1 --- Comment #3 from Martin Liška --- (In reply to Richard Biener from comment #2) > Suspicious changes include the fix for PR87609 and the new > pass_remove_partial_avx_dependency. I see no other relevant changes. Yes, started with r269098.
[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578 Richard Biener changed: What|Removed |Added Keywords|needs-bisection | Status|ASSIGNED|UNCONFIRMED Last reconfirmed|2019-03-04 00:00:00 | CC||hjl.tools at gmail dot com Assignee|marxin at gcc dot gnu.org |unassigned at gcc dot gnu.org Target Milestone|--- |9.0 Ever confirmed|1 |0 --- Comment #2 from Richard Biener --- Suspicious changes include the fix for PR87609 and the new pass_remove_partial_avx_dependency. I see no other relevant changes.
[Bug target/89578] [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578 Martin Liška changed: What|Removed |Added Keywords||needs-bisection Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2019-03-04 CC||marxin at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |marxin at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #1 from Martin Liška --- Also seen here: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=2.270.0=7.270.0=4.270.0=8.270.0 I'll bisect that.