[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #32 from Richard Biener ---
(In reply to Jakub Jelinek from comment #30)
> I didn't close it because I wanted to see updated benchmark numbers. Either
> I'll grab the benchmark, or if somebody else posts the latest numbers, we
> can close it or keep open depending on that.

gcc7 -O3:    LU Mflops: 5444.74
gcc7 -Ofast: LU Mflops: 5385.51
gcc6 -O3:    LU Mflops: 5515.91
gcc6 -Ofast: LU Mflops: 5487.94

so there's a <2% regression remaining (noise level is ~0.5%).
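For readers who want to check the "<2%" figure, here is a minimal sketch of the arithmetic. The helper name `regression_percent` is made up for this illustration; it only restates the usual relative-difference formula.

```cpp
// Relative slowdown of `now_mflops` versus `baseline_mflops`, in percent.
// Positive means the newer build is slower. Hypothetical helper, not part
// of SciMark2 or GCC.
double regression_percent(double baseline_mflops, double now_mflops)
{
    return (baseline_mflops - now_mflops) / baseline_mflops * 100.0;
}
```

With the numbers above, `regression_percent(5515.91, 5444.74)` comes out at roughly 1.3 for -O3 and `regression_percent(5487.94, 5385.51)` at roughly 1.9 for -Ofast, both above the quoted ~0.5% noise level.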
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #31 from Richard Biener ---
Yep, looks fixed on the tester.
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #30 from Jakub Jelinek ---
I didn't close it because I wanted to see updated benchmark numbers. Either
I'll grab the benchmark, or if somebody else posts the latest numbers, we
can close it or keep open depending on that.
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

Jeffrey A. Law changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at redhat dot com

--- Comment #29 from Jeffrey A. Law ---
Jakub's fix addresses the last remaining issue IIUC. Should we close this out?
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #28 from Jakub Jelinek ---
Author: jakub
Date: Wed Apr 12 18:09:47 2017
New Revision: 246882

URL: https://gcc.gnu.org/viewcvs?rev=246882&root=gcc&view=rev
Log:
        PR tree-optimization/79390
        * optabs.c (emit_conditional_move): If the preferred op2/op3 operand
        order does not result in usable sequence, retry with reversed operand
        order.

        * gcc.target/i386/pr70465-2.c: Xfail the scan-assembler-not test.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/optabs.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/pr70465-2.c
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #27 from rguenther at suse dot de ---
On Wed, 12 Apr 2017, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390
>
> --- Comment #26 from Jakub Jelinek ---
> Created attachment 41189
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41189&action=edit
> gcc7-pr79390-ecm.patch
>
> Untested fix for the -O3 -ffast-math -march=haswell case.
> The difference between -fno-fast-math and -ffast-math is in:
>   if (swap_commutative_operands_p (op2, op3)
>       && ((reversed = reversed_comparison_code_parts (code, op0, op1, NULL))
>           != UNKNOWN))
>     {
>       std::swap (op2, op3);
>       code = reversed;
>     }
>
> swap_commutative_operands_p is true in both cases, but without
> -ffast-math reversed_comparison_code_parts fails (returns UNKNOWN), so we
> don't try that order and succeed, while with -ffast-math it doesn't fail,
> returns LE, but we reject it in the predicates of the cmov insn and thus
> don't emit anything. This patch just retries with the other order of
> operands in that case.

Looks sensible.
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #26 from Jakub Jelinek ---
Created attachment 41189
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41189&action=edit
gcc7-pr79390-ecm.patch

Untested fix for the -O3 -ffast-math -march=haswell case.
The difference between -fno-fast-math and -ffast-math is in:
  if (swap_commutative_operands_p (op2, op3)
      && ((reversed = reversed_comparison_code_parts (code, op0, op1, NULL))
          != UNKNOWN))
    {
      std::swap (op2, op3);
      code = reversed;
    }

swap_commutative_operands_p is true in both cases, but without
-ffast-math reversed_comparison_code_parts fails (returns UNKNOWN), so we
don't try that order and succeed, while with -ffast-math it doesn't fail,
returns LE, but we reject it in the predicates of the cmov insn and thus
don't emit anything. This patch just retries with the other order of
operands in that case.
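The retry described in comment #26 can be pictured with a small standalone model. This is only a sketch under stated assumptions: `Cmp`, `Emitted`, `reverse` and `emit_cmove` are hypothetical names invented here, the `can_reverse` flag stands in for whether reversed_comparison_code_parts succeeds, and `insn_accepts` stands in for the cmov insn predicates. None of it is the code in optabs.c.

```cpp
// Toy model only: Cmp, Emitted, reverse and emit_cmove are hypothetical
// names for this illustration, not GCC types or functions.
enum class Cmp { LT, LE, GT, GE, Unknown };

// Logical negation of a comparison, e.g. !(a > b) is (a <= b).
Cmp reverse(Cmp c)
{
    switch (c) {
    case Cmp::LT: return Cmp::GE;
    case Cmp::LE: return Cmp::GT;
    case Cmp::GT: return Cmp::LE;
    case Cmp::GE: return Cmp::LT;
    default:      return Cmp::Unknown;
    }
}

struct Emitted {
    bool ok;       // false models "no conditional move was emitted"
    Cmp code;
    int if_true;   // models op2
    int if_false;  // models op3
};

// Try the preferred operand order first; if the (modelled) insn predicate
// rejects it and the comparison is reversible, retry with the other order.
Emitted emit_cmove(Cmp code, int op2, int op3, bool prefer_swap,
                   bool can_reverse, bool (*insn_accepts)(Cmp))
{
    const Cmp rev = can_reverse ? reverse(code) : Cmp::Unknown;
    bool swapped = prefer_swap && rev != Cmp::Unknown;
    for (int attempt = 0; attempt < 2; ++attempt) {
        const Emitted cand = swapped ? Emitted{true, rev, op3, op2}
                                     : Emitted{true, code, op2, op3};
        if (insn_accepts(cand.code))
            return cand;
        if (rev == Cmp::Unknown)
            break;          // the other order cannot be expressed
        swapped = !swapped; // retry with the other operand order
    }
    return Emitted{false, Cmp::Unknown, op2, op3};
}
```

In this model, a predicate that rejects `Cmp::LE` with `can_reverse` set mirrors the -ffast-math case: the swapped attempt fails and the second attempt with the original `GT` order succeeds. With `can_reverse` cleared, the swapped order is never tried, mirroring the -fno-fast-math case.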
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P1                          |P2
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #25 from Richard Biener ---
So the original report is fixed (-O3 -march=native). But adding -ffast-math
still ends up regressing. At this point it's probably appropriate to
re-target to GCC 8.
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #24 from Richard Biener ---
Author: rguenth
Date: Wed Apr 12 09:41:02 2017
New Revision: 246869

URL: https://gcc.gnu.org/viewcvs?rev=246869&root=gcc&view=rev
Log:
2017-04-12  Richard Biener

        PR tree-optimization/79390
        * gimple-ssa-split-paths.c (is_feasible_trace): Restrict
        threading case even more.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimple-ssa-split-paths.c
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #23 from Richard Biener ---
(In reply to Richard Biener from comment #22)
> (In reply to rguent...@suse.de from comment #21)
> > On April 7, 2017 6:57:13 PM GMT+02:00, "jakub at gcc dot gnu.org"
> > wrote:
> > >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390
> > >
> > >--- Comment #20 from Jakub Jelinek ---
> > >So, Richard, any thoughts on what can be done split paths to avoid
> > >this?
> >
> > Invent some new heuristic that avoids splitting this case...
>
> Index: gcc/gimple-ssa-split-paths.c
> ===
> --- gcc/gimple-ssa-split-paths.c        (revision 246803)
> +++ gcc/gimple-ssa-split-paths.c        (working copy)
> @@ -249,13 +249,17 @@ is_feasible_trace (basic_block bb)
>                   imm_use_iterator iter2;
>                   FOR_EACH_IMM_USE_FAST (use2_p, iter2, gimple_phi_result (stmt))
>                     {
> -                     if (is_gimple_debug (USE_STMT (use2_p)))
> +                     gimple *use_stmt = USE_STMT (use2_p);
> +                     if (is_gimple_debug (use_stmt))
>                         continue;
> -                     basic_block use_bb = gimple_bb (USE_STMT (use2_p));
> +                     basic_block use_bb = gimple_bb (use_stmt);
>                       if (use_bb != bb
>                           && dominated_by_p (CDI_DOMINATORS, bb, use_bb))
>                         {
> -                         found_useful_phi = true;
> +                         if (gcond *cond = dyn_cast <gcond *> (use_stmt))
> +                           if (gimple_cond_code (cond) == EQ_EXPR
> +                               || gimple_cond_code (cond) == NE_EXPR)
> +                             found_useful_phi = true;
>                           break;
>                         }
>                     }
>
> avoids the splitting and at least passes tree-ssa.exp testing. Throwing it
> on full testing (there are some path splitting testcases randomly placed
> IIRC).

Bootstrap / regtest went ok.

With this and -O3 -march=native (on a broadwell CPU) I get

gcc6 -O3 -march=native: 5469.25 Mflops
gcc7 -O3 -march=native: 5439.39 Mflops

but note that with -Ofast -march=native the situation is still bad
(-fno-split-paths doesn't help but -ftree-loop-if-convert does):

gcc6 -Ofast -march=native: 5500.51 Mflops
gcc7 -Ofast -march=native: 4765.56 Mflops
gcc7 -Ofast -march=native -ftree-loop-if-convert: 5335.49 Mflops

Shall I go for the split-path fix for the moment?
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #22 from Richard Biener ---
(In reply to rguent...@suse.de from comment #21)
> On April 7, 2017 6:57:13 PM GMT+02:00, "jakub at gcc dot gnu.org"
> wrote:
> >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390
> >
> >--- Comment #20 from Jakub Jelinek ---
> >So, Richard, any thoughts on what can be done split paths to avoid
> >this?
>
> Invent some new heuristic that avoids splitting this case...

Index: gcc/gimple-ssa-split-paths.c
===
--- gcc/gimple-ssa-split-paths.c        (revision 246803)
+++ gcc/gimple-ssa-split-paths.c        (working copy)
@@ -249,13 +249,17 @@ is_feasible_trace (basic_block bb)
                  imm_use_iterator iter2;
                  FOR_EACH_IMM_USE_FAST (use2_p, iter2, gimple_phi_result (stmt))
                    {
-                     if (is_gimple_debug (USE_STMT (use2_p)))
+                     gimple *use_stmt = USE_STMT (use2_p);
+                     if (is_gimple_debug (use_stmt))
                        continue;
-                     basic_block use_bb = gimple_bb (USE_STMT (use2_p));
+                     basic_block use_bb = gimple_bb (use_stmt);
                      if (use_bb != bb
                          && dominated_by_p (CDI_DOMINATORS, bb, use_bb))
                        {
-                         found_useful_phi = true;
+                         if (gcond *cond = dyn_cast <gcond *> (use_stmt))
+                           if (gimple_cond_code (cond) == EQ_EXPR
+                               || gimple_cond_code (cond) == NE_EXPR)
+                             found_useful_phi = true;
                          break;
                        }
                    }

avoids the splitting and at least passes tree-ssa.exp testing. Throwing it
on full testing (there are some path splitting testcases randomly placed
IIRC).
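To make the effect of the heuristic in the patch easier to see, here is a simplified standalone model of the changed test. It is a sketch, not GCC code: `StmtKind`, `CondCode`, `Use` and `found_useful_phi` are hypothetical stand-ins, and the single flag `in_dominating_block` collapses the `use_bb != bb && dominated_by_p (...)` check into one boolean.

```cpp
#include <vector>

// Simplified, hypothetical model of the uses of a PHI result.
enum class StmtKind { Debug, Assign, Cond };
enum class CondCode { EQ, NE, LT, GT, None };

struct Use {
    StmtKind kind;
    CondCode code;             // meaningful only when kind == Cond
    bool in_dominating_block;  // models: use_bb != bb && bb dominated by use_bb
};

// Patched rule: the first non-debug use in a dominating block decides, and
// it only counts as a threading opportunity when it is an ==/!= condition.
bool found_useful_phi(const std::vector<Use>& uses)
{
    bool found = false;
    for (const Use& u : uses) {
        if (u.kind == StmtKind::Debug)
            continue;
        if (u.in_dominating_block) {
            if (u.kind == StmtKind::Cond
                && (u.code == CondCode::EQ || u.code == CondCode::NE))
                found = true;
            break;  // as in the patch, stop at the first such use
        }
    }
    return found;
}
```

Under this model a PHI whose dominating use is an ordered comparison (`CondCode::GT`, the shape of a "new value larger than current maximum" test) is no longer treated as useful, whereas before the patch any such use set the flag.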
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #21 from rguenther at suse dot de ---
On April 7, 2017 6:57:13 PM GMT+02:00, "jakub at gcc dot gnu.org" wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390
>
>--- Comment #20 from Jakub Jelinek ---
>So, Richard, any thoughts on what can be done split paths to avoid
>this?

Invent some new heuristic that avoids splitting this case...
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #20 from Jakub Jelinek ---
So, Richard, any thoughts on what can be done split paths to avoid this?
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #19 from vincenzo Innocente ---
Could you please also have a look at C++ and LTO? This is what I get on my
Skylake; with C++ or LTO, -fno-split-paths pessimizes:

[innocent@vinavx3 scimark2TMP]$ gcc -march=native -Wall -Ofast *.c -lm ; ./a.out | grep LU
LU Mflops: 5920.14 (M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ gcc -march=native -Wall -Ofast *.c -lm -fno-split-paths ; ./a.out | grep LU
LU Mflops: 6136.33 (M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ gcc -march=native -Wall -Ofast *.c -lm -flto ; ./a.out | grep LU
LU Mflops: 5809.93 (M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ gcc -march=native -Wall -Ofast *.c -lm -flto -fno-split-paths ; ./a.out | grep LU
LU Mflops: 5630.24 (M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ c++ -march=native -Wall -Ofast *.c -lm ; ./a.out | grep LU
LU Mflops: 6001.47 (M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ c++ -march=native -Wall -Ofast *.c -lm -fno-split-paths ; ./a.out | grep LU
LU Mflops: 5920.14 (M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ c++ -march=native -Wall -Ofast *.c -lm -flto; ./a.out | grep LU
LU Mflops: 5434.16 (M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ c++ -march=native -Wall -Ofast *.c -lm -flto -fno-split-paths ; ./a.out | grep LU
LU Mflops: 5434.16 (M=100, N=100)
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #18 from rguenther at suse dot de ---
On Fri, 7 Apr 2017, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390
>
> --- Comment #16 from Jakub Jelinek ---
> Has somebody the benchmark around to retry with current trunk, with
> -f{,no-}split-paths and compare that to some older trunk and gcc6?

On a broadwell machine I get (-O3 -march=native)

gcc6: 5507.42 Mflops
gcc7: 4787.26 Mflops
gcc7: 5435.08 Mflops [-fno-split-paths]

so the RTL if-conversion works now unless inhibited by path splitting.

What path splitting does is mostly undone by loop disambiguation which
re-creates the merger so path splitting just made the loop multiple
exit (without simplifying the duplicated exit condition).

So we can add more heuristics to tame down loop splitting, for example
never duplicating a joiner that has an exit. Or adding to the quite
stupid if-cvt mitigation code (missing the minmax case). Or add even
more outs to the threading opportunity detection code...

We currently find that

  t_175 = PHI

in the merger exposes a threading opportunity because it has one arg that
is unchanged over the latch (t_184 over 6->8) and it has a use in the
threading destination (in the controlling condition even).

This all just exposes that path splitting is not well integrated into
what it tries to expose (threading). IMHO it should have been part of
backwards/forward threading. But that ship has sailed (Jeff approved it).
I've tried to fixup after the MIA authors. But well. I can fixup by
removing the pass again. Or adding more oddball heuristics.

This case which seems important for x86_64 is

  for (i=j+1; i<M; i++)
    {
      double ab = fabs(A[i][j]);
      if ( ab > t)
        {
          jp = i;
          t = ab;
        }
    }

so reducing MAX plus remembering the index of the maximum value. We're
not phiopt-ing that to MAX because it might not be profitable (the
condition has to remain).

So path splitting could be profitable on some archs. IFF we wouldn't
re-create that shared latch right afterwards anyway (and forget to
propagate single-arg PHIs resulting from the BB duplication).
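The loop discussed in comment #18 is a maximum reduction that also remembers the index of the maximum. As an illustration only, here it is in two standalone forms. The function names and the use of a single `std::vector<double>` column are assumptions made for this sketch; SciMark2 itself indexes a 2-D array.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Branchy form: track the largest |value| at or below row j and the row
// where it first occurs. Hypothetical helper, not SciMark2 code.
std::size_t pivot_branchy(const std::vector<double>& col, std::size_t j)
{
    std::size_t jp = j;
    double t = std::fabs(col[j]);
    for (std::size_t i = j + 1; i < col.size(); ++i) {
        double ab = std::fabs(col[i]);
        if (ab > t) {
            jp = i;
            t = ab;
        }
    }
    return jp;
}

// Select form: both updates are conditional expressions on one comparison.
// This is the shape if-conversion aims for; whether a given compiler turns
// either form into conditional moves or a max instruction is up to it.
std::size_t pivot_select(const std::vector<double>& col, std::size_t j)
{
    std::size_t jp = j;
    double t = std::fabs(col[j]);
    for (std::size_t i = j + 1; i < col.size(); ++i) {
        double ab = std::fabs(col[i]);
        bool bigger = ab > t;
        jp = bigger ? i : jp;
        t = bigger ? ab : t;
    }
    return jp;
}
```

Both functions return the same index for any input. The difference is only in control flow: the first has a branch inside the loop body, the second keeps the comparison but has a single straight-line body, which matches the remark that "the condition has to remain".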
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #17 from vincenzo Innocente ---
[innocent@vinavx3 innocent]$ mkdir scimark2TMP
[innocent@vinavx3 innocent]$ cd scimark2TMP
[innocent@vinavx3 scimark2TMP]$ wget http://math.nist.gov/scimark2/scimark2_1c.zip
. .
gcc version 7.0.1 20170407 (experimental) [trunk revision 246752] (GCC)
[innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm
[innocent@vinavx3 scimark2TMP]$ ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov)     **
**                                                              **
Using 2.00 seconds min time per kenel.
Composite Score:         2783.60
FFT             Mflops:  2325.65    (N=1024)
SOR             Mflops:  2260.36    (100 x 100)
MonteCarlo:     Mflops:   829.14
Sparse matmult  Mflops:  2582.70    (N=1000, nz=5000)
LU              Mflops:  5920.14    (M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm -fno-split-paths
[innocent@vinavx3 scimark2TMP]$ ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov)     **
**                                                              **
Using 2.00 seconds min time per kenel.
Composite Score:         2825.86
FFT             Mflops:  2333.43    (N=1024)
SOR             Mflops:  2260.36    (100 x 100)
MonteCarlo:     Mflops:   829.14
Sparse matmult  Mflops:  2570.04    (N=1000, nz=5000)
LU              Mflops:  6136.33    (M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm -fsplit-paths
[innocent@vinavx3 scimark2TMP]$ ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov)     **
**                                                              **
Using 2.00 seconds min time per kenel.
Composite Score:         2787.46
FFT             Mflops:  2325.65    (N=1024)
SOR             Mflops:  2260.36    (100 x 100)
MonteCarlo:     Mflops:   832.36
Sparse matmult  Mflops:  2582.70    (N=1000, nz=5000)
LU              Mflops:  5936.23    (M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ pushd ~/code/s7/C
CMSSW_8_0_22/ CMSSW_9_1_0_pre2/
[innocent@vinavx3 scimark2TMP]$ pushd ~/code/s7/CMSSW_9_1_0_pre2/
~/code/s7/CMSSW_9_1_0_pre2 /tmp/innocent/scimark2TMP
[innocent@vinavx3 CMSSW_9_1_0_pre2]$ cmsenv
[innocent@vinavx3 CMSSW_9_1_0_pre2]$ popd
/tmp/innocent/scimark2TMP
[innocent@vinavx3 scimark2TMP]$ gcc -v
gcc version 6.3.0 (GCC)
[innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm
[innocent@vinavx3 scimark2TMP]$ ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov)     **
**                                                              **
Using 2.00 seconds min time per kenel.
Composite Score:         2820.21
FFT             Mflops:  2325.65    (N=1024)
SOR             Mflops:  2260.36    (100 x 100)
MonteCarlo:     Mflops:   810.37
Sparse matmult  Mflops:  2427.26    (N=1000, nz=5000)
LU              Mflops:  6277.39    (M=100, N=100)
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #16 from Jakub Jelinek ---
Has somebody the benchmark around to retry with current trunk, with
-f{,no-}split-paths and compare that to some older trunk and gcc6?
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #15 from Rainer Orth ---
Author: ro
Date: Thu Apr 6 13:11:21 2017
New Revision: 246729

URL: https://gcc.gnu.org/viewcvs?rev=246729&root=gcc&view=rev
Log:
Fix gcc.target/i386/pr79390.c for Solaris as

        PR tree-optimization/79390
        * gcc.target/i386/pr79390.c: Allow for cmovl.a.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/pr79390.c
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #14 from Jakub Jelinek ---
Author: jakub
Date: Tue Apr 4 17:52:27 2017
New Revision: 246686

URL: https://gcc.gnu.org/viewcvs?rev=246686&root=gcc&view=rev
Log:
        PR tree-optimization/79390
        * target.h (struct noce_if_info): Declare.
        * targhooks.h (default_noce_conversion_profitable_p): Declare.
        * target.def (noce_conversion_profitable_p): New target hook.
        * ifcvt.h (struct noce_if_info): New type, moved from ...
        * ifcvt.c (struct noce_if_info): ... here.
        (noce_conversion_profitable_p): Renamed to ...
        (default_noce_conversion_profitable_p): ... this. No longer
        static nor inline.
        (noce_try_store_flag_constants, noce_try_addcc,
        noce_try_store_flag_mask, noce_try_cmove, noce_try_cmove_arith,
        noce_convert_multiple_sets): Use targetm.noce_conversion_profitable_p
        instead of noce_conversion_profitable_p.
        * config/i386/i386.c: Include ifcvt.h.
        (ix86_option_override_internal): Don't override
        PARAM_MAX_RTL_IF_CONVERSION_INSNS default.
        (ix86_noce_conversion_profitable_p): New function.
        (TARGET_NOCE_CONVERSION_PROFITABLE_P): Redefine.
        * config/i386/x86-tune.def (X86_TUNE_ONE_IF_CONV_INSN): Adjust comment.
        * doc/tm.texi.in (TARGET_NOCE_CONVERSION_PROFITABLE_P): Add.
        * doc/tm.texi: Regenerated.

        * gcc.target/i386/pr79390.c: New test.
        * gcc.dg/ifcvt-4.c: Use -mtune-ctrl=^one_if_conv_insn for i?86/x86_64.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr79390.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/x86-tune.def
    trunk/gcc/doc/tm.texi
    trunk/gcc/doc/tm.texi.in
    trunk/gcc/ifcvt.c
    trunk/gcc/ifcvt.h
    trunk/gcc/target.def
    trunk/gcc/target.h
    trunk/gcc/targhooks.h
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/ifcvt-4.c
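The ChangeLog turns a static profitability check into a target hook with a default implementation that a backend can override. The general pattern can be sketched in miniature as follows. Everything here is hypothetical: the types `Insn`, `IfInfo` and `Target`, the size-based default cost test, and in particular the "at most one conditional move" rule in the override are illustrative assumptions, not the logic or signatures of the functions named in the ChangeLog.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical miniature of an instruction sequence and conversion info.
enum class Insn { Move, CondMove, Arith };

struct IfInfo {
    std::size_t max_insns;  // budget the generic code allows for a conversion
};

// Signature of the hook: decide whether a candidate sequence is worth using.
using ProfitableHook = bool (*)(const std::vector<Insn>&, const IfInfo&);

// Default implementation: a plain budget check (assumed for illustration).
bool default_profitable(const std::vector<Insn>& seq, const IfInfo& info)
{
    return seq.size() <= info.max_insns;
}

// Target override: apply an extra target-specific restriction first, then
// defer to the default. The one-CondMove limit is an assumption.
bool one_cmov_profitable(const std::vector<Insn>& seq, const IfInfo& info)
{
    unsigned cmovs = 0;
    for (Insn insn : seq)
        if (insn == Insn::CondMove && ++cmovs > 1)
            return false;
    return default_profitable(seq, info);
}

// The hook table: generic code calls through the pointer and never needs
// to know which target filled it in.
struct Target {
    ProfitableHook conversion_profitable_p;
};

bool try_if_convert(const Target& target, const std::vector<Insn>& seq,
                    const IfInfo& info)
{
    return target.conversion_profitable_p(seq, info);
}
```

A usage example under the same assumptions: with `Target generic{default_profitable}` and `Target tuned{one_cmov_profitable}`, a two-`CondMove` sequence within budget is accepted by `generic` and rejected by `tuned`, while the calling code in `try_if_convert` is identical for both.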
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #13 from Jakub Jelinek ---
Created attachment 41097
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41097&action=edit
gcc7-pr79390.patch

Untested fix.
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |6.3.1
   Target Milestone|---                         |7.0
            Summary|10% performance drop in     |[7 Regression] 10%
                   |SciMark2 LU after r242550   |performance drop in
                   |                            |SciMark2 LU after r242550

--- Comment #12 from Richard Biener ---
On more recent trunk -fno-split-paths makes only a tiny difference
(4882 vs. 4779 Mflops) while -ftree-loop-if-convert still results in
5432 Mflops. GCC 6 scores 5523 Mflops for me (-O3 -march=native on a
Broadwell CPU).

Marking as regression.