[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #16 from Richard Biener ---
Fixed.
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #15 from Richard Biener ---
Author: rguenth
Date: Thu Mar 30 07:15:39 2017
New Revision: 246583

URL: https://gcc.gnu.org/viewcvs?rev=246583&root=gcc&view=rev
Log:
2017-03-30  Richard Biener

        PR tree-optimization/77498
        * tree-ssa-pre.c (phi_translate_1): Do not allow simplifications
        to non-constants over backedges.

        * gfortran.dg/pr77498.f: New testcase.

Added:
    trunk/gcc/testsuite/gfortran.dg/pr77498.f
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-pre.c
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Ramana Radhakrishnan changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|arm-none-eabi               |
                 CC|                            |ramana at gcc dot gnu.org

--- Comment #14 from Ramana Radhakrishnan ---
I don't think arm is a valid target for this given PR80155 was opened as a
consequence of fixing PR77498.
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Thomas Preud'homme changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|thopre01 at gcc dot gnu.org |

--- Comment #13 from Thomas Preud'homme ---
Ack, thanks Richard. Opened PR80155.
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #12 from rguenther at suse dot de ---
On Wed, 22 Mar 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498
>
> --- Comment #11 from Thomas Preud'homme ---
> (In reply to Thomas Preud'homme from comment #9)
> > Sadly I could not come up with a minimal testcase so far. What I can see
> > from the code is that tree code hoisting increases the live range of some
> > values which then translates into more spilling in reload.
> >
> > As an approximation I'm wondering if the maximum distance (computed in
> > number of blocks traversed) from the definition to the use could be used to
> > limit when the optimization is applied when optimizing for speed.
>
> I finally managed. The bug can be reproduced by building the following for
> arm-none-eabi with -S -O2 -mcpu=cortex-m7 and looking for the push in the
> resulting assembly code.
>
> fn1() {
>   char *a;
>   char b;
>   for (; *a; a++) {
>     if (b)
>       a++;
>     fn2();
>   }
> }
>
> With -O2: r3, r4, r5 and lr are pushed.
> With -O2 -fno-code-hoisting: r4 and lr are pushed only.
>
>
> Similarly for -mcpu=cortex-m0plus:
>
> enum { ENUM1, ENUM2, ENUM3 } a;
> fn1() {
>   char *b;
>   for (; *b && a != ENUM2; b++)
>     switch (a) {
>     case ENUM1: a = ENUM3;
>     }
> }

But that's not caused by r239414 so please open a new bug for this.
(confirmed with a cross)

Transform:

   [85.00%]:
  # a_14 = PHI
  if (b_7(D) != 0)
    goto ; [50.00%]
  else
    goto ; [50.00%]

   [42.50%]:
  goto ; [100.00%]

   [42.50%]:
  a_8 = a_14 + 1;

   [85.00%]:
  # a_2 = PHI
  fn2 ();
  a_10 = a_2 + 1;

to

   [85.00%]:
  # a_14 = PHI
  _4 = a_14 + 1;
  if (b_7(D) != 0)
    goto ; [50.00%]
  else
    goto ; [50.00%]

   [42.50%]:
  _3 = _4 + 1;

   [85.00%]:
  # a_2 = PHI
  # prephitmp_12 = PHI <_4(3), _3(4)>
  fn2 ();

that's because the hoisting (which itself isn't a problem) makes
a_2 + 1 partially redundant over the latch.

We see this issue in related testcases where PRE can compute a constant for the first iteration value of expressions and thus inserts IVs for them. So it's nothing new and a fix would hopefully fix those cases as well.
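
For readers who prefer source over GIMPLE, the transform described in comment #12 can be approximated by hand. This is only a sketch under assumptions: the pointer and flag are passed as parameters instead of being uninitialized locals, fn2 is replaced by a counting stand-in, and the names walk_before and walk_after are made up for illustration. It shows the shape of the code, not what any particular GCC revision emits.

```c
/* Stand-in for the opaque call in the reduced testcase; the counter
   exists only so the two variants can be compared.  */
unsigned fn2_calls;
void fn2 (void) { fn2_calls++; }

/* Shape of the loop before the transform: the loop increment is
   computed after the call.  */
void walk_before (const char *a, char b)
{
  for (; *a; a++)
    {
      if (b)
        a++;
      fn2 ();
    }
}

/* Hand-written approximation of the shape after hoisting + PRE:
   a + 1 is computed ahead of the branch and the next pointer value
   is carried across the call.  */
void walk_after (const char *a, char b)
{
  while (*a)
    {
      const char *next = a + 1;   /* _4 = a_14 + 1 */
      if (b)
        next = next + 1;          /* _3 = _4 + 1 */
      fn2 ();                     /* next (prephitmp_12) is live here */
      a = next;
    }
}
```

Both functions visit the same characters and call fn2 the same number of times. Whether the second form actually costs an extra callee-saved register depends on the target and the register allocator, which is what PR80155 tracks.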
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #11 from Thomas Preud'homme ---
(In reply to Thomas Preud'homme from comment #9)
> Sadly I could not come up with a minimal testcase so far. What I can see
> from the code is that tree code hoisting increases the live range of some
> values which then translates into more spilling in reload.
>
> As an approximation I'm wondering if the maximum distance (computed in
> number of blocks traversed) from the definition to the use could be used to
> limit when the optimization is applied when optimizing for speed.

I finally managed. The bug can be reproduced by building the following for
arm-none-eabi with -S -O2 -mcpu=cortex-m7 and looking for the push in the
resulting assembly code.

fn1() {
  char *a;
  char b;
  for (; *a; a++) {
    if (b)
      a++;
    fn2();
  }
}

With -O2: r3, r4, r5 and lr are pushed.
With -O2 -fno-code-hoisting: r4 and lr are pushed only.


Similarly for -mcpu=cortex-m0plus:

enum { ENUM1, ENUM2, ENUM3 } a;
fn1() {
  char *b;
  for (; *b && a != ENUM2; b++)
    switch (a) {
    case ENUM1: a = ENUM3;
    }
}
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #10 from rguenther at suse dot de ---
On Mon, 20 Mar 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498
>
> --- Comment #9 from Thomas Preud'homme ---
> Sadly I could not come up with a minimal testcase so far. What I can see from
> the code is that tree code hoisting increases the live range of some values
> which then translates into more spilling in reload.
>
> As an approximation I'm wondering if the maximum distance (computed in number
> of blocks traversed) from the definition to the use could be used to limit
> when the optimization is applied when optimizing for speed.

Sadly the data-flow used to compute the opportunities is not suitable
for determining this. It would probably require "aging" of exprs
in the hoistable sets when propagating the dataflow for ANTIC_IN
(in principle PRE would have a similar issue).

We already restrict "distance" by requiring at least one successor
of the hoisting point to provide the value directly but we do not
limit proximity further. See do_hoist_insertion.
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #9 from Thomas Preud'homme ---
Sadly I could not come up with a minimal testcase so far. What I can see from
the code is that tree code hoisting increases the live range of some values
which then translates into more spilling in reload.

As an approximation I'm wondering if the maximum distance (computed in number
of blocks traversed) from the definition to the use could be used to limit
when the optimization is applied when optimizing for speed.
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Thomas Preud'homme changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thopre01 at gcc dot gnu.org

--- Comment #8 from Thomas Preud'homme ---
(In reply to Richard Biener from comment #7)
> Ok, so given we can't have PRE do as good as predcom and a "cost model" for
> PRE is out of the question for GCC 7 the following dumbs down PRE again. It
> does so in the very much simplest way rather than trying to block this only
> during elimination / insertion. This should be definitely revisited for GCC
> 8.
>
> Index: gcc/tree-ssa-pre.c
> ===
> --- gcc/tree-ssa-pre.c (revision 246026)
> +++ gcc/tree-ssa-pre.c (working copy)
> @@ -1468,10 +1468,20 @@ phi_translate_1 (pre_expr expr, bitmap_s
>                  leader for it.  */
>               if (constant->kind != CONSTANT)
>                 {
> -                 unsigned value_id = get_expr_value_id (constant);
> -                 constant = find_leader_in_sets (value_id, set1, set2);
> -                 if (constant)
> -                   return constant;
> +                 /* Do not allow simplifications to non-constants over
> +                    backedges as this will likely result in a loop PHI node
> +                    to be inserted and increased register pressure.
> +                    See PR77498 - this avoids doing predcoms work in
> +                    a less efficient way.  */
> +                 if (find_edge (pred, phiblock)->flags & EDGE_DFS_BACK)
> +                   ;
> +                 else
> +                   {
> +                     unsigned value_id = get_expr_value_id (constant);
> +                     constant = find_leader_in_sets (value_id, set1, set2);
> +                     if (constant)
> +                       return constant;
> +                   }
>                 }
>               else
>                 return constant;

I don't know for Yuri's issue but at least it sadly does not help with the
problem reported by Andre for arm-none-eabi [1]. I'll try to come up with a
testcase next week.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498#c2
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #7 from Richard Biener ---
Ok, so given we can't have PRE do as good as predcom and a "cost model" for
PRE is out of the question for GCC 7 the following dumbs down PRE again. It
does so in the very much simplest way rather than trying to block this only
during elimination / insertion. This should be definitely revisited for GCC 8.

Index: gcc/tree-ssa-pre.c
===
--- gcc/tree-ssa-pre.c (revision 246026)
+++ gcc/tree-ssa-pre.c (working copy)
@@ -1468,10 +1468,20 @@ phi_translate_1 (pre_expr expr, bitmap_s
                 leader for it.  */
              if (constant->kind != CONSTANT)
                {
-                 unsigned value_id = get_expr_value_id (constant);
-                 constant = find_leader_in_sets (value_id, set1, set2);
-                 if (constant)
-                   return constant;
+                 /* Do not allow simplifications to non-constants over
+                    backedges as this will likely result in a loop PHI node
+                    to be inserted and increased register pressure.
+                    See PR77498 - this avoids doing predcoms work in
+                    a less efficient way.  */
+                 if (find_edge (pred, phiblock)->flags & EDGE_DFS_BACK)
+                   ;
+                 else
+                   {
+                     unsigned value_id = get_expr_value_id (constant);
+                     constant = find_leader_in_sets (value_id, set1, set2);
+                     if (constant)
+                       return constant;
+                   }
                }
              else
                return constant;
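
The patch in comment #7 keys on the EDGE_DFS_BACK flag of the edge from pred to phiblock. As an illustration of what that flag stands for, here is a minimal sketch of classifying back edges with a depth-first walk on a toy CFG. It is not GCC's implementation: struct toy_cfg, toy_dfs and toy_mark_dfs_back_edges are hypothetical names, and the adjacency-matrix representation is an assumption chosen for brevity.

```c
#include <stdbool.h>

#define MAX_BLOCKS 8

enum visit_state { UNVISITED, ON_STACK, DONE };

/* Hypothetical toy CFG: succ[a][b] is true when there is an edge a -> b.
   dfs_back[a][b] is filled in by toy_mark_dfs_back_edges.  */
struct toy_cfg
{
  int n_blocks;
  bool succ[MAX_BLOCKS][MAX_BLOCKS];
  bool dfs_back[MAX_BLOCKS][MAX_BLOCKS];
};

static void toy_dfs (struct toy_cfg *g, int bb, enum visit_state *state)
{
  state[bb] = ON_STACK;
  for (int s = 0; s < g->n_blocks; ++s)
    {
      if (!g->succ[bb][s])
        continue;
      if (state[s] == ON_STACK)
        /* The target is still being visited, i.e. it is an ancestor
           in the DFS tree: this is a back edge.  */
        g->dfs_back[bb][s] = true;
      else if (state[s] == UNVISITED)
        toy_dfs (g, s, state);
    }
  state[bb] = DONE;
}

/* Mark every edge that closes a cycle in a DFS started at ENTRY.  */
void toy_mark_dfs_back_edges (struct toy_cfg *g, int entry)
{
  enum visit_state state[MAX_BLOCKS] = { UNVISITED };
  for (int a = 0; a < g->n_blocks; ++a)
    for (int b = 0; b < g->n_blocks; ++b)
      g->dfs_back[a][b] = false;
  toy_dfs (g, entry, state);
}
```

In this toy model the check in the patch corresponds to testing dfs_back[pred][phiblock]: when it is set, the translation goes over a loop latch and the lookup for a non-constant leader is skipped.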
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #6 from Richard Biener ---
For a testcase trying to show the issue:

double U[1024];
double V[1024];

void foo (void)
{
  for (unsigned i = 1; i < 1023; ++i)
    V[i] = U[i-1] + U[i] + U[i+1];
}

we get from PRE (.optimized, w/ IVO disabled):

   [1.00%]:
  pretmp_19 = U[0];
  pretmp_21 = U[1];

   [99.00%]:
  # i_15 = PHI <_5(3), 1(2)>
  # prephitmp_20 = PHI
  # prephitmp_22 = PHI <_6(3), pretmp_21(2)>
  # ivtmp_1 = PHI
  _5 = i_15 + 1;
  _6 = U[_5];
  _17 = _6 + prephitmp_22;
  _7 = _17 + prephitmp_20;
  V[i_15] = _7;
  ivtmp_18 = ivtmp_1 + 4294967295;
  if (ivtmp_18 != 0)
    goto ; [98.99%]
  else
    goto ; [1.01%]

   [1.00%]:
  return;

while predcom does the same transform but unrolls the loop:

   [50.00%]:
  # i_15 = PHI <1(2), _43(3)>
  # ivtmp_18 = PHI <1022(2), ivtmp_49(3)>
  # U_I_lsm0.3_30 = PHI <_33(2), _6(3)>
  # U_I_lsm1.4_31 = PHI <_34(2), _44(3)>
  # ivtmp_50 = PHI <1021(2), ivtmp_51(3)>
  _5 = i_15 + 1;
  _6 = U[_5];
  _37 = U_I_lsm0.3_30 + U_I_lsm1.4_31;
  _7 = _6 + _37;
  V[i_15] = _7;
  _43 = i_15 + 2;
  _44 = U[_43];
  _46 = U_I_lsm1.4_31 + _44;
  _47 = _6 + _46;
  V[_5] = _47;
  ivtmp_49 = ivtmp_18 + 4294967294;
  ivtmp_51 = ivtmp_50 + 4294967294;
  if (ivtmp_51 > 1)
    goto ; [98.00%]
  else
    goto ; [2.00%]

The register pressure created by both is the same. For the more complex
testcase PRE misses the "combination chains" because we exclude them as

Found partial redundancy for expression {plus_expr,_2,_3} (0012)
Skipping insertion of phi for partial redundancy: Looks like an induction variable

When we mitigate that, PRE can handle all cases where association is correct.
Predictive commoning in addition to that does re-association of adds to
enable more chains (so the mitigation doesn't help for the original testcase).

Testcase that is helped this way:

double U[1024], W[1024];
double V[1024];

void foo (void)
{
  for (unsigned i = 1; i < 1023; ++i)
    V[i] = (U[i-1] + W[i-1]) + (U[i] + W[i]) + (U[i+1] + W[i+1]);
}

and PRE produces

   [99.00%]:
  # i_21 = PHI <_9(3), 1(2)>
  # prephitmp_43 = PHI <_23(3), _42(2)>
  # prephitmp_45 = PHI
  # ivtmp_40 = PHI
  _9 = i_21 + 1;
  _10 = U[_9];
  _11 = W[_9];
  _23 = _10 + _11;
  _1 = _23 + prephitmp_43;
  _13 = _1 + prephitmp_45;
  V[i_21] = _13;
  ivtmp_38 = ivtmp_40 + 4294967295;
  if (ivtmp_38 != 0)
    goto ; [98.99%]
  else
    goto ; [1.01%]

note how we PRE U[] + W[] rather than U[] and W[].

Index: gcc/tree-ssa-pre.c
===
--- gcc/tree-ssa-pre.c (revision 245594)
+++ gcc/tree-ssa-pre.c (working copy)
@@ -3008,7 +3008,9 @@ insert_into_preds_of_block (basic_block
                                                   EDGE_PRED (block, 1)->src);
          /* Induction variables only have one edge inside the loop.  */
          if ((firstinsideloop ^ secondinsideloop)
-             && expr->kind != REFERENCE)
+             && expr->kind != REFERENCE
+             && (expr->kind != NARY
+                 || INTEGRAL_TYPE_P (PRE_EXPR_NARY (expr)->type)))
            {
              if (dump_file && (dump_flags & TDF_DETAILS))
                fprintf (dump_file, "Skipping insertion of phi for partial redundancy: Looks like an induction variable\n");

for the original testcase in this bug this removes two IVs (but the IV
detection is lame). What we mostly want to avoid here is creating IVs for
derived IVs, it might be enough to disregard

   && expr->kind == NARY
   && (PRE_EXPR_NARY (expr)->opcode == PLUS_EXPR
       || PRE_EXPR_NARY (expr)->opcode == POINTER_PLUS_EXPR
       || PRE_EXPR_NARY (expr)->opcode == MINUS_EXPR)
   && TREE_CODE (PRE_EXPR_NARY (expr)->op[1]) == INTEGER_CST)

but as said, a solution inside PRE might improve some cases but it won't
catch the cases needing re-association. Running pcom before PRE would be
best here but then we've been there before (and put pcom after vectorization
from previously before it).
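
To make the first dump in comment #6 easier to follow, the loop-carried form that PRE produces for the three-point stencil can be written out by hand. The sketch below is an approximation under assumptions: the arrays are passed as parameters instead of being globals, STENCIL_N and the function names are made up, and the additions keep the source association rather than the order shown in the dump.

```c
#define STENCIL_N 1024

/* Straightforward form: three loads from U per iteration.  */
void stencil_plain (const double *U, double *V)
{
  for (unsigned i = 1; i < STENCIL_N - 1; ++i)
    V[i] = U[i - 1] + U[i] + U[i + 1];
}

/* Hand-written loop-carried form: one load per iteration, with the
   two older values carried in scalars across the backedge.  */
void stencil_carried (const double *U, double *V)
{
  double prev = U[0];   /* plays the role of pretmp_19 */
  double cur = U[1];    /* plays the role of pretmp_21 */
  for (unsigned i = 1; i < STENCIL_N - 1; ++i)
    {
      double next = U[i + 1];   /* the only load left in the loop */
      V[i] = prev + cur + next;
      prev = cur;               /* loop-carried values, the PHIs in the dump */
      cur = next;
    }
}
```

The second form trades two loads per iteration for two values that stay live around the whole loop. With a single stencil that is cheap; comment #3 describes the same trade at a scale of 18 loads against 18 loop-carried values, where it leads to spills.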
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #5 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #4)
> CCing Bin, he was looking into PRE/predcom as well AFAIR. predictive
> commoning here performs unrolling to be able to avoid some loop-carried
> dependencies
> while PRE has the larger distances covered by for example
>
>    [85.00%]:
>   # prephitmp_656 = PHI <_125(6), pretmp_655(5)>
>   # prephitmp_674 = PHI
>
> this kind of loop carried PHIs should be a hint for a tree level unroller
> to perform unrolling (just in case they literally appear in source for
> example).
> OTOH if unrolling can solve the RA problem then it must be solvable not
> unrolled as well? Note that with predcom we end up with 11 pointer IVs while
> with PRE we have just one (but use 20 others from the outer loop...) -
> possibly
> the versioning predcom performs makes IVO not do any outer loop IVO. Using
> -fschedule-insns -fsched-pressure helps somewhat but not much.
>
> So it looks like a RA related issue and IVO is as much relevant as PRE doing
> predictive commoning at -O2 (and at -O3 doing predcom's job but worse in this
> case).
>
> During PHI translation we can tame this down to a level pre this rev. again,
> for example with the following. But ideally we'd compute antic and do
> insertion
> for the full dataflow problem and only apply this "cost modeling" during
> elimination to not lose secondary level transforms that are profitable
> (also below we do not know whether we need to insert a PHI for the value in
> the end).
>
> Index: gcc/tree-ssa-pre.c
> ===
> --- gcc/tree-ssa-pre.c (revision 244484)
> +++ gcc/tree-ssa-pre.c (working copy)
> @@ -1465,16 +1465,16 @@ phi_translate_1 (pre_expr expr, bitmap_s
>               {
>                 /* For non-CONSTANTs we have to make sure we can eventually
>                    insert the expression.  Which means we need to have a
> -                  leader for it.  */
> -               if (constant->kind != CONSTANT)
> +                  leader for it.  Avoid doing this across backedges though.  */
> +               if (constant->kind == CONSTANT)
> +                 return constant;
> +               else if (! dominated_by_p (CDI_DOMINATORS, pred, phiblock))
>                   {
>                     unsigned value_id = get_expr_value_id (constant);
>                     constant = find_leader_in_sets (value_id, set1, set2);
>                     if (constant)
>                       return constant;
>                   }
> -               else
> -                 return constant;
>               }
>
>             tree result = vn_nary_op_lookup_pieces (newnary->length,
>
>
> But as said, a whole different question is whether we want PRE to add IVs at
> all
> (but we do have some testcases requesting exactly that, for example
> gcc.dg/tree-ssa/pr71347.c or ssa-pre-23.c requesting store-motion w/o
> actually sinking the store).
>
> Index: gcc/tree-ssa-pre.c
> ===
> --- gcc/tree-ssa-pre.c (revision 244484)
> +++ gcc/tree-ssa-pre.c (working copy)
> @@ -4290,6 +4290,31 @@ eliminate_dom_walker::before_dom_childre
>                                     VN_INFO_RANGE_INFO (lhs));
>             }
>
> +         if (sprime
> +             && TREE_CODE (sprime) == SSA_NAME
> +             && do_pre
> +             && loop_outer (b->loop_father)
> +             && has_zero_uses (sprime)
> +             && bitmap_bit_p (inserted_exprs, SSA_NAME_VERSION (sprime)))
> +           {
> +             gimple *def_stmt = SSA_NAME_DEF_STMT (sprime);
> +             basic_block def_bb = gimple_bb (def_stmt);
> +             if (gimple_code (def_stmt) == GIMPLE_PHI
> +                 && def_bb->loop_father->header == def_bb)
> +               {
> +                 bool ok = true;
> +                 edge_iterator ei;
> +                 edge e;
> +                 FOR_EACH_EDGE (e, ei, def_bb->preds)
> +                   if (dominated_by_p (CDI_DOMINATORS, e->src, e->dest)
> +                       && TREE_CODE (PHI_ARG_DEF_FROM_EDGE (def_stmt, e)) == SSA_NAME)
> +                     ok = false;
> +                 /* Don't keep sprime available.  */
> +                 if (!ok)
> +                   sprime = NULL_TREE;
> +               }
> +           }
> +
>           /* Inhibit the use of an inserted PHI on a loop header when
>              the address of the memory reference is a simple induction
>              variable.  In other cases the vectorizer won't do anything

I am trying to model register pressure and use that information to direct
predcom. So far it detects only one case, 436.cactusADM, but does improve it
a lot.

It's hard to model the cost of a pre_expr, but for loop-carried ones we may
be able to simply control the number using pressure information too.
According to the report, t
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization, ra
                 CC|                            |amker at gcc dot gnu.org

--- Comment #4 from Richard Biener ---
CCing Bin, he was looking into PRE/predcom as well AFAIR. predictive
commoning here performs unrolling to be able to avoid some loop-carried
dependencies while PRE has the larger distances covered by for example

   [85.00%]:
  # prephitmp_656 = PHI <_125(6), pretmp_655(5)>
  # prephitmp_674 = PHI

this kind of loop carried PHIs should be a hint for a tree level unroller
to perform unrolling (just in case they literally appear in source for
example).
OTOH if unrolling can solve the RA problem then it must be solvable not
unrolled as well? Note that with predcom we end up with 11 pointer IVs while
with PRE we have just one (but use 20 others from the outer loop...) -
possibly the versioning predcom performs makes IVO not do any outer loop IVO.
Using -fschedule-insns -fsched-pressure helps somewhat but not much.

So it looks like a RA related issue and IVO is as much relevant as PRE doing
predictive commoning at -O2 (and at -O3 doing predcom's job but worse in this
case).

During PHI translation we can tame this down to a level pre this rev. again,
for example with the following. But ideally we'd compute antic and do
insertion for the full dataflow problem and only apply this "cost modeling"
during elimination to not lose secondary level transforms that are profitable
(also below we do not know whether we need to insert a PHI for the value in
the end).

Index: gcc/tree-ssa-pre.c
===
--- gcc/tree-ssa-pre.c (revision 244484)
+++ gcc/tree-ssa-pre.c (working copy)
@@ -1465,16 +1465,16 @@ phi_translate_1 (pre_expr expr, bitmap_s
              {
                /* For non-CONSTANTs we have to make sure we can eventually
                   insert the expression.  Which means we need to have a
-                  leader for it.  */
-               if (constant->kind != CONSTANT)
+                  leader for it.  Avoid doing this across backedges though.  */
+               if (constant->kind == CONSTANT)
+                 return constant;
+               else if (! dominated_by_p (CDI_DOMINATORS, pred, phiblock))
                  {
                    unsigned value_id = get_expr_value_id (constant);
                    constant = find_leader_in_sets (value_id, set1, set2);
                    if (constant)
                      return constant;
                  }
-               else
-                 return constant;
              }
 
            tree result = vn_nary_op_lookup_pieces (newnary->length,


But as said, a whole different question is whether we want PRE to add IVs at
all (but we do have some testcases requesting exactly that, for example
gcc.dg/tree-ssa/pr71347.c or ssa-pre-23.c requesting store-motion w/o
actually sinking the store).

Index: gcc/tree-ssa-pre.c
===
--- gcc/tree-ssa-pre.c (revision 244484)
+++ gcc/tree-ssa-pre.c (working copy)
@@ -4290,6 +4290,31 @@ eliminate_dom_walker::before_dom_childre
                                    VN_INFO_RANGE_INFO (lhs));
            }
 
+         if (sprime
+             && TREE_CODE (sprime) == SSA_NAME
+             && do_pre
+             && loop_outer (b->loop_father)
+             && has_zero_uses (sprime)
+             && bitmap_bit_p (inserted_exprs, SSA_NAME_VERSION (sprime)))
+           {
+             gimple *def_stmt = SSA_NAME_DEF_STMT (sprime);
+             basic_block def_bb = gimple_bb (def_stmt);
+             if (gimple_code (def_stmt) == GIMPLE_PHI
+                 && def_bb->loop_father->header == def_bb)
+               {
+                 bool ok = true;
+                 edge_iterator ei;
+                 edge e;
+                 FOR_EACH_EDGE (e, ei, def_bb->preds)
+                   if (dominated_by_p (CDI_DOMINATORS, e->src, e->dest)
+                       && TREE_CODE (PHI_ARG_DEF_FROM_EDGE (def_stmt, e)) == SSA_NAME)
+                     ok = false;
+                 /* Don't keep sprime available.  */
+                 if (!ok)
+                   sprime = NULL_TREE;
+               }
+           }
+
          /* Inhibit the use of an inserted PHI on a loop header when
             the address of the memory reference is a simple induction
             variable.  In other cases the vectorizer won't do anything
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Richard Biener changed:

           What    |Removed                      |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                  |ASSIGNED
   Last reconfirmed|                             |2016-09-07
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
   Target Milestone|---                          |7.0
     Ever confirmed|0                            |1

--- Comment #3 from Richard Biener ---
Note this revision isn't really related to code hoisting. It merely allows
PRE to perform simple predictive commoning and more PRE in general. The
commoning can interfere with sinking (see the adjusted testcase).

For the testcase we apply commoning which increases register pressure. The
pcom pass does a better job (well, it was designed for this).

I suppose this PRE improvement raises the general question (again) whether
we want it to introduce loop-carried dependences at all. In this case it
trades 18 loads for 18 loop-carried dependences - optimally reg coalesced
and thus "free", maybe reg-reg copies or at worst spills (as seen here).

I'll need to think about this (again).
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |arm-none-eabi
                 CC|                            |avieira at gcc dot gnu.org

--- Comment #2 from avieira at gcc dot gnu.org ---
I am observing some regressions for arm-none-eabi on a Cortex-M0+ for a
popular embedded benchmark following this patch. I believe register pressure
might also be the root cause of this given the significant increase of loads
and registers from and to the stack. Though I need to have a better look.

Passing the option -fno-code-hoisting brings the performance numbers back up.
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #1 from Yuri Rumyantsev ---
Created attachment 39574
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39574&action=edit
test-case to reproduce

Need to compile with -O2 -ffast-math to reproduce.