[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #28 from Thomas Preud'homme ---
(In reply to rguent...@suse.de from comment #27)
> On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote:
> > --- Comment #26 from Thomas Preud'homme ---
> > Yes, true.
> >
> > By the way, your first patch did improve things a bit. If it's not too
> > invasive, would it be possible to commit it for GCC 7?
>
> It's not really in shape to be committable. I'll see if I can come up
> with something better early in stage1 that might be back-portable.

Thanks.

> > I think I'll try the approach you suggested in
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498#c10 and see if I get
> > some better results.
>
> For this case the distance is minimal already, so I don't expect you can
> do anything (there's also not really separate dataflow for hoisting).

If I remember correctly, the key point in the benchmark was that the basic
block the code was hoisted from could be reached by a much longer path. So
although the code was hoisted to a direct predecessor, that value sometimes
had to be kept live for a very long time.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #27 from rguenther at suse dot de ---
On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote:

> --- Comment #26 from Thomas Preud'homme ---
> Yes, true.
>
> By the way, your first patch did improve things a bit. If it's not too
> invasive, would it be possible to commit it for GCC 7?

It's not really in shape to be committable. I'll see if I can come up
with something better early in stage1 that might be back-portable.

> I think I'll try the approach you suggested in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498#c10 and see if I get
> some better results.

For this case the distance is minimal already, so I don't expect you can
do anything (there's also not really separate dataflow for hoisting).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #26 from Thomas Preud'homme --- (In reply to rguent...@suse.de from comment #25) > On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 > > > > --- Comment #24 from Thomas Preud'homme --- > > (In reply to rguent...@suse.de from comment #23) > > > On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote: > > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 > > > > > > > > --- Comment #22 from Thomas Preud'homme > > > > --- > > > > (In reply to Thomas Preud'homme from comment #21) > > > > > > > > > > I can see this behavior for Cortex-M0+ indeed but the results are > > > > > different > > > > > for Cortex-M7 for me: > > > > > > > > > > > > > > > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2 > > > > > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited > > > > > -fdump-tree-all -funroll-all-loops > > > > > > > > > > % grep push cortex-m0plus_reproducer.s > > > > > push{r4, r5} > > > > > > > > > > > > > > > > > > > > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2 > > > > > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited > > > > > -fdump-tree-all -funroll-all-loops -fno-code-hoisting > > > > > > > > > > % grep push cortex-m0plus_reproducer.s > > > > > push{r4} > > > > > > > > > > Would you mind checking again? > > > > > > > > btw, this toolchain was built on 2017-04-04 in case that's the issue. > > > > > > Ok, I can reproduce it now. With the proposed patch to fix the other > > > case it regresses with -fno-code-hoisting though (comment #9 plus fix). > > > > > > Note that code-hosting is simply doing it's job here, hoisting common > > > code out of a switch. > > > > I understand that. Code is smaller so it's definitely worthwhile at -Os. > > But it > > sometimes lead to worse performance due to more spilling. > > But there's no good way to estimate that. 
We have tons of transforms > that have the issue of being "sometimes worse" Yes true. By the way, your first patch did improve things a bit. If it's not too invasive could it be possible to commit it for GCC 7? I think I'll try the approach you suggested in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498#c10 and see if I get some better results.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #25 from rguenther at suse dot de --- On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 > > --- Comment #24 from Thomas Preud'homme --- > (In reply to rguent...@suse.de from comment #23) > > On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote: > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 > > > > > > --- Comment #22 from Thomas Preud'homme --- > > > (In reply to Thomas Preud'homme from comment #21) > > > > > > > > I can see this behavior for Cortex-M0+ indeed but the results are > > > > different > > > > for Cortex-M7 for me: > > > > > > > > > > > > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2 > > > > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited > > > > -fdump-tree-all -funroll-all-loops > > > > > > > > % grep push cortex-m0plus_reproducer.s > > > > push{r4, r5} > > > > > > > > > > > > > > > > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2 > > > > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited > > > > -fdump-tree-all -funroll-all-loops -fno-code-hoisting > > > > > > > > % grep push cortex-m0plus_reproducer.s > > > > push{r4} > > > > > > > > Would you mind checking again? > > > > > > btw, this toolchain was built on 2017-04-04 in case that's the issue. > > > > Ok, I can reproduce it now. With the proposed patch to fix the other > > case it regresses with -fno-code-hoisting though (comment #9 plus fix). > > > > Note that code-hosting is simply doing it's job here, hoisting common > > code out of a switch. > > I understand that. Code is smaller so it's definitely worthwhile at -Os. But > it > sometimes lead to worse performance due to more spilling. But there's no good way to estimate that. We have tons of transforms that have the issue of being "sometimes worse"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #24 from Thomas Preud'homme --- (In reply to rguent...@suse.de from comment #23) > On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 > > > > --- Comment #22 from Thomas Preud'homme --- > > (In reply to Thomas Preud'homme from comment #21) > > > > > > I can see this behavior for Cortex-M0+ indeed but the results are > > > different > > > for Cortex-M7 for me: > > > > > > > > > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2 > > > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited > > > -fdump-tree-all -funroll-all-loops > > > > > > % grep push cortex-m0plus_reproducer.s > > > push{r4, r5} > > > > > > > > > > > > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2 > > > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited > > > -fdump-tree-all -funroll-all-loops -fno-code-hoisting > > > > > > % grep push cortex-m0plus_reproducer.s > > > push{r4} > > > > > > Would you mind checking again? > > > > btw, this toolchain was built on 2017-04-04 in case that's the issue. > > Ok, I can reproduce it now. With the proposed patch to fix the other > case it regresses with -fno-code-hoisting though (comment #9 plus fix). > > Note that code-hosting is simply doing it's job here, hoisting common > code out of a switch. I understand that. Code is smaller so it's definitely worthwhile at -Os. But it sometimes lead to worse performance due to more spilling.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #23 from rguenther at suse dot de ---
On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote:

> --- Comment #22 from Thomas Preud'homme ---
> (In reply to Thomas Preud'homme from comment #21)
> >
> > I can see this behavior for Cortex-M0+ indeed, but the results are
> > different for Cortex-M7 for me:
> >
> > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited
> > -fdump-tree-all -funroll-all-loops
> >
> > % grep push cortex-m0plus_reproducer.s
> > push {r4, r5}
> >
> > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited
> > -fdump-tree-all -funroll-all-loops -fno-code-hoisting
> >
> > % grep push cortex-m0plus_reproducer.s
> > push {r4}
> >
> > Would you mind checking again?
>
> By the way, this toolchain was built on 2017-04-04, in case that's the issue.

OK, I can reproduce it now. With the proposed patch to fix the other case
it regresses with -fno-code-hoisting, though (comment #9 plus fix).

Note that code hoisting is simply doing its job here, hoisting common
code out of a switch.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #22 from Thomas Preud'homme ---
(In reply to Thomas Preud'homme from comment #21)
>
> I can see this behavior for Cortex-M0+ indeed, but the results are
> different for Cortex-M7 for me:
>
> % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited
> -fdump-tree-all -funroll-all-loops
>
> % grep push cortex-m0plus_reproducer.s
> push {r4, r5}
>
> % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited
> -fdump-tree-all -funroll-all-loops -fno-code-hoisting
>
> % grep push cortex-m0plus_reproducer.s
> push {r4}
>
> Would you mind checking again?

By the way, this toolchain was built on 2017-04-04, in case that's the issue.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #21 from Thomas Preud'homme --- (In reply to Richard Biener from comment #20) > (In reply to Thomas Preud'homme from comment #19) > > (In reply to Richard Biener from comment #18) > > > (In reply to Thomas Preud'homme from comment #17) > > > > (In reply to rguent...@suse.de from comment #16) > > > > > On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote: > > > > > > > > > > > > Funnily this led back to the Cortex-M0+ reduced testcase. With the > > > > > > patch in > > > > > > comment #13 applied we can still see a difference in the push (one > > > > > > register > > > > > > pushed Vs 0). > > > > > > > > > > I can't reproduce zero pushes here I get three with/without > > > > > -fno-code-hoisting. code hoisting hoists the two loads inside > > > > > the switch before it so we have > > > > > > > > Ooops my apologize, it needs more flags indeed. -O2 -funroll-all-loops > > > > shows > > > > 2 registers pushed Vs 1 when -fno-code-hoisting is added. > > > > > > Still can't reproduce. I've configured with > > > > > > /space/rguenther/src/svn/gcc-7-branch/configure > > > --target=arm-suse-linux-gnueabi --disable-libstdcxx-pch > > > --enable-languages=c,c++ > > > > > > and am using > > > > > > ./cc1 -quiet cortex-m0plus_reproducer.c -O2 -mcpu=cortex-m0plus -mthumb -I > > > include -fdump-tree-pre-details -fdump-tree-crited -fdump-tree-all > > > -funroll-all-loops -fno-code-hoisting > > > > Oh my bad, I was still not clear enough. I meant using the Cortex-M0+ > > testcase but build it for Cortex-M7. Hopefully with -mcpu=cortex-m7 you > > should see a difference between with and without code hoisting. > > I do. Two pushes with code hoisting and three without. Changed -mcpu above > to -mcpu=cortex-m7. 
I can see this behavior for Cortex-M0+ indeed but the results are different for Cortex-M7 for me: % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2 -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited -fdump-tree-all -funroll-all-loops % grep push cortex-m0plus_reproducer.s push{r4, r5} % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2 -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited -fdump-tree-all -funroll-all-loops -fno-code-hoisting % grep push cortex-m0plus_reproducer.s push{r4} Would you mind checking again?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

Jeffrey A. Law changed:

           What    |Removed |Added
---------------------------------------------------
           Priority|P3      |P2
                 CC|        |law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #20 from Richard Biener --- (In reply to Thomas Preud'homme from comment #19) > (In reply to Richard Biener from comment #18) > > (In reply to Thomas Preud'homme from comment #17) > > > (In reply to rguent...@suse.de from comment #16) > > > > On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote: > > > > > > > > > > Funnily this led back to the Cortex-M0+ reduced testcase. With the > > > > > patch in > > > > > comment #13 applied we can still see a difference in the push (one > > > > > register > > > > > pushed Vs 0). > > > > > > > > I can't reproduce zero pushes here I get three with/without > > > > -fno-code-hoisting. code hoisting hoists the two loads inside > > > > the switch before it so we have > > > > > > Ooops my apologize, it needs more flags indeed. -O2 -funroll-all-loops > > > shows > > > 2 registers pushed Vs 1 when -fno-code-hoisting is added. > > > > Still can't reproduce. I've configured with > > > > /space/rguenther/src/svn/gcc-7-branch/configure > > --target=arm-suse-linux-gnueabi --disable-libstdcxx-pch > > --enable-languages=c,c++ > > > > and am using > > > > ./cc1 -quiet cortex-m0plus_reproducer.c -O2 -mcpu=cortex-m0plus -mthumb -I > > include -fdump-tree-pre-details -fdump-tree-crited -fdump-tree-all > > -funroll-all-loops -fno-code-hoisting > > Oh my bad, I was still not clear enough. I meant using the Cortex-M0+ > testcase but build it for Cortex-M7. Hopefully with -mcpu=cortex-m7 you > should see a difference between with and without code hoisting. I do. Two pushes with code hoisting and three without. Changed -mcpu above to -mcpu=cortex-m7.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #19 from Thomas Preud'homme --- (In reply to Richard Biener from comment #18) > (In reply to Thomas Preud'homme from comment #17) > > (In reply to rguent...@suse.de from comment #16) > > > On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote: > > > > > > > > Funnily this led back to the Cortex-M0+ reduced testcase. With the > > > > patch in > > > > comment #13 applied we can still see a difference in the push (one > > > > register > > > > pushed Vs 0). > > > > > > I can't reproduce zero pushes here I get three with/without > > > -fno-code-hoisting. code hoisting hoists the two loads inside > > > the switch before it so we have > > > > Ooops my apologize, it needs more flags indeed. -O2 -funroll-all-loops shows > > 2 registers pushed Vs 1 when -fno-code-hoisting is added. > > Still can't reproduce. I've configured with > > /space/rguenther/src/svn/gcc-7-branch/configure > --target=arm-suse-linux-gnueabi --disable-libstdcxx-pch > --enable-languages=c,c++ > > and am using > > ./cc1 -quiet cortex-m0plus_reproducer.c -O2 -mcpu=cortex-m0plus -mthumb -I > include -fdump-tree-pre-details -fdump-tree-crited -fdump-tree-all > -funroll-all-loops -fno-code-hoisting Oh my bad, I was still not clear enough. I meant using the Cortex-M0+ testcase but build it for Cortex-M7. Hopefully with -mcpu=cortex-m7 you should see a difference between with and without code hoisting.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #18 from Richard Biener --- (In reply to Thomas Preud'homme from comment #17) > (In reply to rguent...@suse.de from comment #16) > > On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote: > > > > > > Funnily this led back to the Cortex-M0+ reduced testcase. With the patch > > > in > > > comment #13 applied we can still see a difference in the push (one > > > register > > > pushed Vs 0). > > > > I can't reproduce zero pushes here I get three with/without > > -fno-code-hoisting. code hoisting hoists the two loads inside > > the switch before it so we have > > Ooops my apologize, it needs more flags indeed. -O2 -funroll-all-loops shows > 2 registers pushed Vs 1 when -fno-code-hoisting is added. Still can't reproduce. I've configured with /space/rguenther/src/svn/gcc-7-branch/configure --target=arm-suse-linux-gnueabi --disable-libstdcxx-pch --enable-languages=c,c++ and am using ./cc1 -quiet cortex-m0plus_reproducer.c -O2 -mcpu=cortex-m0plus -mthumb -I include -fdump-tree-pre-details -fdump-tree-crited -fdump-tree-all -funroll-all-loops -fno-code-hoisting seeing 3 pushes: fn1: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 movsr3, #0 push{r4, r5, lr} I see softfp being used, not sure how I could change that (if that's what I'm missing). Btw, -funroll-all-loops disqualifies this testcase ;)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #17 from Thomas Preud'homme ---
(In reply to rguent...@suse.de from comment #16)
> On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:
>
> > Funnily this led back to the Cortex-M0+ reduced testcase. With the patch
> > in comment #13 applied we can still see a difference in the push (one
> > register pushed vs. zero).
>
> I can't reproduce zero pushes here; I get three with/without
> -fno-code-hoisting. Code hoisting hoists the two loads inside
> the switch before it, so we have

Oops, my apologies, it needs more flags indeed. -O2 -funroll-all-loops shows
two registers pushed vs. one when -fno-code-hoisting is added.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #16 from rguenther at suse dot de --- On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 > > --- Comment #15 from Thomas Preud'homme --- > (In reply to Thomas Preud'homme from comment #14) > > (In reply to rguent...@suse.de from comment #13) > > > On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote: > > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 > > > > > > > > --- Comment #12 from Thomas Preud'homme > > > > --- > > > > (In reply to Richard Biener from comment #9) > > > > > Ah, the patches do not fix the testcase because the testcase is _not_ > > > > > the > > > > > PRE-creates-IV case. It's indeed simply hoisting/PRE at work > > > > > transforming > > > > > > > > > > # a_14 = PHI> > > > > if (!b) > > > > > a_8 = a_14 + 1; > > > > > > > > > > # a_2 = PHI > > > > > a_10 = a_2 + 1; > > > > > ... = *(a_2 + 1); > > > > > > > > > > to > > > > > > > > > > # a_14 = PHI > > > > > _4 = a_14 + 1; > > > > > if (b) > > > > > _3 = _4 + 1; > > > > > > > > > > # a_2 = PHI > > > > > # prephitmp_12 = PHI <_4, _3> > > > > > ... = *(a_2 + 1); > > > > > > > > > > increasing register pressure mainly because nothing figures that a_2 > > > > > + 1 > > > > > in the dereference can be replaced by prephitmp_12 ... > > > > > > > > > > So this is a missed SLSR opportunity or, in this simple form, a missed > > > > > PRE/CSE opportunity. 
Fixing that with the following (otherwise > > > > > untested) > > > > > restores good code generation for the testcase: > > > > > > > > > > Index: gcc/tree-ssa-pre.c > > > > > === > > > > > --- gcc/tree-ssa-pre.c (revision 246414) > > > > > +++ gcc/tree-ssa-pre.c (working copy) > > > > > @@ -4636,6 +4610,35 @@ eliminate_dom_walker::before_dom_childre > > > > > } > > > > > } > > > > > > > > > > + if (gimple_has_ops (stmt)) > > > > > + for (unsigned i = 0; i < gimple_num_ops (stmt); ++i) > > > > > + { > > > > > + tree op = gimple_op (stmt, i); > > > > > + if (op) > > > > > + op = get_base_address (op); > > > > > + if (op > > > > > + && TREE_CODE (op) == MEM_REF > > > > > + && ! integer_zerop (TREE_OPERAND (op, 1))) > > > > > + { > > > > > + tree ops[2]; > > > > > + vn_nary_op_t nary; > > > > > + ops[0] = TREE_OPERAND (op, 0); > > > > > + ops[1] = TREE_OPERAND (op, 1); > > > > > + tree res = vn_nary_op_lookup_pieces (2, > > > > > POINTER_PLUS_EXPR, > > > > > +TREE_TYPE > > > > > (ops[0]), > > > > > +ops, ); > > > > > + if (res && TREE_CODE (res) == SSA_NAME) > > > > > + res = eliminate_avail (res); > > > > > + if (res) > > > > > + { > > > > > + TREE_OPERAND (op, 0) = res; > > > > > + TREE_OPERAND (op, 1) > > > > > + = build_int_cst (TREE_TYPE (TREE_OPERAND (op, > > > > > 1)), 0); > > > > > + gimple_set_modified (stmt, true); > > > > > > Add > > > > > > if (TREE_CODE (res) == SSA_NAME > > > && !is_gimple_debug (stmt)) > > > gimple_set_plf (SSA_NAME_DEF_STMT (res), > > > NECESSARY, true); > > > > > > here. > > > > > > > > + } > > > > > + } > > > > > + } > > > > > + > > > > >if (gimple_modified_p (stmt)) > > > > > { > > > > > /* If a formerly non-invariant ADDR_EXPR is turned into an > > > > > > > > > > note that in general optimzing > > > > > > > > > >q = p + 1; > > > > > = ...*(p + 1); > > > > > > > > > > "back" to *q will be undone by forwprop/stmt folding later but in this > > > > > case the feeding stmt is a PHI node and not a pointer-plus. 
It still > > > > > means that the change might be a bit too disruptive at this point > > > > > (we could restricit it a bit by only handling the case where we don't > > > > > replace with a pointer-plus result). > > > > > > > > Thanks for your work on this! Sadly GCC ICEs with this patch: > > > > > > > > 0xd36f53 update_dep_bb > > > > gcc/tree-ssa-tail-merge.c:411 > > > > 0xd38f54 stmt_update_dep_bb > > > > gcc/tree-ssa-tail-merge.c:429 > > > > 0xd38f54 same_succ_hash > > > > gcc/tree-ssa-tail-merge.c:452 > > > > 0xd38f54 find_same_succ_bb > > > > gcc/tree-ssa-tail-merge.c:717 > > > > 0xd39927 find_same_succ > > > > gcc/tree-ssa-tail-merge.c:748 > > > > 0xd39927 init_worklist > > > >
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #15 from Thomas Preud'homme --- (In reply to Thomas Preud'homme from comment #14) > (In reply to rguent...@suse.de from comment #13) > > On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote: > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 > > > > > > --- Comment #12 from Thomas Preud'homme --- > > > (In reply to Richard Biener from comment #9) > > > > Ah, the patches do not fix the testcase because the testcase is _not_ > > > > the > > > > PRE-creates-IV case. It's indeed simply hoisting/PRE at work > > > > transforming > > > > > > > > # a_14 = PHI> > > > if (!b) > > > > a_8 = a_14 + 1; > > > > > > > > # a_2 = PHI > > > > a_10 = a_2 + 1; > > > > ... = *(a_2 + 1); > > > > > > > > to > > > > > > > > # a_14 = PHI > > > > _4 = a_14 + 1; > > > > if (b) > > > > _3 = _4 + 1; > > > > > > > > # a_2 = PHI > > > > # prephitmp_12 = PHI <_4, _3> > > > > ... = *(a_2 + 1); > > > > > > > > increasing register pressure mainly because nothing figures that a_2 + 1 > > > > in the dereference can be replaced by prephitmp_12 ... > > > > > > > > So this is a missed SLSR opportunity or, in this simple form, a missed > > > > PRE/CSE opportunity. Fixing that with the following (otherwise > > > > untested) > > > > restores good code generation for the testcase: > > > > > > > > Index: gcc/tree-ssa-pre.c > > > > === > > > > --- gcc/tree-ssa-pre.c (revision 246414) > > > > +++ gcc/tree-ssa-pre.c (working copy) > > > > @@ -4636,6 +4610,35 @@ eliminate_dom_walker::before_dom_childre > > > > } > > > > } > > > > > > > > + if (gimple_has_ops (stmt)) > > > > + for (unsigned i = 0; i < gimple_num_ops (stmt); ++i) > > > > + { > > > > + tree op = gimple_op (stmt, i); > > > > + if (op) > > > > + op = get_base_address (op); > > > > + if (op > > > > + && TREE_CODE (op) == MEM_REF > > > > + && ! 
integer_zerop (TREE_OPERAND (op, 1))) > > > > + { > > > > + tree ops[2]; > > > > + vn_nary_op_t nary; > > > > + ops[0] = TREE_OPERAND (op, 0); > > > > + ops[1] = TREE_OPERAND (op, 1); > > > > + tree res = vn_nary_op_lookup_pieces (2, > > > > POINTER_PLUS_EXPR, > > > > +TREE_TYPE (ops[0]), > > > > +ops, ); > > > > + if (res && TREE_CODE (res) == SSA_NAME) > > > > + res = eliminate_avail (res); > > > > + if (res) > > > > + { > > > > + TREE_OPERAND (op, 0) = res; > > > > + TREE_OPERAND (op, 1) > > > > + = build_int_cst (TREE_TYPE (TREE_OPERAND (op, > > > > 1)), 0); > > > > + gimple_set_modified (stmt, true); > > > > Add > > > > if (TREE_CODE (res) == SSA_NAME > > && !is_gimple_debug (stmt)) > > gimple_set_plf (SSA_NAME_DEF_STMT (res), > > NECESSARY, true); > > > > here. > > > > > > + } > > > > + } > > > > + } > > > > + > > > >if (gimple_modified_p (stmt)) > > > > { > > > > /* If a formerly non-invariant ADDR_EXPR is turned into an > > > > > > > > note that in general optimzing > > > > > > > >q = p + 1; > > > > = ...*(p + 1); > > > > > > > > "back" to *q will be undone by forwprop/stmt folding later but in this > > > > case the feeding stmt is a PHI node and not a pointer-plus. It still > > > > means that the change might be a bit too disruptive at this point > > > > (we could restricit it a bit by only handling the case where we don't > > > > replace with a pointer-plus result). > > > > > > Thanks for your work on this! 
Sadly GCC ICEs with this patch: > > > > > > 0xd36f53 update_dep_bb > > > gcc/tree-ssa-tail-merge.c:411 > > > 0xd38f54 stmt_update_dep_bb > > > gcc/tree-ssa-tail-merge.c:429 > > > 0xd38f54 same_succ_hash > > > gcc/tree-ssa-tail-merge.c:452 > > > 0xd38f54 find_same_succ_bb > > > gcc/tree-ssa-tail-merge.c:717 > > > 0xd39927 find_same_succ > > > gcc/tree-ssa-tail-merge.c:748 > > > 0xd39927 init_worklist > > > gcc/tree-ssa-tail-merge.c:767 > > > 0xd39927 tail_merge_optimize(unsigned int) > > > gcc/tree-ssa-tail-merge.c:1726 > > > 0xce2d6a execute > > > gcc/tree-ssa-pre.c:5164 > > > > > > > > There's progress. Performance is improved but not as much as disabling code > hoisting. I'll try to reduce the testcase again with that patch to see if I > can find a testcase that expose all issues. Funnily this led back to the
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #14 from Thomas Preud'homme --- (In reply to rguent...@suse.de from comment #13) > On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 > > > > --- Comment #12 from Thomas Preud'homme --- > > (In reply to Richard Biener from comment #9) > > > Ah, the patches do not fix the testcase because the testcase is _not_ the > > > PRE-creates-IV case. It's indeed simply hoisting/PRE at work transforming > > > > > > # a_14 = PHI> > > if (!b) > > > a_8 = a_14 + 1; > > > > > > # a_2 = PHI > > > a_10 = a_2 + 1; > > > ... = *(a_2 + 1); > > > > > > to > > > > > > # a_14 = PHI > > > _4 = a_14 + 1; > > > if (b) > > > _3 = _4 + 1; > > > > > > # a_2 = PHI > > > # prephitmp_12 = PHI <_4, _3> > > > ... = *(a_2 + 1); > > > > > > increasing register pressure mainly because nothing figures that a_2 + 1 > > > in the dereference can be replaced by prephitmp_12 ... > > > > > > So this is a missed SLSR opportunity or, in this simple form, a missed > > > PRE/CSE opportunity. Fixing that with the following (otherwise untested) > > > restores good code generation for the testcase: > > > > > > Index: gcc/tree-ssa-pre.c > > > === > > > --- gcc/tree-ssa-pre.c (revision 246414) > > > +++ gcc/tree-ssa-pre.c (working copy) > > > @@ -4636,6 +4610,35 @@ eliminate_dom_walker::before_dom_childre > > > } > > > } > > > > > > + if (gimple_has_ops (stmt)) > > > + for (unsigned i = 0; i < gimple_num_ops (stmt); ++i) > > > + { > > > + tree op = gimple_op (stmt, i); > > > + if (op) > > > + op = get_base_address (op); > > > + if (op > > > + && TREE_CODE (op) == MEM_REF > > > + && ! 
integer_zerop (TREE_OPERAND (op, 1))) > > > + { > > > + tree ops[2]; > > > + vn_nary_op_t nary; > > > + ops[0] = TREE_OPERAND (op, 0); > > > + ops[1] = TREE_OPERAND (op, 1); > > > + tree res = vn_nary_op_lookup_pieces (2, POINTER_PLUS_EXPR, > > > +TREE_TYPE (ops[0]), > > > +ops, ); > > > + if (res && TREE_CODE (res) == SSA_NAME) > > > + res = eliminate_avail (res); > > > + if (res) > > > + { > > > + TREE_OPERAND (op, 0) = res; > > > + TREE_OPERAND (op, 1) > > > + = build_int_cst (TREE_TYPE (TREE_OPERAND (op, 1)), > > > 0); > > > + gimple_set_modified (stmt, true); > > Add > > if (TREE_CODE (res) == SSA_NAME > && !is_gimple_debug (stmt)) > gimple_set_plf (SSA_NAME_DEF_STMT (res), > NECESSARY, true); > > here. > > > > + } > > > + } > > > + } > > > + > > >if (gimple_modified_p (stmt)) > > > { > > > /* If a formerly non-invariant ADDR_EXPR is turned into an > > > > > > note that in general optimzing > > > > > >q = p + 1; > > > = ...*(p + 1); > > > > > > "back" to *q will be undone by forwprop/stmt folding later but in this > > > case the feeding stmt is a PHI node and not a pointer-plus. It still > > > means that the change might be a bit too disruptive at this point > > > (we could restricit it a bit by only handling the case where we don't > > > replace with a pointer-plus result). > > > > Thanks for your work on this! Sadly GCC ICEs with this patch: > > > > 0xd36f53 update_dep_bb > > gcc/tree-ssa-tail-merge.c:411 > > 0xd38f54 stmt_update_dep_bb > > gcc/tree-ssa-tail-merge.c:429 > > 0xd38f54 same_succ_hash > > gcc/tree-ssa-tail-merge.c:452 > > 0xd38f54 find_same_succ_bb > > gcc/tree-ssa-tail-merge.c:717 > > 0xd39927 find_same_succ > > gcc/tree-ssa-tail-merge.c:748 > > 0xd39927 init_worklist > > gcc/tree-ssa-tail-merge.c:767 > > 0xd39927 tail_merge_optimize(unsigned int) > > gcc/tree-ssa-tail-merge.c:1726 > > 0xce2d6a execute > > gcc/tree-ssa-pre.c:5164 > > > > There's progress. Performance is improved but not as much as disabling code hoisting. 
I'll try to reduce the testcase again with that patch applied to see if I can find a testcase that exposes all the issues.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #13 from rguenther at suse dot de ---
On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
>
> --- Comment #12 from Thomas Preud'homme ---
> (In reply to Richard Biener from comment #9)
> > Ah, the patches do not fix the testcase because the testcase is _not_ the
> > PRE-creates-IV case.  It's indeed simply hoisting/PRE at work transforming
> >
> >   # a_14 = PHI <...>
> >   if (!b)
> >     a_8 = a_14 + 1;
> >
> >   # a_2 = PHI <...>
> >   a_10 = a_2 + 1;
> >   ... = *(a_2 + 1);
> >
> > to
> >
> >   # a_14 = PHI <...>
> >   _4 = a_14 + 1;
> >   if (b)
> >     _3 = _4 + 1;
> >
> >   # a_2 = PHI <...>
> >   # prephitmp_12 = PHI <_4, _3>
> >   ... = *(a_2 + 1);
> >
> > increasing register pressure mainly because nothing figures that a_2 + 1
> > in the dereference can be replaced by prephitmp_12 ...
> >
> > So this is a missed SLSR opportunity or, in this simple form, a missed
> > PRE/CSE opportunity.  Fixing that with the following (otherwise untested)
> > restores good code generation for the testcase:
> >
> > Index: gcc/tree-ssa-pre.c
> > ===================================================================
> > --- gcc/tree-ssa-pre.c  (revision 246414)
> > +++ gcc/tree-ssa-pre.c  (working copy)
> > @@ -4636,6 +4610,35 @@ eliminate_dom_walker::before_dom_childre
> >       }
> >   }
> >
> > +  if (gimple_has_ops (stmt))
> > +    for (unsigned i = 0; i < gimple_num_ops (stmt); ++i)
> > +      {
> > +	tree op = gimple_op (stmt, i);
> > +	if (op)
> > +	  op = get_base_address (op);
> > +	if (op
> > +	    && TREE_CODE (op) == MEM_REF
> > +	    && ! integer_zerop (TREE_OPERAND (op, 1)))
> > +	  {
> > +	    tree ops[2];
> > +	    vn_nary_op_t nary;
> > +	    ops[0] = TREE_OPERAND (op, 0);
> > +	    ops[1] = TREE_OPERAND (op, 1);
> > +	    tree res = vn_nary_op_lookup_pieces (2, POINTER_PLUS_EXPR,
> > +						 TREE_TYPE (ops[0]),
> > +						 ops, &nary);
> > +	    if (res && TREE_CODE (res) == SSA_NAME)
> > +	      res = eliminate_avail (res);
> > +	    if (res)
> > +	      {
> > +		TREE_OPERAND (op, 0) = res;
> > +		TREE_OPERAND (op, 1)
> > +		  = build_int_cst (TREE_TYPE (TREE_OPERAND (op, 1)), 0);
> > +		gimple_set_modified (stmt, true);

Add

       if (TREE_CODE (res) == SSA_NAME
           && !is_gimple_debug (stmt))
         gimple_set_plf (SSA_NAME_DEF_STMT (res),
                         NECESSARY, true);

here.

> > +	      }
> > +	  }
> > +      }
> > +
> >    if (gimple_modified_p (stmt))
> >      {
> >        /* If a formerly non-invariant ADDR_EXPR is turned into an
> >
> > note that in general optimizing
> >
> >   q = p + 1;
> >   ... = *(p + 1);
> >
> > "back" to *q will be undone by forwprop/stmt folding later but in this
> > case the feeding stmt is a PHI node and not a pointer-plus.  It still
> > means that the change might be a bit too disruptive at this point
> > (we could restrict it a bit by only handling the case where we don't
> > replace with a pointer-plus result).
>
> Thanks for your work on this! Sadly GCC ICEs with this patch:
>
> 0xd36f53 update_dep_bb
> 	gcc/tree-ssa-tail-merge.c:411
> 0xd38f54 stmt_update_dep_bb
> 	gcc/tree-ssa-tail-merge.c:429
> 0xd38f54 same_succ_hash
> 	gcc/tree-ssa-tail-merge.c:452
> 0xd38f54 find_same_succ_bb
> 	gcc/tree-ssa-tail-merge.c:717
> 0xd39927 find_same_succ
> 	gcc/tree-ssa-tail-merge.c:748
> 0xd39927 init_worklist
> 	gcc/tree-ssa-tail-merge.c:767
> 0xd39927 tail_merge_optimize(unsigned int)
> 	gcc/tree-ssa-tail-merge.c:1726
> 0xce2d6a execute
> 	gcc/tree-ssa-pre.c:5164
[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #12 from Thomas Preud'homme ---
(In reply to Richard Biener from comment #9)
> Ah, the patches do not fix the testcase because the testcase is _not_ the
> PRE-creates-IV case.  It's indeed simply hoisting/PRE at work transforming
>
>   # a_14 = PHI <...>
>   if (!b)
>     a_8 = a_14 + 1;
>
>   # a_2 = PHI <...>
>   a_10 = a_2 + 1;
>   ... = *(a_2 + 1);
>
> to
>
>   # a_14 = PHI <...>
>   _4 = a_14 + 1;
>   if (b)
>     _3 = _4 + 1;
>
>   # a_2 = PHI <...>
>   # prephitmp_12 = PHI <_4, _3>
>   ... = *(a_2 + 1);
>
> increasing register pressure mainly because nothing figures that a_2 + 1
> in the dereference can be replaced by prephitmp_12 ...
>
> So this is a missed SLSR opportunity or, in this simple form, a missed
> PRE/CSE opportunity.  Fixing that with the following (otherwise untested)
> restores good code generation for the testcase:
>
> Index: gcc/tree-ssa-pre.c
> ===================================================================
> --- gcc/tree-ssa-pre.c  (revision 246414)
> +++ gcc/tree-ssa-pre.c  (working copy)
> @@ -4636,6 +4610,35 @@ eliminate_dom_walker::before_dom_childre
>       }
>   }
>
> +  if (gimple_has_ops (stmt))
> +    for (unsigned i = 0; i < gimple_num_ops (stmt); ++i)
> +      {
> +	tree op = gimple_op (stmt, i);
> +	if (op)
> +	  op = get_base_address (op);
> +	if (op
> +	    && TREE_CODE (op) == MEM_REF
> +	    && ! integer_zerop (TREE_OPERAND (op, 1)))
> +	  {
> +	    tree ops[2];
> +	    vn_nary_op_t nary;
> +	    ops[0] = TREE_OPERAND (op, 0);
> +	    ops[1] = TREE_OPERAND (op, 1);
> +	    tree res = vn_nary_op_lookup_pieces (2, POINTER_PLUS_EXPR,
> +						 TREE_TYPE (ops[0]),
> +						 ops, &nary);
> +	    if (res && TREE_CODE (res) == SSA_NAME)
> +	      res = eliminate_avail (res);
> +	    if (res)
> +	      {
> +		TREE_OPERAND (op, 0) = res;
> +		TREE_OPERAND (op, 1)
> +		  = build_int_cst (TREE_TYPE (TREE_OPERAND (op, 1)), 0);
> +		gimple_set_modified (stmt, true);
> +	      }
> +	  }
> +      }
> +
>    if (gimple_modified_p (stmt))
>      {
>        /* If a formerly non-invariant ADDR_EXPR is turned into an
>
> note that in general optimizing
>
>   q = p + 1;
>   ... = *(p + 1);
>
> "back" to *q will be undone by forwprop/stmt folding later but in this
> case the feeding stmt is a PHI node and not a pointer-plus.  It still
> means that the change might be a bit too disruptive at this point
> (we could restrict it a bit by only handling the case where we don't
> replace with a pointer-plus result).

Thanks for your work on this! Sadly GCC ICEs with this patch:

0xd36f53 update_dep_bb
	gcc/tree-ssa-tail-merge.c:411
0xd38f54 stmt_update_dep_bb
	gcc/tree-ssa-tail-merge.c:429
0xd38f54 same_succ_hash
	gcc/tree-ssa-tail-merge.c:452
0xd38f54 find_same_succ_bb
	gcc/tree-ssa-tail-merge.c:717
0xd39927 find_same_succ
	gcc/tree-ssa-tail-merge.c:748
0xd39927 init_worklist
	gcc/tree-ssa-tail-merge.c:767
0xd39927 tail_merge_optimize(unsigned int)
	gcc/tree-ssa-tail-merge.c:1726
0xce2d6a execute
	gcc/tree-ssa-pre.c:5164
[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #11 from Richard Biener ---
Ok, the 2nd level opportunities are important.  So the 2nd candidate remains
for the loop carried dependencies issue(s).  Regresses (on x86_64):

Running target unix/{,-m32}
FAIL: gcc.dg/store-motion-fgcse-sm.c scan-rtl-dump store_motion "STORE_MOTION of f, .* basic blocks, 1 insns deleted, 1 insns created"
FAIL: gcc.dg/tree-ssa/loadpre10.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre23.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre24.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre25.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre4.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre6.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/pr21417.c scan-tree-dump-times thread4 "FSM jump thread" 1
FAIL: gcc.dg/tree-ssa/pr71347.c scan-tree-dump-not optimized ".* = MEM.*;"
FAIL: gcc.dg/tree-ssa/ssa-pre-23.c scan-tree-dump pre "Eliminated: 3"
FAIL: gcc.target/i386/pr49781-1.c scan-assembler-not lea[lq]?[ \\t]((%|)r[a-z0-9]*

Most of them are expected and looking for the feature we disabled.
loadpre6, pr21417.c and pr49781-1.c need a closer look.
[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #10 from Richard Biener ---
The asm difference of -fno-code-hoisting (-) against the patch (+) shows

--- t.s 2017-03-23 13:04:11.010514228 +0100
+++ t.s.ok 2017-03-23 13:04:05.882432743 +0100
@@ -26,9 +26,9 @@
 	ldrb	r3, [r4]	@ zero_extendqisi2
 	cbz	r3, .L2
 .L4:
-	bl	fn2
-	ldrb	r3, [r4, #1]	@ zero_extendqisi2
 	adds	r4, r4, #1
+	bl	fn2
+	ldrb	r3, [r4]	@ zero_extendqisi2
 	cmp	r3, #0
 	bne	.L4
 .L2:

which is probably better, as it only keeps r4 live over the call; but it adds
a dependence between the load from r4 and the add.
[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #9 from Richard Biener ---
Ah, the patches do not fix the testcase because the testcase is _not_ the
PRE-creates-IV case.  It's indeed simply hoisting/PRE at work transforming

  # a_14 = PHI <...>
  if (!b)
    a_8 = a_14 + 1;

  # a_2 = PHI <...>
  a_10 = a_2 + 1;
  ... = *(a_2 + 1);

to

  # a_14 = PHI <...>
  _4 = a_14 + 1;
  if (b)
    _3 = _4 + 1;

  # a_2 = PHI <...>
  # prephitmp_12 = PHI <_4, _3>
  ... = *(a_2 + 1);

increasing register pressure mainly because nothing figures that a_2 + 1
in the dereference can be replaced by prephitmp_12 ...

So this is a missed SLSR opportunity or, in this simple form, a missed
PRE/CSE opportunity.  Fixing that with the following (otherwise untested)
restores good code generation for the testcase:

Index: gcc/tree-ssa-pre.c
===================================================================
--- gcc/tree-ssa-pre.c  (revision 246414)
+++ gcc/tree-ssa-pre.c  (working copy)
@@ -4636,6 +4610,35 @@ eliminate_dom_walker::before_dom_childre
      }
  }

+  if (gimple_has_ops (stmt))
+    for (unsigned i = 0; i < gimple_num_ops (stmt); ++i)
+      {
+	tree op = gimple_op (stmt, i);
+	if (op)
+	  op = get_base_address (op);
+	if (op
+	    && TREE_CODE (op) == MEM_REF
+	    && ! integer_zerop (TREE_OPERAND (op, 1)))
+	  {
+	    tree ops[2];
+	    vn_nary_op_t nary;
+	    ops[0] = TREE_OPERAND (op, 0);
+	    ops[1] = TREE_OPERAND (op, 1);
+	    tree res = vn_nary_op_lookup_pieces (2, POINTER_PLUS_EXPR,
+						 TREE_TYPE (ops[0]),
+						 ops, &nary);
+	    if (res && TREE_CODE (res) == SSA_NAME)
+	      res = eliminate_avail (res);
+	    if (res)
+	      {
+		TREE_OPERAND (op, 0) = res;
+		TREE_OPERAND (op, 1)
+		  = build_int_cst (TREE_TYPE (TREE_OPERAND (op, 1)), 0);
+		gimple_set_modified (stmt, true);
+	      }
+	  }
+      }
+
   if (gimple_modified_p (stmt))
     {
       /* If a formerly non-invariant ADDR_EXPR is turned into an

Note that in general optimizing

  q = p + 1;
  ... = *(p + 1);

"back" to *q will be undone by forwprop/stmt folding later, but in this
case the feeding stmt is a PHI node and not a pointer-plus.  It still
means that the change might be a bit too disruptive at this point
(we could restrict it a bit by only handling the case where we don't
replace with a pointer-plus result).
[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #8 from Thomas Preud'homme ---
(In reply to rguent...@suse.de from comments #6 and #7)

> The testcase shows hardly profitable PRE which the patches should
> disable (I didn't verify the patches fix the testcase!)

The patch does not help for the testcase either (at least the Cortex-M7 one).

> If you're benchmarking at -O2 can you add -fpredictive-commoning?  As
> said, the patch removes important loop optimizations from PRE.

Sorry, it does not make any difference (either for the benchmark or the
testcase).
[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #7 from rguenther at suse dot de ---
On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
>
> --- Comment #5 from Thomas Preud'homme ---
> (In reply to Richard Biener from comment #4)
> > Can you benchmark the 2nd candidate (most appropriate at this stage I
> > think, would also fix 77498).
> >
> > It will cause pessimizations for code that benefits from predictive
> > commoning transforms that PRE can do at -O2 because predictive commoning
> > doesn't run at -O2.
> > The fix for that would be to run predictive commoning at -O2 but in a mode
> > that doesn't perform unrolling for example.  [even pcom can blow through
> > register pressure limits though]
> >
> > Any other approach to fixing this particular bug cannot be done for GCC 7
> > (even this one is quite late ...).  So in the end I'd vote for WONTFIX for
> > GCC 7.
>
> Sadly it doesn't.  Well, not quite true: there is a very slight improvement,
> but we're far from the uplift -fno-code-hoisting provides.
>
> There's maybe more than one bug at play and the reduced testcase only shows
> one.

If you're benchmarking at -O2 can you add -fpredictive-commoning?  As said,
the patch removes important loop optimizations from PRE.
[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #6 from rguenther at suse dot de ---
On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
>
> --- Comment #5 from Thomas Preud'homme ---
> (In reply to Richard Biener from comment #4)
> > Can you benchmark the 2nd candidate (most appropriate at this stage I
> > think, would also fix 77498).
> >
> > It will cause pessimizations for code that benefits from predictive
> > commoning transforms that PRE can do at -O2 because predictive commoning
> > doesn't run at -O2.
> > The fix for that would be to run predictive commoning at -O2 but in a mode
> > that doesn't perform unrolling for example.  [even pcom can blow through
> > register pressure limits though]
> >
> > Any other approach to fixing this particular bug cannot be done for GCC 7
> > (even this one is quite late ...).  So in the end I'd vote for WONTFIX for
> > GCC 7.
>
> Sadly it doesn't.  Well, not quite true: there is a very slight improvement,
> but we're far from the uplift -fno-code-hoisting provides.
>
> There's maybe more than one bug at play and the reduced testcase only shows
> one.

The testcase shows hardly profitable PRE, which the patches should disable
(I didn't verify the patches fix the testcase!).

Code hoisting (as well as CSE) can increase register pressure, and neither
transform is register-pressure aware (and the patch doesn't try to mitigate
that).
[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #5 from Thomas Preud'homme ---
(In reply to Richard Biener from comment #4)
> Can you benchmark the 2nd candidate (most appropriate at this stage I think,
> would also fix 77498).
>
> It will cause pessimizations for code that benefits from predictive
> commoning transforms that PRE can do at -O2 because predictive commoning
> doesn't run at -O2.
> The fix for that would be to run predictive commoning at -O2 but in a mode
> that doesn't perform unrolling for example.  [even pcom can blow through
> register pressure limits though]
>
> Any other approach to fixing this particular bug cannot be done for GCC 7
> (even this one is quite late ...).  So in the end I'd vote for WONTFIX for
> GCC 7.

Sadly it doesn't.  Well, not quite true: there is a very slight improvement,
but we're far from the uplift -fno-code-hoisting provides.

There's maybe more than one bug at play, and the reduced testcase only shows
one.
[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

Richard Biener changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener ---
Can you benchmark the 2nd candidate?  (Most appropriate at this stage I
think; it would also fix 77498.)

It will cause pessimizations for code that benefits from predictive commoning
transforms that PRE can do at -O2, because predictive commoning doesn't run
at -O2.  The fix for that would be to run predictive commoning at -O2 but in
a mode that doesn't perform unrolling, for example.  [Even pcom can blow
through register pressure limits though.]

Any other approach to fixing this particular bug cannot be done for GCC 7
(even this one is quite late ...).  So in the end I'd vote for WONTFIX for
GCC 7.
[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #3 from Richard Biener ---
Created attachment 41031
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41031&action=edit
second patch candidate

Second candidate: instead of inhibiting PHI insertions, this limits use of
them during elimination.  This should allow existing 2nd level opportunities
to continue to be handled.
[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #2 from Richard Biener ---
Created attachment 41030
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41030&action=edit
first patch candidate

First variant of an aggressive approach: never add IVs / loop carried
dependencies in PRE.  This first approach is leaner on compile-time /
memory-use in PRE but eventually misses 2nd level optimizations by not
performing PHI insertions on loop headers.
[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

Richard Biener changed:

           What    |Removed     |Added
--------------------------------------------
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2017-03-23
   Target Milestone|---         |7.0
     Ever confirmed|0           |1

--- Comment #1 from Richard Biener ---
Code hoisting triggers a latent pessimization PRE performs.  Transform:

  [85.00%]:
  # a_14 = PHI <...>
  if (b_7(D) != 0)
    goto <bb ...>; [50.00%]
  else
    goto <bb ...>; [50.00%]

  [42.50%]:
  goto <bb ...>; [100.00%]

  [42.50%]:
  a_8 = a_14 + 1;

  [85.00%]:
  # a_2 = PHI <...>
  fn2 ();
  a_10 = a_2 + 1;

to

  [85.00%]:
  # a_14 = PHI <...>
  _4 = a_14 + 1;
  if (b_7(D) != 0)
    goto <bb ...>; [50.00%]
  else
    goto <bb ...>; [50.00%]

  [42.50%]:
  _3 = _4 + 1;

  [85.00%]:
  # a_2 = PHI <...>
  # prephitmp_12 = PHI <_4(3), _3(4)>
  fn2 ();

That's because the hoisting (which itself isn't a problem) makes a_2 + 1
partially redundant over the latch.  We see this issue in related testcases
where PRE can compute a constant for the first iteration value of expressions
and thus inserts IVs for them.  So it's nothing new, and a fix would
hopefully fix those cases as well.