[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-04-07 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #28 from Thomas Preud'homme  ---
(In reply to rguent...@suse.de from comment #27)
> On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> > 
> > --- Comment #26 from Thomas Preud'homme  ---
> > Yes true.
> > 
> > By the way, your first patch did improve things a bit. If it's not too 
> > invasive
> > could it be possible to commit it for GCC 7?
> 
> It's not really in the shape for being committable.  I'll see if I can
> come up with sth better early in stage1 that might be back-portable.

Thanks.

> 
> > I think I'll try the approach you suggested in
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498#c10 and see if I get some
> > better results.
> 
> For this case distance is minimal already so I don't expect you can
> do anything (there's also not really separate dataflow for hoisting).

If I remember well the key in the benchmark was that the basic block where the
code was hoisted from could be reached by a much longer path. So although it
was hoisted to a direct predecessor, sometime that value had to be kept for a
very long time.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-04-07 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #27 from rguenther at suse dot de  ---
On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> 
> --- Comment #26 from Thomas Preud'homme  ---
> Yes true.
> 
> By the way, your first patch did improve things a bit. If it's not too 
> invasive
> could it be possible to commit it for GCC 7?

It's not really in the shape for being committable.  I'll see if I can
come up with sth better early in stage1 that might be back-portable.

> I think I'll try the approach you suggested in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498#c10 and see if I get some
> better results.

For this case distance is minimal already so I don't expect you can
do anything (there's also not really separate dataflow for hoisting).

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-04-07 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #26 from Thomas Preud'homme  ---
(In reply to rguent...@suse.de from comment #25)
> On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> > 
> > --- Comment #24 from Thomas Preud'homme  ---
> > (In reply to rguent...@suse.de from comment #23)
> > > On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> > > > 
> > > > --- Comment #22 from Thomas Preud'homme  
> > > > ---
> > > > (In reply to Thomas Preud'homme from comment #21)
> > > > > 
> > > > > I can see this behavior for Cortex-M0+ indeed but the results are 
> > > > > different
> > > > > for Cortex-M7 for me:
> > > > > 
> > > > > 
> > > > > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> > > > > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited 
> > > > > -fdump-tree-all -funroll-all-loops
> > > > > 
> > > > > % grep push cortex-m0plus_reproducer.s
> > > > > push{r4, r5}
> > > > > 
> > > > > 
> > > > > 
> > > > > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> > > > > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited 
> > > > > -fdump-tree-all -funroll-all-loops -fno-code-hoisting
> > > > > 
> > > > > % grep push cortex-m0plus_reproducer.s
> > > > > push{r4}
> > > > > 
> > > > > Would you mind checking again?
> > > > 
> > > > btw, this toolchain was built on 2017-04-04 in case that's the issue.
> > > 
> > > Ok, I can reproduce it now.  With the proposed patch to fix the other
> > > case it regresses with -fno-code-hoisting though (comment #9 plus fix).
> > > 
> > > Note that code-hosting is simply doing it's job here, hoisting common
> > > code out of a switch.
> > 
> > I understand that. Code is smaller so it's definitely worthwhile at -Os. 
> > But it
> > sometimes lead to worse performance due to more spilling.
> 
> But there's no good way to estimate that.  We have tons of transforms
> that have the issue of being "sometimes worse"

Yes true.

By the way, your first patch did improve things a bit. If it's not too invasive
could it be possible to commit it for GCC 7?

I think I'll try the approach you suggested in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498#c10 and see if I get some
better results.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-04-07 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #25 from rguenther at suse dot de  ---
On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> 
> --- Comment #24 from Thomas Preud'homme  ---
> (In reply to rguent...@suse.de from comment #23)
> > On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> > > 
> > > --- Comment #22 from Thomas Preud'homme  ---
> > > (In reply to Thomas Preud'homme from comment #21)
> > > > 
> > > > I can see this behavior for Cortex-M0+ indeed but the results are 
> > > > different
> > > > for Cortex-M7 for me:
> > > > 
> > > > 
> > > > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> > > > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited 
> > > > -fdump-tree-all -funroll-all-loops
> > > > 
> > > > % grep push cortex-m0plus_reproducer.s
> > > > push{r4, r5}
> > > > 
> > > > 
> > > > 
> > > > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> > > > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited 
> > > > -fdump-tree-all -funroll-all-loops -fno-code-hoisting
> > > > 
> > > > % grep push cortex-m0plus_reproducer.s
> > > > push{r4}
> > > > 
> > > > Would you mind checking again?
> > > 
> > > btw, this toolchain was built on 2017-04-04 in case that's the issue.
> > 
> > Ok, I can reproduce it now.  With the proposed patch to fix the other
> > case it regresses with -fno-code-hoisting though (comment #9 plus fix).
> > 
> > Note that code-hosting is simply doing it's job here, hoisting common
> > code out of a switch.
> 
> I understand that. Code is smaller so it's definitely worthwhile at -Os. But 
> it
> sometimes lead to worse performance due to more spilling.

But there's no good way to estimate that.  We have tons of transforms
that have the issue of being "sometimes worse"

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-04-07 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #24 from Thomas Preud'homme  ---
(In reply to rguent...@suse.de from comment #23)
> On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> > 
> > --- Comment #22 from Thomas Preud'homme  ---
> > (In reply to Thomas Preud'homme from comment #21)
> > > 
> > > I can see this behavior for Cortex-M0+ indeed but the results are 
> > > different
> > > for Cortex-M7 for me:
> > > 
> > > 
> > > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> > > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited 
> > > -fdump-tree-all -funroll-all-loops
> > > 
> > > % grep push cortex-m0plus_reproducer.s
> > > push{r4, r5}
> > > 
> > > 
> > > 
> > > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> > > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited 
> > > -fdump-tree-all -funroll-all-loops -fno-code-hoisting
> > > 
> > > % grep push cortex-m0plus_reproducer.s
> > > push{r4}
> > > 
> > > Would you mind checking again?
> > 
> > btw, this toolchain was built on 2017-04-04 in case that's the issue.
> 
> Ok, I can reproduce it now.  With the proposed patch to fix the other
> case it regresses with -fno-code-hoisting though (comment #9 plus fix).
> 
> Note that code-hosting is simply doing it's job here, hoisting common
> code out of a switch.

I understand that. Code is smaller so it's definitely worthwhile at -Os. But it
sometimes lead to worse performance due to more spilling.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-04-07 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #23 from rguenther at suse dot de  ---
On Fri, 7 Apr 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> 
> --- Comment #22 from Thomas Preud'homme  ---
> (In reply to Thomas Preud'homme from comment #21)
> > 
> > I can see this behavior for Cortex-M0+ indeed but the results are different
> > for Cortex-M7 for me:
> > 
> > 
> > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited 
> > -fdump-tree-all -funroll-all-loops
> > 
> > % grep push cortex-m0plus_reproducer.s
> > push{r4, r5}
> > 
> > 
> > 
> > % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> > -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited 
> > -fdump-tree-all -funroll-all-loops -fno-code-hoisting
> > 
> > % grep push cortex-m0plus_reproducer.s
> > push{r4}
> > 
> > Would you mind checking again?
> 
> btw, this toolchain was built on 2017-04-04 in case that's the issue.

Ok, I can reproduce it now.  With the proposed patch to fix the other
case it regresses with -fno-code-hoisting though (comment #9 plus fix).

Note that code-hosting is simply doing it's job here, hoisting common
code out of a switch.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-04-07 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #22 from Thomas Preud'homme  ---
(In reply to Thomas Preud'homme from comment #21)
> 
> I can see this behavior for Cortex-M0+ indeed but the results are different
> for Cortex-M7 for me:
> 
> 
> % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited 
> -fdump-tree-all -funroll-all-loops
> 
> % grep push cortex-m0plus_reproducer.s
> push{r4, r5}
> 
> 
> 
> % arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
> -mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited 
> -fdump-tree-all -funroll-all-loops -fno-code-hoisting
> 
> % grep push cortex-m0plus_reproducer.s
> push{r4}
> 
> Would you mind checking again?

btw, this toolchain was built on 2017-04-04 in case that's the issue.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-04-07 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #21 from Thomas Preud'homme  ---
(In reply to Richard Biener from comment #20)
> (In reply to Thomas Preud'homme from comment #19)
> > (In reply to Richard Biener from comment #18)
> > > (In reply to Thomas Preud'homme from comment #17)
> > > > (In reply to rguent...@suse.de from comment #16)
> > > > > On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:
> > > > > > 
> > > > > > Funnily this led back to the Cortex-M0+ reduced testcase. With the 
> > > > > > patch in
> > > > > > comment #13 applied we can still see a difference in the push (one 
> > > > > > register
> > > > > > pushed Vs 0).
> > > > > 
> > > > > I can't reproduce zero pushes here I get three with/without 
> > > > > -fno-code-hoisting.  code hoisting hoists the two loads inside
> > > > > the switch before it so we have
> > > > 
> > > > Ooops my apologize, it needs more flags indeed. -O2 -funroll-all-loops 
> > > > shows
> > > > 2 registers pushed Vs 1 when -fno-code-hoisting is added.
> > > 
> > > Still can't reproduce.  I've configured with
> > > 
> > > /space/rguenther/src/svn/gcc-7-branch/configure
> > > --target=arm-suse-linux-gnueabi --disable-libstdcxx-pch
> > > --enable-languages=c,c++
> > > 
> > > and am using
> > > 
> > > ./cc1 -quiet cortex-m0plus_reproducer.c -O2 -mcpu=cortex-m0plus -mthumb -I
> > > include  -fdump-tree-pre-details -fdump-tree-crited  -fdump-tree-all
> > > -funroll-all-loops -fno-code-hoisting
> > 
> > Oh my bad, I was still not clear enough. I meant using the Cortex-M0+
> > testcase but build it for Cortex-M7. Hopefully with -mcpu=cortex-m7 you
> > should see a difference between with and without code hoisting.
> 
> I do.  Two pushes with code hoisting and three without.  Changed -mcpu above
> to -mcpu=cortex-m7.

I can see this behavior for Cortex-M0+ indeed but the results are different for
Cortex-M7 for me:


% arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
-mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited 
-fdump-tree-all -funroll-all-loops

% grep push cortex-m0plus_reproducer.s
push{r4, r5}



% arm-none-linux-gnueabi-gcc -S ~/cortex-m0plus_reproducer.c -O2
-mcpu=cortex-m7 -mthumb -fdump-tree-pre-details -fdump-tree-crited 
-fdump-tree-all -funroll-all-loops -fno-code-hoisting

% grep push cortex-m0plus_reproducer.s
push{r4}

Would you mind checking again?

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-31 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

Jeffrey A. Law  changed:

   What|Removed |Added

   Priority|P3  |P2
 CC||law at redhat dot com

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-28 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #20 from Richard Biener  ---
(In reply to Thomas Preud'homme from comment #19)
> (In reply to Richard Biener from comment #18)
> > (In reply to Thomas Preud'homme from comment #17)
> > > (In reply to rguent...@suse.de from comment #16)
> > > > On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:
> > > > > 
> > > > > Funnily this led back to the Cortex-M0+ reduced testcase. With the 
> > > > > patch in
> > > > > comment #13 applied we can still see a difference in the push (one 
> > > > > register
> > > > > pushed Vs 0).
> > > > 
> > > > I can't reproduce zero pushes here I get three with/without 
> > > > -fno-code-hoisting.  code hoisting hoists the two loads inside
> > > > the switch before it so we have
> > > 
> > > Ooops my apologize, it needs more flags indeed. -O2 -funroll-all-loops 
> > > shows
> > > 2 registers pushed Vs 1 when -fno-code-hoisting is added.
> > 
> > Still can't reproduce.  I've configured with
> > 
> > /space/rguenther/src/svn/gcc-7-branch/configure
> > --target=arm-suse-linux-gnueabi --disable-libstdcxx-pch
> > --enable-languages=c,c++
> > 
> > and am using
> > 
> > ./cc1 -quiet cortex-m0plus_reproducer.c -O2 -mcpu=cortex-m0plus -mthumb -I
> > include  -fdump-tree-pre-details -fdump-tree-crited  -fdump-tree-all
> > -funroll-all-loops -fno-code-hoisting
> 
> Oh my bad, I was still not clear enough. I meant using the Cortex-M0+
> testcase but build it for Cortex-M7. Hopefully with -mcpu=cortex-m7 you
> should see a difference between with and without code hoisting.

I do.  Two pushes with code hoisting and three without.  Changed -mcpu above
to -mcpu=cortex-m7.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-28 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #19 from Thomas Preud'homme  ---
(In reply to Richard Biener from comment #18)
> (In reply to Thomas Preud'homme from comment #17)
> > (In reply to rguent...@suse.de from comment #16)
> > > On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:
> > > > 
> > > > Funnily this led back to the Cortex-M0+ reduced testcase. With the 
> > > > patch in
> > > > comment #13 applied we can still see a difference in the push (one 
> > > > register
> > > > pushed Vs 0).
> > > 
> > > I can't reproduce zero pushes here I get three with/without 
> > > -fno-code-hoisting.  code hoisting hoists the two loads inside
> > > the switch before it so we have
> > 
> > Ooops my apologize, it needs more flags indeed. -O2 -funroll-all-loops shows
> > 2 registers pushed Vs 1 when -fno-code-hoisting is added.
> 
> Still can't reproduce.  I've configured with
> 
> /space/rguenther/src/svn/gcc-7-branch/configure
> --target=arm-suse-linux-gnueabi --disable-libstdcxx-pch
> --enable-languages=c,c++
> 
> and am using
> 
> ./cc1 -quiet cortex-m0plus_reproducer.c -O2 -mcpu=cortex-m0plus -mthumb -I
> include  -fdump-tree-pre-details -fdump-tree-crited  -fdump-tree-all
> -funroll-all-loops -fno-code-hoisting

Oh my bad, I was still not clear enough. I meant using the Cortex-M0+ testcase
but build it for Cortex-M7. Hopefully with -mcpu=cortex-m7 you should see a
difference between with and without code hoisting.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-28 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #18 from Richard Biener  ---
(In reply to Thomas Preud'homme from comment #17)
> (In reply to rguent...@suse.de from comment #16)
> > On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:
> > > 
> > > Funnily this led back to the Cortex-M0+ reduced testcase. With the patch 
> > > in
> > > comment #13 applied we can still see a difference in the push (one 
> > > register
> > > pushed Vs 0).
> > 
> > I can't reproduce zero pushes here I get three with/without 
> > -fno-code-hoisting.  code hoisting hoists the two loads inside
> > the switch before it so we have
> 
> Ooops my apologize, it needs more flags indeed. -O2 -funroll-all-loops shows
> 2 registers pushed Vs 1 when -fno-code-hoisting is added.

Still can't reproduce.  I've configured with

/space/rguenther/src/svn/gcc-7-branch/configure --target=arm-suse-linux-gnueabi
--disable-libstdcxx-pch --enable-languages=c,c++

and am using

./cc1 -quiet cortex-m0plus_reproducer.c -O2 -mcpu=cortex-m0plus -mthumb -I
include  -fdump-tree-pre-details -fdump-tree-crited  -fdump-tree-all
-funroll-all-loops -fno-code-hoisting

seeing 3 pushes:

fn1:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
movsr3, #0
push{r4, r5, lr}

I see softfp being used, not sure how I could change that (if that's what I'm
missing).  Btw, -funroll-all-loops disqualifies this testcase ;)

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-24 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #17 from Thomas Preud'homme  ---
(In reply to rguent...@suse.de from comment #16)
> On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:
> > 
> > Funnily this led back to the Cortex-M0+ reduced testcase. With the patch in
> > comment #13 applied we can still see a difference in the push (one register
> > pushed Vs 0).
> 
> I can't reproduce zero pushes here I get three with/without 
> -fno-code-hoisting.  code hoisting hoists the two loads inside
> the switch before it so we have

Ooops my apologize, it needs more flags indeed. -O2 -funroll-all-loops shows 2
registers pushed Vs 1 when -fno-code-hoisting is added.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-24 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #16 from rguenther at suse dot de  ---
On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> 
> --- Comment #15 from Thomas Preud'homme  ---
> (In reply to Thomas Preud'homme from comment #14)
> > (In reply to rguent...@suse.de from comment #13)
> > > On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> > > > 
> > > > --- Comment #12 from Thomas Preud'homme  
> > > > ---
> > > > (In reply to Richard Biener from comment #9)
> > > > > Ah, the patches do not fix the testcase because the testcase is _not_ 
> > > > > the
> > > > > PRE-creates-IV case.  It's indeed simply hoisting/PRE at work 
> > > > > transforming
> > > > > 
> > > > >   # a_14 = PHI 
> > > > >   if (!b)
> > > > > a_8 = a_14 + 1;
> > > > > 
> > > > >   # a_2 = PHI 
> > > > >   a_10 = a_2 + 1;
> > > > >   ... = *(a_2 + 1);
> > > > > 
> > > > > to
> > > > > 
> > > > >   # a_14 = PHI 
> > > > >   _4 = a_14 + 1;
> > > > >   if (b)
> > > > > _3 = _4 + 1;
> > > > > 
> > > > >   # a_2 = PHI 
> > > > >   # prephitmp_12 = PHI <_4, _3>
> > > > >   ... = *(a_2 + 1);
> > > > > 
> > > > > increasing register pressure mainly because nothing figures that a_2 
> > > > > + 1
> > > > > in the dereference can be replaced by prephitmp_12 ...
> > > > > 
> > > > > So this is a missed SLSR opportunity or, in this simple form, a missed
> > > > > PRE/CSE opportunity.  Fixing that with the following (otherwise 
> > > > > untested)
> > > > > restores good code generation for the testcase:
> > > > > 
> > > > > Index: gcc/tree-ssa-pre.c
> > > > > ===
> > > > > --- gcc/tree-ssa-pre.c  (revision 246414)
> > > > > +++ gcc/tree-ssa-pre.c  (working copy)
> > > > > @@ -4636,6 +4610,35 @@ eliminate_dom_walker::before_dom_childre
> > > > > }
> > > > > }
> > > > >  
> > > > > +  if (gimple_has_ops (stmt))
> > > > > +   for (unsigned i = 0; i < gimple_num_ops (stmt); ++i)
> > > > > + {
> > > > > +   tree op = gimple_op (stmt, i);
> > > > > +   if (op)
> > > > > + op = get_base_address (op);
> > > > > +   if (op
> > > > > +   && TREE_CODE (op) == MEM_REF
> > > > > +   && ! integer_zerop (TREE_OPERAND (op, 1)))
> > > > > + {
> > > > > +   tree ops[2];
> > > > > +   vn_nary_op_t nary;
> > > > > +   ops[0] = TREE_OPERAND (op, 0);
> > > > > +   ops[1] = TREE_OPERAND (op, 1);
> > > > > +   tree res = vn_nary_op_lookup_pieces (2, 
> > > > > POINTER_PLUS_EXPR,
> > > > > +TREE_TYPE 
> > > > > (ops[0]),
> > > > > +ops, );
> > > > > +   if (res && TREE_CODE (res) == SSA_NAME)
> > > > > + res = eliminate_avail (res);
> > > > > +   if (res)
> > > > > + {
> > > > > +   TREE_OPERAND (op, 0) = res;
> > > > > +   TREE_OPERAND (op, 1)
> > > > > + = build_int_cst (TREE_TYPE (TREE_OPERAND (op, 
> > > > > 1)), 0);
> > > > > +   gimple_set_modified (stmt, true);
> > > 
> > > Add
> > > 
> > > if (TREE_CODE (res) == SSA_NAME
> > > && !is_gimple_debug (stmt))
> > >   gimple_set_plf (SSA_NAME_DEF_STMT (res),
> > >   NECESSARY, true);
> > > 
> > > here.
> > > 
> > > > > + }
> > > > > + }
> > > > > + }
> > > > > +
> > > > >if (gimple_modified_p (stmt))
> > > > > {
> > > > >   /* If a formerly non-invariant ADDR_EXPR is turned into an
> > > > > 
> > > > > note that in general optimzing
> > > > > 
> > > > >q = p + 1;
> > > > >  = ...*(p + 1);
> > > > > 
> > > > > "back" to *q will be undone by forwprop/stmt folding later but in this
> > > > > case the feeding stmt is a PHI node and not a pointer-plus.  It still
> > > > > means that the change might be a bit too disruptive at this point
> > > > > (we could restricit it a bit by only handling the case where we don't
> > > > > replace with a pointer-plus result).
> > > > 
> > > > Thanks for your work on this! Sadly GCC ICEs with this patch:
> > > > 
> > > > 0xd36f53 update_dep_bb
> > > > gcc/tree-ssa-tail-merge.c:411
> > > > 0xd38f54 stmt_update_dep_bb
> > > > gcc/tree-ssa-tail-merge.c:429
> > > > 0xd38f54 same_succ_hash
> > > > gcc/tree-ssa-tail-merge.c:452
> > > > 0xd38f54 find_same_succ_bb
> > > > gcc/tree-ssa-tail-merge.c:717
> > > > 0xd39927 find_same_succ
> > > > gcc/tree-ssa-tail-merge.c:748
> > > > 0xd39927 init_worklist
> > > > 

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #15 from Thomas Preud'homme  ---
(In reply to Thomas Preud'homme from comment #14)
> (In reply to rguent...@suse.de from comment #13)
> > On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> > > 
> > > --- Comment #12 from Thomas Preud'homme  ---
> > > (In reply to Richard Biener from comment #9)
> > > > Ah, the patches do not fix the testcase because the testcase is _not_ 
> > > > the
> > > > PRE-creates-IV case.  It's indeed simply hoisting/PRE at work 
> > > > transforming
> > > > 
> > > >   # a_14 = PHI 
> > > >   if (!b)
> > > > a_8 = a_14 + 1;
> > > > 
> > > >   # a_2 = PHI 
> > > >   a_10 = a_2 + 1;
> > > >   ... = *(a_2 + 1);
> > > > 
> > > > to
> > > > 
> > > >   # a_14 = PHI 
> > > >   _4 = a_14 + 1;
> > > >   if (b)
> > > > _3 = _4 + 1;
> > > > 
> > > >   # a_2 = PHI 
> > > >   # prephitmp_12 = PHI <_4, _3>
> > > >   ... = *(a_2 + 1);
> > > > 
> > > > increasing register pressure mainly because nothing figures that a_2 + 1
> > > > in the dereference can be replaced by prephitmp_12 ...
> > > > 
> > > > So this is a missed SLSR opportunity or, in this simple form, a missed
> > > > PRE/CSE opportunity.  Fixing that with the following (otherwise 
> > > > untested)
> > > > restores good code generation for the testcase:
> > > > 
> > > > Index: gcc/tree-ssa-pre.c
> > > > ===
> > > > --- gcc/tree-ssa-pre.c  (revision 246414)
> > > > +++ gcc/tree-ssa-pre.c  (working copy)
> > > > @@ -4636,6 +4610,35 @@ eliminate_dom_walker::before_dom_childre
> > > > }
> > > > }
> > > >  
> > > > +  if (gimple_has_ops (stmt))
> > > > +   for (unsigned i = 0; i < gimple_num_ops (stmt); ++i)
> > > > + {
> > > > +   tree op = gimple_op (stmt, i);
> > > > +   if (op)
> > > > + op = get_base_address (op);
> > > > +   if (op
> > > > +   && TREE_CODE (op) == MEM_REF
> > > > +   && ! integer_zerop (TREE_OPERAND (op, 1)))
> > > > + {
> > > > +   tree ops[2];
> > > > +   vn_nary_op_t nary;
> > > > +   ops[0] = TREE_OPERAND (op, 0);
> > > > +   ops[1] = TREE_OPERAND (op, 1);
> > > > +   tree res = vn_nary_op_lookup_pieces (2, 
> > > > POINTER_PLUS_EXPR,
> > > > +TREE_TYPE (ops[0]),
> > > > +ops, );
> > > > +   if (res && TREE_CODE (res) == SSA_NAME)
> > > > + res = eliminate_avail (res);
> > > > +   if (res)
> > > > + {
> > > > +   TREE_OPERAND (op, 0) = res;
> > > > +   TREE_OPERAND (op, 1)
> > > > + = build_int_cst (TREE_TYPE (TREE_OPERAND (op, 
> > > > 1)), 0);
> > > > +   gimple_set_modified (stmt, true);
> > 
> > Add
> > 
> > if (TREE_CODE (res) == SSA_NAME
> > && !is_gimple_debug (stmt))
> >   gimple_set_plf (SSA_NAME_DEF_STMT (res),
> >   NECESSARY, true);
> > 
> > here.
> > 
> > > > + }
> > > > + }
> > > > + }
> > > > +
> > > >if (gimple_modified_p (stmt))
> > > > {
> > > >   /* If a formerly non-invariant ADDR_EXPR is turned into an
> > > > 
> > > > note that in general optimzing
> > > > 
> > > >q = p + 1;
> > > >  = ...*(p + 1);
> > > > 
> > > > "back" to *q will be undone by forwprop/stmt folding later but in this
> > > > case the feeding stmt is a PHI node and not a pointer-plus.  It still
> > > > means that the change might be a bit too disruptive at this point
> > > > (we could restricit it a bit by only handling the case where we don't
> > > > replace with a pointer-plus result).
> > > 
> > > Thanks for your work on this! Sadly GCC ICEs with this patch:
> > > 
> > > 0xd36f53 update_dep_bb
> > > gcc/tree-ssa-tail-merge.c:411
> > > 0xd38f54 stmt_update_dep_bb
> > > gcc/tree-ssa-tail-merge.c:429
> > > 0xd38f54 same_succ_hash
> > > gcc/tree-ssa-tail-merge.c:452
> > > 0xd38f54 find_same_succ_bb
> > > gcc/tree-ssa-tail-merge.c:717
> > > 0xd39927 find_same_succ
> > > gcc/tree-ssa-tail-merge.c:748
> > > 0xd39927 init_worklist
> > > gcc/tree-ssa-tail-merge.c:767
> > > 0xd39927 tail_merge_optimize(unsigned int)
> > > gcc/tree-ssa-tail-merge.c:1726
> > > 0xce2d6a execute
> > > gcc/tree-ssa-pre.c:5164
> > > 
> > >
> 
> There's progress. Performance is improved but not as much as disabling code
> hoisting. I'll try to reduce the testcase again with that patch to see if I
> can find a testcase that expose all issues.

Funnily this led back to the 

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #14 from Thomas Preud'homme  ---
(In reply to rguent...@suse.de from comment #13)
> On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> > 
> > --- Comment #12 from Thomas Preud'homme  ---
> > (In reply to Richard Biener from comment #9)
> > > Ah, the patches do not fix the testcase because the testcase is _not_ the
> > > PRE-creates-IV case.  It's indeed simply hoisting/PRE at work transforming
> > > 
> > >   # a_14 = PHI 
> > >   if (!b)
> > > a_8 = a_14 + 1;
> > > 
> > >   # a_2 = PHI 
> > >   a_10 = a_2 + 1;
> > >   ... = *(a_2 + 1);
> > > 
> > > to
> > > 
> > >   # a_14 = PHI 
> > >   _4 = a_14 + 1;
> > >   if (b)
> > > _3 = _4 + 1;
> > > 
> > >   # a_2 = PHI 
> > >   # prephitmp_12 = PHI <_4, _3>
> > >   ... = *(a_2 + 1);
> > > 
> > > increasing register pressure mainly because nothing figures that a_2 + 1
> > > in the dereference can be replaced by prephitmp_12 ...
> > > 
> > > So this is a missed SLSR opportunity or, in this simple form, a missed
> > > PRE/CSE opportunity.  Fixing that with the following (otherwise untested)
> > > restores good code generation for the testcase:
> > > 
> > > Index: gcc/tree-ssa-pre.c
> > > ===
> > > --- gcc/tree-ssa-pre.c  (revision 246414)
> > > +++ gcc/tree-ssa-pre.c  (working copy)
> > > @@ -4636,6 +4610,35 @@ eliminate_dom_walker::before_dom_childre
> > > }
> > > }
> > >  
> > > +  if (gimple_has_ops (stmt))
> > > +   for (unsigned i = 0; i < gimple_num_ops (stmt); ++i)
> > > + {
> > > +   tree op = gimple_op (stmt, i);
> > > +   if (op)
> > > + op = get_base_address (op);
> > > +   if (op
> > > +   && TREE_CODE (op) == MEM_REF
> > > +   && ! integer_zerop (TREE_OPERAND (op, 1)))
> > > + {
> > > +   tree ops[2];
> > > +   vn_nary_op_t nary;
> > > +   ops[0] = TREE_OPERAND (op, 0);
> > > +   ops[1] = TREE_OPERAND (op, 1);
> > > +   tree res = vn_nary_op_lookup_pieces (2, POINTER_PLUS_EXPR,
> > > +TREE_TYPE (ops[0]),
> > > +ops, );
> > > +   if (res && TREE_CODE (res) == SSA_NAME)
> > > + res = eliminate_avail (res);
> > > +   if (res)
> > > + {
> > > +   TREE_OPERAND (op, 0) = res;
> > > +   TREE_OPERAND (op, 1)
> > > + = build_int_cst (TREE_TYPE (TREE_OPERAND (op, 1)), 
> > > 0);
> > > +   gimple_set_modified (stmt, true);
> 
> Add
> 
> if (TREE_CODE (res) == SSA_NAME
> && !is_gimple_debug (stmt))
>   gimple_set_plf (SSA_NAME_DEF_STMT (res),
>   NECESSARY, true);
> 
> here.
> 
> > > + }
> > > + }
> > > + }
> > > +
> > >if (gimple_modified_p (stmt))
> > > {
> > >   /* If a formerly non-invariant ADDR_EXPR is turned into an
> > > 
> > > note that in general optimzing
> > > 
> > >q = p + 1;
> > >  = ...*(p + 1);
> > > 
> > > "back" to *q will be undone by forwprop/stmt folding later but in this
> > > case the feeding stmt is a PHI node and not a pointer-plus.  It still
> > > means that the change might be a bit too disruptive at this point
> > > (we could restricit it a bit by only handling the case where we don't
> > > replace with a pointer-plus result).
> > 
> > Thanks for your work on this! Sadly GCC ICEs with this patch:
> > 
> > 0xd36f53 update_dep_bb
> > gcc/tree-ssa-tail-merge.c:411
> > 0xd38f54 stmt_update_dep_bb
> > gcc/tree-ssa-tail-merge.c:429
> > 0xd38f54 same_succ_hash
> > gcc/tree-ssa-tail-merge.c:452
> > 0xd38f54 find_same_succ_bb
> > gcc/tree-ssa-tail-merge.c:717
> > 0xd39927 find_same_succ
> > gcc/tree-ssa-tail-merge.c:748
> > 0xd39927 init_worklist
> > gcc/tree-ssa-tail-merge.c:767
> > 0xd39927 tail_merge_optimize(unsigned int)
> > gcc/tree-ssa-tail-merge.c:1726
> > 0xce2d6a execute
> > gcc/tree-ssa-pre.c:5164
> > 
> >

There's progress. Performance is improved but not as much as disabling code
hoisting. I'll try to reduce the testcase again with that patch to see if I can
find a testcase that expose all issues.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #13 from rguenther at suse dot de  ---
On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> 
> --- Comment #12 from Thomas Preud'homme  ---
> (In reply to Richard Biener from comment #9)
> > Ah, the patches do not fix the testcase because the testcase is _not_ the
> > PRE-creates-IV case.  It's indeed simply hoisting/PRE at work transforming
> > 
> >   # a_14 = PHI 
> >   if (!b)
> > a_8 = a_14 + 1;
> > 
> >   # a_2 = PHI 
> >   a_10 = a_2 + 1;
> >   ... = *(a_2 + 1);
> > 
> > to
> > 
> >   # a_14 = PHI 
> >   _4 = a_14 + 1;
> >   if (b)
> > _3 = _4 + 1;
> > 
> >   # a_2 = PHI 
> >   # prephitmp_12 = PHI <_4, _3>
> >   ... = *(a_2 + 1);
> > 
> > increasing register pressure mainly because nothing figures that a_2 + 1
> > in the dereference can be replaced by prephitmp_12 ...
> > 
> > So this is a missed SLSR opportunity or, in this simple form, a missed
> > PRE/CSE opportunity.  Fixing that with the following (otherwise untested)
> > restores good code generation for the testcase:
> > 
> > Index: gcc/tree-ssa-pre.c
> > ===
> > --- gcc/tree-ssa-pre.c  (revision 246414)
> > +++ gcc/tree-ssa-pre.c  (working copy)
> > @@ -4636,6 +4610,35 @@ eliminate_dom_walker::before_dom_childre
> > }
> > }
> >  
> > +  if (gimple_has_ops (stmt))
> > +   for (unsigned i = 0; i < gimple_num_ops (stmt); ++i)
> > + {
> > +   tree op = gimple_op (stmt, i);
> > +   if (op)
> > + op = get_base_address (op);
> > +   if (op
> > +   && TREE_CODE (op) == MEM_REF
> > +   && ! integer_zerop (TREE_OPERAND (op, 1)))
> > + {
> > +   tree ops[2];
> > +   vn_nary_op_t nary;
> > +   ops[0] = TREE_OPERAND (op, 0);
> > +   ops[1] = TREE_OPERAND (op, 1);
> > +   tree res = vn_nary_op_lookup_pieces (2, POINTER_PLUS_EXPR,
> > +TREE_TYPE (ops[0]),
> > +ops, );
> > +   if (res && TREE_CODE (res) == SSA_NAME)
> > + res = eliminate_avail (res);
> > +   if (res)
> > + {
> > +   TREE_OPERAND (op, 0) = res;
> > +   TREE_OPERAND (op, 1)
> > + = build_int_cst (TREE_TYPE (TREE_OPERAND (op, 1)), 0);
> > +   gimple_set_modified (stmt, true);

Add

if (TREE_CODE (res) == SSA_NAME
&& !is_gimple_debug (stmt))
  gimple_set_plf (SSA_NAME_DEF_STMT (res),
  NECESSARY, true);

here.

> > + }
> > + }
> > + }
> > +
> >if (gimple_modified_p (stmt))
> > {
> >   /* If a formerly non-invariant ADDR_EXPR is turned into an
> > 
> > note that in general optimzing
> > 
> >q = p + 1;
> >  = ...*(p + 1);
> > 
> > "back" to *q will be undone by forwprop/stmt folding later but in this
> > case the feeding stmt is a PHI node and not a pointer-plus.  It still
> > means that the change might be a bit too disruptive at this point
> > (we could restricit it a bit by only handling the case where we don't
> > replace with a pointer-plus result).
> 
> Thanks for your work on this! Sadly GCC ICEs with this patch:
> 
> 0xd36f53 update_dep_bb
> gcc/tree-ssa-tail-merge.c:411
> 0xd38f54 stmt_update_dep_bb
> gcc/tree-ssa-tail-merge.c:429
> 0xd38f54 same_succ_hash
> gcc/tree-ssa-tail-merge.c:452
> 0xd38f54 find_same_succ_bb
> gcc/tree-ssa-tail-merge.c:717
> 0xd39927 find_same_succ
> gcc/tree-ssa-tail-merge.c:748
> 0xd39927 init_worklist
> gcc/tree-ssa-tail-merge.c:767
> 0xd39927 tail_merge_optimize(unsigned int)
> gcc/tree-ssa-tail-merge.c:1726
> 0xce2d6a execute
> gcc/tree-ssa-pre.c:5164
> 
>

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #12 from Thomas Preud'homme  ---
(In reply to Richard Biener from comment #9)
> Ah, the patches do not fix the testcase because the testcase is _not_ the
> PRE-creates-IV case.  It's indeed simply hoisting/PRE at work transforming
> 
>   # a_14 = PHI 
>   if (!b)
> a_8 = a_14 + 1;
> 
>   # a_2 = PHI 
>   a_10 = a_2 + 1;
>   ... = *(a_2 + 1);
> 
> to
> 
>   # a_14 = PHI 
>   _4 = a_14 + 1;
>   if (b)
> _3 = _4 + 1;
> 
>   # a_2 = PHI 
>   # prephitmp_12 = PHI <_4, _3>
>   ... = *(a_2 + 1);
> 
> increasing register pressure mainly because nothing figures that a_2 + 1
> in the dereference can be replaced by prephitmp_12 ...
> 
> So this is a missed SLSR opportunity or, in this simple form, a missed
> PRE/CSE opportunity.  Fixing that with the following (otherwise untested)
> restores good code generation for the testcase:
> 
> Index: gcc/tree-ssa-pre.c
> ===
> --- gcc/tree-ssa-pre.c  (revision 246414)
> +++ gcc/tree-ssa-pre.c  (working copy)
> @@ -4636,6 +4610,35 @@ eliminate_dom_walker::before_dom_childre
> }
> }
>  
> +  if (gimple_has_ops (stmt))
> +   for (unsigned i = 0; i < gimple_num_ops (stmt); ++i)
> + {
> +   tree op = gimple_op (stmt, i);
> +   if (op)
> + op = get_base_address (op);
> +   if (op
> +   && TREE_CODE (op) == MEM_REF
> +   && ! integer_zerop (TREE_OPERAND (op, 1)))
> + {
> +   tree ops[2];
> +   vn_nary_op_t nary;
> +   ops[0] = TREE_OPERAND (op, 0);
> +   ops[1] = TREE_OPERAND (op, 1);
> +   tree res = vn_nary_op_lookup_pieces (2, POINTER_PLUS_EXPR,
> +TREE_TYPE (ops[0]),
> +ops, );
> +   if (res && TREE_CODE (res) == SSA_NAME)
> + res = eliminate_avail (res);
> +   if (res)
> + {
> +   TREE_OPERAND (op, 0) = res;
> +   TREE_OPERAND (op, 1)
> + = build_int_cst (TREE_TYPE (TREE_OPERAND (op, 1)), 0);
> +   gimple_set_modified (stmt, true);
> + }
> + }
> + }
> +
>if (gimple_modified_p (stmt))
> {
>   /* If a formerly non-invariant ADDR_EXPR is turned into an
> 
> note that in general optimzing
> 
>q = p + 1;
>  = ...*(p + 1);
> 
> "back" to *q will be undone by forwprop/stmt folding later but in this
> case the feeding stmt is a PHI node and not a pointer-plus.  It still
> means that the change might be a bit too disruptive at this point
> (we could restricit it a bit by only handling the case where we don't
> replace with a pointer-plus result).

Thanks for your work on this! Sadly GCC ICEs with this patch:

0xd36f53 update_dep_bb
gcc/tree-ssa-tail-merge.c:411
0xd38f54 stmt_update_dep_bb
gcc/tree-ssa-tail-merge.c:429
0xd38f54 same_succ_hash
gcc/tree-ssa-tail-merge.c:452
0xd38f54 find_same_succ_bb
gcc/tree-ssa-tail-merge.c:717
0xd39927 find_same_succ
gcc/tree-ssa-tail-merge.c:748
0xd39927 init_worklist
gcc/tree-ssa-tail-merge.c:767
0xd39927 tail_merge_optimize(unsigned int)
gcc/tree-ssa-tail-merge.c:1726
0xce2d6a execute
gcc/tree-ssa-pre.c:5164

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #11 from Richard Biener  ---
Ok, the 2nd level opportunitues are important.  So the 2nd candidate remains
for the loop carried dependencies issue(s).  Regresses (on x86_64):

Running target unix/{,-m32}
FAIL: gcc.dg/store-motion-fgcse-sm.c scan-rtl-dump store_motion "STORE_MOTION
of f, .* basic blocks, 1 insns deleted, 1 insns created"
FAIL: gcc.dg/tree-ssa/loadpre10.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre23.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre24.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre25.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre4.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre6.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/pr21417.c scan-tree-dump-times thread4 "FSM jump thread"
1
FAIL: gcc.dg/tree-ssa/pr71347.c scan-tree-dump-not optimized ".* = MEM.*;"
FAIL: gcc.dg/tree-ssa/ssa-pre-23.c scan-tree-dump pre "Eliminated: 3"
FAIL: gcc.target/i386/pr49781-1.c scan-assembler-not lea[lq]?[
\\t]((%|)r[a-z0-9]*

most of them are expected and looking for the feature we disabled.  loadpre6,
pr21417.c and pr49781-1.c need a closer look.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #10 from Richard Biener  ---
asm difference -fno-code-hoisting (-) against patch (+) shows

--- t.s 2017-03-23 13:04:11.010514228 +0100
+++ t.s.ok  2017-03-23 13:04:05.882432743 +0100
@@ -26,9 +26,9 @@
ldrbr3, [r4]@ zero_extendqisi2
cbz r3, .L2
 .L4:
-   bl  fn2
-   ldrbr3, [r4, #1]@ zero_extendqisi2
addsr4, r4, #1
+   bl  fn2
+   ldrbr3, [r4]@ zero_extendqisi2
cmp r3, #0
bne .L4
 .L2:

which probably is better as it only keeps r4 live over the call?  but
it adds a dependence between the load from r4 and the add.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #9 from Richard Biener  ---
Ah, the patches do not fix the testcase because the testcase is _not_ the
PRE-creates-IV case.  It's indeed simply hoisting/PRE at work transforming

  # a_14 = PHI 
  if (!b)
a_8 = a_14 + 1;

  # a_2 = PHI 
  a_10 = a_2 + 1;
  ... = *(a_2 + 1);

to

  # a_14 = PHI 
  _4 = a_14 + 1;
  if (b)
_3 = _4 + 1;

  # a_2 = PHI 
  # prephitmp_12 = PHI <_4, _3>
  ... = *(a_2 + 1);

increasing register pressure mainly because nothing figures that a_2 + 1
in the dereference can be replaced by prephitmp_12 ...

So this is a missed SLSR opportunity or, in this simple form, a missed
PRE/CSE opportunity.  Fixing that with the following (otherwise untested)
restores good code generation for the testcase:

Index: gcc/tree-ssa-pre.c
===
--- gcc/tree-ssa-pre.c  (revision 246414)
+++ gcc/tree-ssa-pre.c  (working copy)
@@ -4636,6 +4610,35 @@ eliminate_dom_walker::before_dom_childre
}
}

+  if (gimple_has_ops (stmt))
+   for (unsigned i = 0; i < gimple_num_ops (stmt); ++i)
+ {
+   tree op = gimple_op (stmt, i);
+   if (op)
+ op = get_base_address (op);
+   if (op
+   && TREE_CODE (op) == MEM_REF
+   && ! integer_zerop (TREE_OPERAND (op, 1)))
+ {
+   tree ops[2];
+   vn_nary_op_t nary;
+   ops[0] = TREE_OPERAND (op, 0);
+   ops[1] = TREE_OPERAND (op, 1);
+   tree res = vn_nary_op_lookup_pieces (2, POINTER_PLUS_EXPR,
+TREE_TYPE (ops[0]),
+ops, );
+   if (res && TREE_CODE (res) == SSA_NAME)
+ res = eliminate_avail (res);
+   if (res)
+ {
+   TREE_OPERAND (op, 0) = res;
+   TREE_OPERAND (op, 1)
+ = build_int_cst (TREE_TYPE (TREE_OPERAND (op, 1)), 0);
+   gimple_set_modified (stmt, true);
+ }
+ }
+ }
+
   if (gimple_modified_p (stmt))
{
  /* If a formerly non-invariant ADDR_EXPR is turned into an

note that in general optimzing

   q = p + 1;
 = ...*(p + 1);

"back" to *q will be undone by forwprop/stmt folding later but in this
case the feeding stmt is a PHI node and not a pointer-plus.  It still
means that the change might be a bit too disruptive at this point
(we could restricit it a bit by only handling the case where we don't
replace with a pointer-plus result).

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #8 from Thomas Preud'homme  ---
(In reply to rguent...@suse.de from comment #6 and #7)

>
> The testcase shows hardly profitable PRE which the patches should
> disable (I didn't verify the patches fix the testcase!)

The patch does not help for the testcase either (at least the Cortex-M7 one).

> 
> If you're benchmarking at -O2 can you add -fpredictive-commoning?  As 
> said, the patch removes important loop optimizations from PRE.

Sorry, it does not make any difference (either for the benchmark or the
testcase).

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #7 from rguenther at suse dot de  ---
On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> 
> --- Comment #5 from Thomas Preud'homme  ---
> (In reply to Richard Biener from comment #4)
> > Can you benchmark the 2nd candidate (most appropriate at this stage I think,
> > would also fix 77498).
> > 
> > It will cause pessimizations for code that benefits from predictive
> > commoning transforms that PRE can do at -O2 because predictive commoning
> > doesn't run at -O2.
> > The fix for that would be to run predictive commoning at -O2 but in a mode
> > that doesn't perform unrolling for example.  [even pcom can blow through
> > register pressure limits though]
> > 
> > Any other approach to fixing this particular bug cannot be done for GCC 7
> > (even this one is quite late ...).  So in the end I'd vote for WONTFIX for
> > GCC 7.
> 
> Sadly it doesn't. Well not quite true, there is a very slight improvement but
> we're far from the uplift -fno-code-hoisting provides.
> 
> There's maybe more than one bug at play and the reduced testcase only show 
> one.

If you're benchmarking at -O2 can you add -fpredictive-commoning?  As 
said, the patch removes important loop optimizations from PRE.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #6 from rguenther at suse dot de  ---
On Thu, 23 Mar 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> 
> --- Comment #5 from Thomas Preud'homme  ---
> (In reply to Richard Biener from comment #4)
> > Can you benchmark the 2nd candidate (most appropriate at this stage I think,
> > would also fix 77498).
> > 
> > It will cause pessimizations for code that benefits from predictive
> > commoning transforms that PRE can do at -O2 because predictive commoning
> > doesn't run at -O2.
> > The fix for that would be to run predictive commoning at -O2 but in a mode
> > that doesn't perform unrolling for example.  [even pcom can blow through
> > register pressure limits though]
> > 
> > Any other approach to fixing this particular bug cannot be done for GCC 7
> > (even this one is quite late ...).  So in the end I'd vote for WONTFIX for
> > GCC 7.
> 
> Sadly it doesn't. Well not quite true, there is a very slight improvement but
> we're far from the uplift -fno-code-hoisting provides.
> 
> There's maybe more than one bug at play and the reduced testcase only show 
> one.

The testcase shows hardly profitable PRE which the patches should
disable (I didn't verify the patches fix the testcase!)

code-hoisting (as well as CSE) can increase register pressure and neither
transform is register-pressure aware (and the patch doesn't try to
mitigate that).

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #5 from Thomas Preud'homme  ---
(In reply to Richard Biener from comment #4)
> Can you benchmark the 2nd candidate (most appropriate at this stage I think,
> would also fix 77498).
> 
> It will cause pessimizations for code that benefits from predictive
> commoning transforms that PRE can do at -O2 because predictive commoning
> doesn't run at -O2.
> The fix for that would be to run predictive commoning at -O2 but in a mode
> that doesn't perform unrolling for example.  [even pcom can blow through
> register pressure limits though]
> 
> Any other approach to fixing this particular bug cannot be done for GCC 7
> (even this one is quite late ...).  So in the end I'd vote for WONTFIX for
> GCC 7.

Sadly it doesn't. Well not quite true, there is a very slight improvement but
we're far from the uplift -fno-code-hoisting provides.

There's maybe more than one bug at play and the reduced testcase only show one.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot 
gnu.org

--- Comment #4 from Richard Biener  ---
Can you benchmark the 2nd candidate (most appropriate at this stage I think,
would also fix 77498).

It will cause pessimizations for code that benefits from predictive commoning
transforms that PRE can do at -O2 because predictive commoning doesn't run at
-O2.
The fix for that would be to run predictive commoning at -O2 but in a mode
that doesn't perform unrolling for example.  [even pcom can blow through
register pressure limits though]

Any other approach to fixing this particular bug cannot be done for GCC 7 (even
this one is quite late ...).  So in the end I'd vote for WONTFIX for GCC 7.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #3 from Richard Biener  ---
Created attachment 41031
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41031=edit
second patch candidate

Second candidate, instead of inhibiting PHI insertions this limits use of them
during elimination.  This should allow existing 2nd level opportunities to
continue to be handled.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #2 from Richard Biener  ---
Created attachment 41030
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41030=edit
first patch candidate

First variant of an aggressive approach - never add IVs / loop carried
dependencies in PRE.  This first approach is leaner on compile-time /
memory-use
in PRE but eventually misses 2nd level optimizations by not performing PHI
insertions on loop headers.

[Bug tree-optimization/80155] [7 regression] Performance regression with code hoisting enabled

2017-03-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2017-03-23
   Target Milestone|--- |7.0
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener  ---
Code-hoisting triggers a latent pessimization PRE performs:

Transform:

   [85.00%]:
  # a_14 = PHI 
  if (b_7(D) != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [42.50%]:
  goto ; [100.00%]

   [42.50%]:
  a_8 = a_14 + 1;

   [85.00%]:
  # a_2 = PHI 
  fn2 ();
  a_10 = a_2 + 1;

to

   [85.00%]:
  # a_14 = PHI 
  _4 = a_14 + 1;
  if (b_7(D) != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [42.50%]:
  _3 = _4 + 1;

   [85.00%]:
  # a_2 = PHI 
  # prephitmp_12 = PHI <_4(3), _3(4)>
  fn2 ();

that's because the hoisting (which itself isn't a problem) makes
a_2 + 1 partially redundant over the latch.  We see this issue
in related testcases where PRE can compute a constant for the
first iteration value of expressions and thus inserts IVs for
them.  So it's nothing new and a fix would hopefully fix those
cases as well.