[Bug target/29256] [4.9/5/6 regression] loop performance regression

2015-08-12 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #62 from amker at gcc dot gnu.org ---
(In reply to Bill Schmidt from comment #61)
> (In reply to amker from comment #60)
> > (In reply to Bill Schmidt from comment #59)
> > > We don't have a lot of data yet, but we have seen several examples in
> > > SPEC and other benchmarks where turning on -funroll-loops is helpful,
> > > but should be much more helpful -- in many cases performance improves
> > > with a much higher unroll factor.  However, the effectiveness of
> > > unrolling is very much tied up with these issues in IVOPTS, where we
> > > currently end up with too many separate base registers for IVs.  As we
> > > increase the unroll factor, we
> > By this, do you mean that too many candidates are chosen?  Or is it the
> > issue this PR describes?  Thanks.
> > 
> 
> On the surface, it's the issue from this PR where we have lots of separate
> induction variables with their own index registers each requiring an add
> during each iteration.
IMHO, this issue should be fixed by a GIMPLE unroller run before IVOPTS, or in
the RTL unroller.  It's not really practical to fix it in IVOPTS.

> The presence of this issue masks whether we have too many candidates, but in
> the sense that we often see register spill associated with this kind of code,
> we do have too many.  I.e., the register pressure model may not be in tune
> with the kind of addressing mode that's being selected, but that's just a
> theory.  Or perhaps pressure is just being generically under-predicted for
> POWER.
The IVOPTS register-pressure model sometimes fails to preserve a small IV set
on aarch64 too; I have this issue on my list.  On the other hand, the loops I
saw are generally very big, so it might be inappropriate for the RTL unroller
to decide to unroll them in the first place.

> 
> Up till now we haven't done a lot of detailed analysis.  Hopefully we can
> free somebody up to start looking at some of our unrolling issues soon.


[Bug target/29256] [4.9/5/6 regression] loop performance regression

2015-08-12 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #61 from Bill Schmidt  ---
(In reply to amker from comment #60)
> (In reply to Bill Schmidt from comment #59)
> > We don't have a lot of data yet, but we have seen several examples in SPEC
> > and other benchmarks where turning on -funroll-loops is helpful, but should
> > be much more helpful -- in many cases performance improves with a much
> > higher unroll factor.  However, the effectiveness of unrolling is very much
> > tied up with these issues in IVOPTS, where we currently end up with too many
> > separate base registers for IVs.  As we increase the unroll factor, we
> By this, do you mean that too many candidates are chosen?  Or is it the
> issue this PR describes?  Thanks.
> 

On the surface, it's the issue from this PR where we have lots of separate
induction variables with their own index registers each requiring an add during
each iteration.  The presence of this issue masks whether we have too many
candidates, but in the sense that we often see register spill associated with
this kind of code, we do have too many.  I.e., the register pressure model may
not be in tune with the kind of addressing mode that's being selected, but
that's just a theory.  Or perhaps pressure is just being generically
under-predicted for POWER.

Up till now we haven't done a lot of detailed analysis.  Hopefully we can free
somebody up to start looking at some of our unrolling issues soon.
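
For illustration, a minimal made-up loop of the kind being discussed (a sketch
only, not the PR's actual testcase):

  /* With one induction variable per memory access, the loop ends up with a
     separate base register for each of a, b and c, and every pointer needs
     its own add in every iteration.  With a single counter and reg+reg
     (indexed) addressing, only the counter is updated.  */
  void
  vadd (const float *a, const float *b, float *c, int n)
  {
    for (int i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }

With several such IVs live at once, raising the unroll factor presumably just
adds more long-lived base registers and updates, which fits the register-spill
observation above.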


[Bug target/29256] [4.9/5/6 regression] loop performance regression

2015-08-12 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #60 from amker at gcc dot gnu.org ---
(In reply to Bill Schmidt from comment #59)
> (In reply to rguent...@suse.de from comment #57)
> > 
> > It's been a long time since I've done SPEC measuring with/without
> > -funroll-loops (or/and -fpeel-loops).  Note that these flags have
> > secondary effects as well:
> > 
> > toplev.c:flag_web = flag_unroll_loops || flag_peel_loops;
> > toplev.c:flag_rename_registers = flag_unroll_loops || flag_peel_loops;
> 
> We don't have a lot of data yet, but we have seen several examples in SPEC
> and other benchmarks where turning on -funroll-loops is helpful, but should
> be much more helpful -- in many cases performance improves with a much
> higher unroll factor.  However, the effectiveness of unrolling is very much
> tied up with these issues in IVOPTS, where we currently end up with too many
> separate base registers for IVs.  As we increase the unroll factor, we
By this, do you mean that too many candidates are chosen?  Or is it the issue
this PR describes?  Thanks.

> eventually hit this as a limiting factor, so fixing this IVOPTS issue would
> be very helpful for POWER.
> 
> As a side note, with -fprofile-use a GIMPLE unroller could peel and unroll
> hot loop traces in loops that would otherwise be too complex to unroll. 
> I.e., if there is a single hot trace through a loop, you can do tail
> duplication on the trace to force it into superblock form, and then peel and
> unroll that superblock while falling into the original loop if the trace is
> left.  Complete unrolling and unrolling by a factor are both possible.  I
> don't know of specific benchmarks that would be helped by this, though.
> 
> (An RTL unroller could do this as well, but it seems much more natural and
> implementable in GIMPLE.)


[Bug target/29256] [4.9/5/6 regression] loop performance regression

2015-08-12 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #59 from Bill Schmidt  ---
(In reply to rguent...@suse.de from comment #57)
> 
> It's been a long time since I've done SPEC measuring with/without
> -funroll-loops (or/and -fpeel-loops).  Note that these flags have
> secondary effects as well:
> 
> toplev.c:flag_web = flag_unroll_loops || flag_peel_loops;
> toplev.c:flag_rename_registers = flag_unroll_loops || flag_peel_loops;

We don't have a lot of data yet, but we have seen several examples in SPEC and
other benchmarks where turning on -funroll-loops is helpful, but should be much
more helpful -- in many cases performance improves with a much higher unroll
factor.  However, the effectiveness of unrolling is very much tied up with
these issues in IVOPTS, where we currently end up with too many separate base
registers for IVs.  As we increase the unroll factor, we eventually hit this as
a limiting factor, so fixing this IVOPTS issue would be very helpful for POWER.

As a side note, with -fprofile-use a GIMPLE unroller could peel and unroll hot
loop traces in loops that would otherwise be too complex to unroll.  I.e., if
there is a single hot trace through a loop, you can do tail duplication on the
trace to force it into superblock form, and then peel and unroll that
superblock while falling into the original loop if the trace is left.  Complete
unrolling and unrolling by a factor are both possible.  I don't know of
specific benchmarks that would be helped by this, though.

(An RTL unroller could do this as well, but it seems much more natural and
implementable in GIMPLE.)
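
A rough hand-written illustration of the superblock idea above, assuming a
loop with a single hot trace (a sketch only; not what the compiler would
literally emit):

  /* Original loop: assume profile data says x[i] < 0 is almost never true,
     so the 'else' side is the hot trace.  */
  long
  sum (const long *x, long n)
  {
    long s = 0;
    for (long i = 0; i < n; i++)
      {
        if (x[i] < 0)
          s -= x[i];        /* cold path */
        else
          s += x[i];        /* hot trace */
      }
    return s;
  }

  /* Hot trace tail-duplicated into a superblock and unrolled by 2; as soon
     as the trace is left (or for the remainder), fall into the original
     loop.  */
  long
  sum_superblock (const long *x, long n)
  {
    long s = 0;
    long i = 0;
    while (i + 2 <= n)
      {
        if (x[i] < 0 || x[i + 1] < 0)
          break;            /* trace left -> original loop below */
        s += x[i];
        s += x[i + 1];
        i += 2;
      }
    for (; i < n; i++)      /* original loop handles cold cases and the tail */
      {
        if (x[i] < 0)
          s -= x[i];
        else
          s += x[i];
      }
    return s;
  }

A real superblock would keep a side exit after each duplicated copy rather
than hoisting both checks to the top, but the overall shape is similar.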


[Bug target/29256] [4.9/5/6 regression] loop performance regression

2015-08-12 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #58 from amker at gcc dot gnu.org ---
(In reply to Bill Schmidt from comment #56)
> (In reply to Bill Schmidt from comment #53)
> > I'm not a fan of a tree-level unroller.  It's impossible to make good
> > decisions about unroll factors that early.  But your second approach sounds
> > quite promising to me.
> 
> I would be willing to soften this statement.  I think that an early unroller
> might well be a profitable approach for most systems with large caches and
> so forth, where if the unrolling heuristics are not completely accurate we
> are still likely to make a reasonably good decision.  However, I would
> expect to see ports with limited caches/memory to want more accurate control
> over unrolling decisions.  So I could see allowing ports to select between a
> GIMPLE unroller and an RTL unroller (I doubt anybody would want both).

Thanks for the comments.
As David suggested, we can try to implement a relatively conservative unroller
and make sure it's a win in most unrolled cases, even if some opportunities
are missed.  Then we can enable it at the O3/Ofast levels; I think that would
be welcome, since right now we don't have a general unroller enabled by
default.

> 
> In general it seems like PowerPC could benefit from more aggressive
> unrolling much of the time, provided we can also solve the related IVOPTS
> problems that cause too much register spill.
> 
> I may have an interest in working on a GIMPLE unroller, depending on how
> quickly I can complete or shed some other projects...

(In reply to rguent...@suse.de from comment #57)
> On Tue, 11 Aug 2015, wschmidt at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
> > 
> > --- Comment #56 from Bill Schmidt  ---
> > (In reply to Bill Schmidt from comment #53)
> > > I'm not a fan of a tree-level unroller.  It's impossible to make good
> > > decisions about unroll factors that early.  But your second approach
> > > sounds quite promising to me.
> > 
> > I would be willing to soften this statement.  I think that an early
> > unroller might well be a profitable approach for most systems with large
> > caches and so forth, where if the unrolling heuristics are not completely
> > accurate we are still likely to make a reasonably good decision.  However,
> > I would expect to see ports with limited caches/memory to want more
> > accurate control over unrolling decisions.  So I could see allowing ports
> > to select between a GIMPLE unroller and an RTL unroller (I doubt anybody
> > would want both).
> > 
> > In general it seems like PowerPC could benefit from more aggressive
> > unrolling much of the time, provided we can also solve the related IVOPTS
> > problems that cause too much register spill.
> > 
> > I may have an interest in working on a GIMPLE unroller, depending on how
> > quickly I can complete or shed some other projects...
> 
> I think that a separate unrolling on GIMPLE would be a hard sell
> due to the lack of a good cost model.  _But_ doing unrolling as part
> of another transform like we are doing now makes sense.  So does
> eventually moving parts of an RTL pass involving unrolling to
> GIMPLE, like modulo scheduling or SMS (leaving the scheduling part
> to RTL).
(In reply to Bill Schmidt from comment #56)
> (In reply to Bill Schmidt from comment #53)
> > I'm not a fan of a tree-level unroller.  It's impossible to make good
> > decisions about unroll factors that early.  But your second approach sounds
> > quite promising to me.
> 
> I would be willing to soften this statement.  I think that an early unroller
> might well be a profitable approach for most systems with large caches and
> so forth, where if the unrolling heuristics are not completely accurate we
> are still likely to make a reasonably good decision.  However, I would
> expect to see ports with limited caches/memory to want more accurate control
> over unrolling decisions.  So I could see allowing ports to select between a
> GIMPLE unroller and an RTL unroller (I doubt anybody would want both).

As David suggested, we can try to implement a relatively conservative unroller
and make sure it's a win in most unrolled cases, even if some opportunities
are missed.  Then we can enable it at the O3/Ofast levels; that would be nice,
since we don't have a general unroller enabled by default.

About the cost model: is it possible to introduce a cache information model in
GCC?  I don't think it's a difficult problem, and it could be a starting point
for cache-sensitive optimizations in the future.  Another general question:
what kinds of costs do we need in a good unroller, besides cache and branch
costs?
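
Purely as a sketch of what such a cost model might weigh, something along
these lines (all names are hypothetical, not existing GCC code):

  /* Hypothetical inputs a GIMPLE unroller could consider when picking an
     unroll factor.  */
  struct loop_costs
  {
    unsigned body_insns;      /* estimated instructions per iteration        */
    unsigned est_trip_count;  /* profile- or statically-estimated trip count */
    unsigned live_regs;       /* rough register-pressure estimate            */
    unsigned icache_budget;   /* insns that fit the I-cache or loop buffer   */
    unsigned avail_regs;      /* allocatable registers on the target         */
  };

  static unsigned
  pick_unroll_factor (const struct loop_costs *c, unsigned max_factor)
  {
    unsigned f = max_factor;

    /* Stay within the I-cache / loop-stream budget.  */
    while (f > 1 && f * c->body_insns > c->icache_budget)
      f /= 2;

    /* Avoid creating more live values than the target has registers for.  */
    while (f > 1 && f * c->live_regs > c->avail_regs)
      f /= 2;

    /* Short-running loops are not worth unrolling.  */
    if (c->est_trip_count < 2 * f)
      return 1;

    return f;
  }

Branch costs would enter mainly through the saved loop-back branches, and the
cache side mostly as the instruction-footprint limit above; data-cache
behaviour (prefetch distance, streams) would presumably need its own model.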

> 
> In general it seems like PowerPC could benefit from more aggressive
> unrolling much of the time, provided we can also solve the related IVOPTS
> problems that cause too much register spill.
> 
> I may have an interest in working on a GIMPLE unroller, depending on how
> quickly I can complete or shed some other projects...


[Bug target/29256] [4.9/5/6 regression] loop performance regression

2015-08-12 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #57 from rguenther at suse dot de  ---
On Tue, 11 Aug 2015, wschmidt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
> 
> --- Comment #56 from Bill Schmidt  ---
> (In reply to Bill Schmidt from comment #53)
> > I'm not a fan of a tree-level unroller.  It's impossible to make good
> > decisions about unroll factors that early.  But your second approach sounds
> > quite promising to me.
> 
> I would be willing to soften this statement.  I think that an early unroller
> might well be a profitable approach for most systems with large caches and so
> forth, where if the unrolling heuristics are not completely accurate we are
> still likely to make a reasonably good decision.  However, I would expect to
> see ports with limited caches/memory to want more accurate control over
> unrolling decisions.  So I could see allowing ports to select between a GIMPLE
> unroller and an RTL unroller (I doubt anybody would want both).
> 
> In general it seems like PowerPC could benefit from more aggressive unrolling
> much of the time, provided we can also solve the related IVOPTS problems that
> cause too much register spill.
> 
> I may have an interest in working on a GIMPLE unroller, depending on how
> quickly I can complete or shed some other projects...

I think that a separate unrolling on GIMPLE would be a hard sell
due to the lack of a good cost model.  _But_ doing unrolling as part
of another transform like we are doing now makes sense.  So does
eventually moving parts of an RTL pass involving unrolling to
GIMPLE, like modulo scheduling or SMS (leaving the scheduling part
to RTL).

Note that the RTL unroller is not enabled by default at any optimization
level, and note that unfortunately the RTL unroller shares flags with the
GIMPLE-level complete peeling (where it mainly controls cost modeling).
Oh, but it's enabled with -fprofile-use.

It's been a long time since I've done SPEC measuring with/without
-funroll-loops (or/and -fpeel-loops).  Note that these flags have
secondary effects as well:

toplev.c:flag_web = flag_unroll_loops || flag_peel_loops;
toplev.c:flag_rename_registers = flag_unroll_loops || flag_peel_loops;


[Bug target/29256] [4.9/5/6 regression] loop performance regression

2015-08-11 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #56 from Bill Schmidt  ---
(In reply to Bill Schmidt from comment #53)
> I'm not a fan of a tree-level unroller.  It's impossible to make good
> decisions about unroll factors that early.  But your second approach sounds
> quite promising to me.

I would be willing to soften this statement.  I think that an early unroller
might well be a profitable approach for most systems with large caches and so
forth, where if the unrolling heuristics are not completely accurate we are
still likely to make a reasonably good decision.  However, I would expect to
see ports with limited caches/memory to want more accurate control over
unrolling decisions.  So I could see allowing ports to select between a GIMPLE
unroller and an RTL unroller (I doubt anybody would want both).

In general it seems like PowerPC could benefit from more aggressive unrolling
much of the time, provided we can also solve the related IVOPTS problems that
cause too much register spill.

I may have an interest in working on a GIMPLE unroller, depending on how
quickly I can complete or shed some other projects...


[Bug target/29256] [4.9/5/6 regression] loop performance regression

2015-06-26 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Jakub Jelinek  changed:

   What            |Removed |Added

   Target Milestone|4.9.3   |4.9.4


[Bug target/29256] [4.9/5/6 regression] loop performance regression

2015-06-26 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #55 from Jakub Jelinek  ---
GCC 4.9.3 has been released.