Re: [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2022-01-11 Thread Jan Hubicka via Gcc-bugs
On Zen 2 and Zen 3 with -flto the speedup seems to be about 12% for both -O2
and -Ofast -march=native, which is very nice!
Zen 1 for some reason sees less improvement, about 6%.
With PGO it is 3.8%.

Overall it seems a win, but there are a few noteworthy issues.

I also see a 6.69% regression on x64 with -Ofast -march=native -flto
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=475.377.0
and perhaps 3-5% on sphinx
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.280.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=227.280.0

For non-SPEC benchmarks there is a regression on nbench
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=26.645.1
There are also large changes in tsvc
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report
It may be noise since the kernels are tiny, but for example x293 reproduces
on both Kaby Lake and Zen with an 80-90% regression that may be easy to
track down (the kernel is included in the testsuite). The same regression is
not seen on Zen 3, so it may be ISA specific.

Finally, there seem to be relatively large code size savings on the
Polyhedron benchmarks today (8% on capacita,

Thanks a lot!


[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2022-01-10 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #40 from CVS Commits  ---
The master branch has been updated by Richard Sandiford :

https://gcc.gnu.org/g:037cc0b4a6646cc86549247a3590215ebd5c4c43

commit r12-6416-g037cc0b4a6646cc86549247a3590215ebd5c4c43
Author: Richard Sandiford 
Date:   Mon Jan 10 14:47:09 2022 +

ira: Handle "soft" conflicts between cap and non-cap allocnos

This patch looks for allocno conflicts of the following form:

- One allocno (X) is a cap allocno for some non-cap allocno X2.
- X2 belongs to some loop L2.
- The other allocno (Y) is a non-cap allocno.
- Y is an ancestor of some allocno Y2 in L2.
- Y2 is not referenced in L2 (that is, ALLOCNO_NREFS (Y2) == 0).
- Y can use a different allocation from Y2.

In this case, Y's register is live across L2 but is not used within it,
whereas X's register is used only within L2.  The conflict is therefore
only "soft", in that it can easily be avoided by spilling Y2 inside L2
without affecting any insn references.
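
A rough standalone sketch of the test (illustrative only: the toy allocno
representation and all names below are made up, and the real code also
checks that Y can use a different allocation from Y2):

-
struct loop;
struct allocno
{
  allocno *cap_member;  // non-cap allocno this cap stands for, else null
  loop *loop_node;      // loop tree node the allocno belongs to
  allocno *parent;      // allocno for the same pseudo in the parent loop
  int nrefs;            // number of insn references (ALLOCNO_NREFS)
};

// True if ANC is A itself or an ancestor of A via parent links.
static bool is_ancestor (const allocno *anc, const allocno *a)
{
  for (; a; a = a->parent)
    if (a == anc)
      return true;
  return false;
}

// Soft conflict between cap X and non-cap Y, where Y2 is the allocno for
// Y's pseudo inside X2's loop (passed in directly to keep the toy simple).
static bool soft_conflict_p (const allocno *x, const allocno *y,
                             const allocno *y2)
{
  const allocno *x2 = x->cap_member;
  return x2 != nullptr                      // X is a cap for X2 in L2 ...
         && y->cap_member == nullptr        // ... Y is a non-cap allocno ...
         && y2->loop_node == x2->loop_node  // ... Y2 belongs to L2 ...
         && is_ancestor (y, y2)             // ... Y is an ancestor of Y2 ...
         && y2->nrefs == 0;                 // ... and Y2 is unreferenced there.
}
-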

In principle we could do this for ALLOCNO_NREFS (Y2) != 0 too, with the
callers then taking Y2's ALLOCNO_MEMORY_COST into account.  There would
then be no "cliff edge" between a Y2 that has no references and a Y2 that
has (say) a single cold reference.

However, doing that isn't necessary for the PR and seems to give
variable results in practice.  (fotonik3d_r improves slightly but
namd_r regresses slightly.)  It therefore seemed better to start
with the higher-value zero-reference case and see how things go.

On top of the previous patches in the series, this fixes the exchange2
regression seen in GCC 11.

gcc/
PR rtl-optimization/98782
* ira-int.h (ira_soft_conflict): Declare.
* ira-color.c (max_soft_conflict_loop_depth): New constant.
(ira_soft_conflict): New function.
(spill_soft_conflicts): Likewise.
(assign_hard_reg): Use them to handle the case described by
the comment above ira_soft_conflict.
(improve_allocation): Likewise.
* ira.c (check_allocation): Allow allocnos with "soft" conflicts
to share the same register.

gcc/testsuite/
* gcc.target/aarch64/reg-alloc-4.c: New test.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2022-01-10 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #39 from CVS Commits  ---
The master branch has been updated by Richard Sandiford :

https://gcc.gnu.org/g:01f3e6a40e7202310abbeb41c345d325bd69554f

commit r12-6415-g01f3e6a40e7202310abbeb41c345d325bd69554f
Author: Richard Sandiford 
Date:   Mon Jan 10 14:47:08 2022 +

ira: Consider modelling caller-save allocations as loop spills

If an allocno A in an inner loop L spans a call, a parent allocno AP
can choose to handle a call-clobbered/caller-saved hard register R
in one of two ways:

(1) save R before each call in L and restore R after each call
(2) spill R to memory throughout L

(2) can be cheaper than (1) in some cases, particularly if L does
not reference A.

Before the patch we always did (1).  The patch adds support for
picking (2) instead, when it seems cheaper.  It builds on the
earlier support for not propagating conflicts to parent allocnos.
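
As a standalone sketch of the comparison (illustrative only: the cost model
and all names below are made up; costs are counted in weighted memory
accesses using block frequencies):

-
#include <iostream>

struct loop_info
{
  double call_freq;    // summed frequency of the calls in L that A spans
  double border_freq;  // frequency of L's entry/exit edges
  double use_freq;     // frequency of references to A inside L
};

// (1) save R before each call in L and restore R after each call.
static double caller_save_cost (const loop_info &l,
                                double store_cost, double load_cost)
{
  return l.call_freq * (store_cost + load_cost);
}

// (2) spill R to memory throughout L: one store on entry and one load on
// exit, plus a memory access for every reference to A inside the loop.
static double loop_spill_cost (const loop_info &l,
                               double store_cost, double load_cost)
{
  return l.border_freq * (store_cost + load_cost) + l.use_freq * load_cost;
}

int main ()
{
  // A loop that spans many calls but never references A directly:
  loop_info l = { /*call_freq=*/1000, /*border_freq=*/1, /*use_freq=*/0 };
  std::cout << "caller-save: " << caller_save_cost (l, 1, 1) << "\n"
            << "loop spill:  " << loop_spill_cost (l, 1, 1) << "\n";
  // (2) is far cheaper here, matching the "L does not reference A" case.
}
-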

gcc/
PR rtl-optimization/98782
* ira-int.h (ira_caller_save_cost): New function.
(ira_caller_save_loop_spill_p): Likewise.
* ira-build.c (ira_propagate_hard_reg_costs): Test whether it is
cheaper to spill a call-clobbered register throughout a loop rather
than spill it around each individual call.  If so, treat all
call-clobbered registers as conflicts and...
(propagate_allocno_info): ...do not propagate call information
from the child to the parent.
* ira-color.c (move_spill_restore): Update accordingly.
* ira-costs.c (ira_tune_allocno_costs): Use ira_caller_save_cost.

gcc/testsuite/
* gcc.target/aarch64/reg-alloc-3.c: New test.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2022-01-10 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #38 from CVS Commits  ---
The master branch has been updated by Richard Sandiford :

https://gcc.gnu.org/g:d54565d87ff79b882208dfb29af50232033c233d

commit r12-6413-gd54565d87ff79b882208dfb29af50232033c233d
Author: Richard Sandiford 
Date:   Mon Jan 10 14:47:07 2022 +

ira: Add ira_subloop_allocnos_can_differ_p

color_pass has two instances of the same code for propagating non-cap
assignments from parent loops to subloops.  This patch adds a helper
function for testing when such propagations are required for correctness
and uses it to remove the duplicated code.

A later patch will use this in ira-build.c too, which is why the
function is exported to ira-int.h.

No functional change intended.

gcc/
PR rtl-optimization/98782
* ira-int.h (ira_subloop_allocnos_can_differ_p): New function,
extracted from...
* ira-color.c (color_pass): ...here.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2022-01-10 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #37 from CVS Commits  ---
The master branch has been updated by Richard Sandiford :

https://gcc.gnu.org/g:909a4b4764c4f270f09ccb2a950c91b21ed7b33a

commit r12-6412-g909a4b4764c4f270f09ccb2a950c91b21ed7b33a
Author: Richard Sandiford 
Date:   Mon Jan 10 14:47:07 2022 +

ira: Add comments and fix move_spill_restore calculation

This patch adds comments to describe each use of ira_loop_border_costs.
I think this highlights that move_spill_restore was using the wrong cost
in one case, which came from tranposing [0] and [1] in the original
(pre-ira_loop_border_costs) ira_memory_move_cost expressions.  The
difference would only be noticeable on targets that distinguish between
load and store costs.
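
A toy illustration of that last point (not GCC's actual tables; which index
means load vs. store below is purely an assumption for the demo):

-
#include <iostream>

// Pretend [0] is the store cost and [1] is the load cost (assumed here).
static int spill_cost (const int memory_cost[2], bool transposed)
{
  // The intended lookup is the store cost; the transposed one reads [1].
  return memory_cost[transposed ? 1 : 0];
}

int main ()
{
  const int equal[2] = {4, 4};  // store cost == load cost: swap is invisible
  const int split[2] = {4, 6};  // distinct costs: swap changes the result
  std::cout << spill_cost (equal, false) << " vs "
            << spill_cost (equal, true) << "\n"    // prints 4 vs 4
            << spill_cost (split, false) << " vs "
            << spill_cost (split, true) << "\n";   // prints 4 vs 6
}
-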

gcc/
PR rtl-optimization/98782
* ira-color.c (color_pass): Add comments to describe the spill
costs.
(move_spill_restore): Likewise.  Fix reversed calculation.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2022-01-09 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #36 from Hongtao.liu  ---
(In reply to pthaugen from comment #33)
> (In reply to rsand...@gcc.gnu.org from comment #32)
> > Created attachment 52102 [details]
> > Alternative patch
> > 
> > This patch is a squash of several ira tweaks that together recover the
> > pre-GCC11 exchange2 performance on aarch64.  It isn't ready for trunk
> > yet (hence lack of comments and changelog).  It would be great to hear
> > whether/how it works on other targets though.
> 
> I tried the patch on a Power9 system. Execution time went from 371 seconds
> to 291.

The patch also recovers 548.exchange2_r performance on ICX, thanks.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2022-01-06 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

URL||https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587761.html

--- Comment #35 from rsandifo at gcc dot gnu.org ---
(In reply to pthaugen from comment #33)
> I tried the patch on a Power9 system. Execution time went from 371 seconds
> to 291.
Thanks Pat!  Given that it seems to work OK on aarch64 and Power,
I've submitted a fixed version of the series for review:
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587761.html

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2022-01-04 Thread pthaugen at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #34 from pthaugen at gcc dot gnu.org ---
(In reply to pthaugen from comment #33)
> 
> I tried the patch on a Power9 system. Execution time went from 371 seconds
> to 291.

What I should have included: this is in line with, or even slightly better
than, the two patches posted by Tamar.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2022-01-04 Thread pthaugen at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #33 from pthaugen at gcc dot gnu.org ---
(In reply to rsand...@gcc.gnu.org from comment #32)
> Created attachment 52102 [details]
> Alternative patch
> 
> This patch is a squash of several ira tweaks that together recover the
> pre-GCC11 exchange2 performance on aarch64.  It isn't ready for trunk
> yet (hence lack of comments and changelog).  It would be great to hear
> whether/how it works on other targets though.

I tried the patch on a Power9 system. Execution time went from 371 seconds to
291.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-31 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org

--- Comment #32 from rsandifo at gcc dot gnu.org ---
Created attachment 52102
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52102&action=edit
Alternative patch

This patch is a squash of several ira tweaks that together recover the
pre-GCC11 exchange2 performance on aarch64.  It isn't ready for trunk
yet (hence lack of comments and changelog).  It would be great to hear
whether/how it works on other targets though.

The patch bootstraps on aarch64-linux-gnu and x86_64-linux-gnu,
but there are some new scan-assembler failures that need looking at.

Quoting from the covering message:

The main changes are:

(1) Add ira_loop_border_costs to simplify spill cost calculations
(NFC intended)

(2) Avoid freeing updated costs until the loop node has been fully
allocated.  This in turn allows:

(3) Make improve_allocation work exclusively on updated costs,
rather than using a mixture of updated and original costs.
One reason this matters is that the register costs only make
sense relative to the memory costs, so in some cases,
a common register is subtracted from the updated memory cost
instead of being added to each individual updated register cost.

(4) If a child allocno has a hard register conflict, allow the parent
allocno to handle the conflict by spilling to memory throughout
the child allocno's loop.  This carries the child allocno's full
memory cost plus the cost of spilling to memory on entry to the
loop and restoring it on exit, but this can still be cheaper
than spilling the entire parent allocno.  In particular, it helps
for allocnos that are live across a loop but not referenced
within it, since the child allocno's memory cost is 0 in
that case.

(5) Extend (4) to cases in which the child allocno is live across
a call.  The parent then has a free choice between spilling
call-clobbered registers around each call (as normal) or
spilling them on entry to the loop, keeping the allocno in memory
throughout the loop, and restoring them on exit from the loop.

(6) Detect “soft conflicts” in which:

- one allocno (A1) is a cap whose (transitive) “real” allocno
  is A1'

- A1' occurs in loop L1'

- the other allocno (A2) is a non-cap allocno

- the equivalent of A2 is live across L1' (hence the conflict)
  but has no references in L1'

In this case we can spill A2 around L1' (or perhaps some parent
loop) and reuse the same register for A1'.  A1 and A2 can then
use the same hard register, provided that we make sure not to
propagate A1's allocation to A1'.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-20 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #31 from rsandifo at gcc dot gnu.org ---
I don't know how relevant this is to the exchange2 problem yet,
but for:

-
#define PROB 0.95

struct L {
  int data;
  L *next;
  L *inner;
};

template<int N>
struct S {
  static __attribute__((always_inline)) void f(L *head, int inc) {
while (head) {
  asm volatile ("// Loop %0" :: "i" (N));
  int subinc = head->data + inc;
  if (__builtin_expect_with_probability (bool(head->inner), 0, PROB))
    S<N - 1>::f(head->inner, subinc);
  head->data = subinc;
  head = head->inner;
}
  }
};

template<>
struct S<0> {
  static void f(L *, int) {
asm volatile ("// foo" ::: "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7");
  }
};

void x(L *head) { S<10>::f(head, 1); }
-

compiled on aarch64 with -O3, we always seem to spill in the outer
loops regardless of the value of PROB.  That is, the inner loops
always seem to get priority even if their bb frequencies are low.
I would have expected that moving PROB from 0.05 (inner loop very
likely to be executed) to 0.95 (inner loop very likely to be skipped)
would have moved the spills from the outer loops to the inner loops.

An unrelated issue on aarch64 is that IP0 and IP1 are explicitly
clobbered by call instructions, via CALL_INSN_FUNCTION_USAGE.
We have to do this because the linker is allowed to insert code
that clobbers those registers even if we “know” that the target
of the call doesn't clobber them.  However, clobbering the registers
that way prevents them from being used for registers that are live
across the call, even if the call is very unlikely to be executed.
Hacking a fix for that does reduce the number of spills, but doesn't
get rid of the ones that matter in exchange2.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #30 from Tamar Christina  ---
(In reply to Martin Jambor from comment #29)
> (In reply to Tamar Christina from comment #23)
> > I wonder if we can get rid of the final magic parameter too, we run with
> > --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> > exchange, though still a slight effect on leela but a 12% gain in imagick.
> > 
> > This looks like without the parameter we lose constprop on MorphologyApply
> > resulting in a much bigger function.  Do you want me to make a new ticket
> > for this?
> 
> Indeed I did not know about this.  I think tracking it in a (separate) PR
> would make sense.  Thanks a lot for the pointer!

Thanks, I created a new ticket:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103734. The threshold seems to
be a lot lower than 80%; 20% already seems to be enough.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-14 Thread jamborm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #29 from Martin Jambor  ---
(In reply to Tamar Christina from comment #23)
> I wonder if we can get rid of the final magic parameter too, we run with
> --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> exchange, though still a slight effect on leela but a 12% gain in imagick.
> 
> This looks like without the parameter we lose constprop on MorphologyApply
> resulting in a much bigger function.  Do you want me to make a new ticket
> for this?

Indeed I did not know about this.  I think tracking it in a (separate) PR would
make sense.  Thanks a lot for the pointer!

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #28 from Tamar Christina  ---
(In reply to pthaugen from comment #19)
> I tried -fno-inline-functions-called-once and the patches on a Power9
> system. Following are the run times and spill counts (grep -c Spilling
> exchange2.fppized.f90.298r.ira). Interesting that the spill counts increase
> when -fno-inline-functions-called-once is added, but obviously that
> additional spill is on cold paths since execution time improves. Compiler
> options used are "-O3 -mcpu=power9 -fpeel-loops -funroll-loops -ffast-math".
> 
>                                   time(sec)  Spill
> base                                  473      3284
>
> no-inline-functions-called-once       370      5410
>
> patches 1 & 2                         397       461
>
> patches 1 & 2
> + no-inline-functions-called-once     299       870

Ah, thanks for testing! So at least we know that all architectures are
suffering from the same underlying issue :)

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #27 from Tamar Christina  ---
(In reply to hubicka from comment #26)
> > It's with LTO, I'll see if non-LTO has the same benefit.  In terms of
> > code-size it looks like it accounts for a 20% increase in binary size, but
> > the hot function shrinks approx 6x. But of course this increase covers the
> > entire binary and there are other hot functions like
> > GetVirtualPixelsFromNexus that also shrink significantly when specialized.
>
> Thanks.  I will check this.  We may need to tweak the cost model to first
> specialize the functions that reduce more (by doing multiple passes with
> decreasing threshold)

Thanks! It looks like it doesn't need LTO, just plain -Ofast is enough to get
the gain.

> 
> It is really nice to see some real improvements from ipa-cp. For years
> the pass did almost nothing on the benchmarks I tested (spec, Firefox,
> clang) but it has changed.  I also see measurable speedups on clang
> binary with ipa-cp on.

Indeed! Some of the gains from these cp's are quite nice!

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-14 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #26 from hubicka at kam dot mff.cuni.cz ---
> It's with LTO, I'll see if non-LTO has the same benefit.  In terms of
> code-size it looks like it accounts for a 20% increase in binary size, but the
> hot function shrinks approx 6x. But of course this increase covers the entire
> binary and there are other hot functions like GetVirtualPixelsFromNexus that
> also shrink significantly when specialized.
Thanks.  I will check this.  We may need to tweak the cost model to first
specialize the functions that reduce more (by doing multiple passes with
decreasing threshold)

It is really nice to see some real improvements from ipa-cp. For years
the pass did almost nothing on the benchmarks I tested (spec, Firefox,
clang) but it has changed.  I also see measurable speedups on clang
binary with ipa-cp on.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #25 from Tamar Christina  ---
(In reply to hubicka from comment #24)
> > Awesome! thanks!
> > 
> > I wonder if we can get rid of the final magic parameter too, we run with
> > --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> > exchange, though still a slight effect on leela but a 12% gain in imagick.
> 
> Interesting - this is Martin Jambor's area but I do not think we were
> aware of this.  I wonder how big a growth is really needed - 80% is quite
> a lot, especially if imagemagick is already a big binary.  Is this an
> LTO or non-LTO build?

It's with LTO, I'll see if non-LTO has the same benefit.  In terms of code-size
it looks like it accounts for a 20% increase in binary size, but the hot
function shrinks approx 6x. But of course this increase covers the entire
binary and there are other hot functions like GetVirtualPixelsFromNexus that
also shrink significantly when specialized.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-14 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #24 from hubicka at kam dot mff.cuni.cz ---
> Awesome! thanks!
> 
> I wonder if we can get rid of the final magic parameter too, we run with
> --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> exchange, though still a slight effect on leela but a 12% gain in imagick.

Interesting - this is Martin Jambor's area but I do not think we were
aware of this.  I wonder how big a growth is really needed - 80% is quite
a lot, especially if imagemagick is already a big binary.  Is this an
LTO or non-LTO build?

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #23 from Tamar Christina  ---
(In reply to Jan Hubicka from comment #20)
> With g:r12-5872-gf157c5362b4844f7676cae2aba81a4cf75bd68d5 we should no
> longer need -fno-inline-functions-called-once

Awesome! thanks!

I wonder if we can get rid of the final magic parameter too, we run with
--param ipa-cp-unit-growth=80 too which seems to have no more effect on
exchange, though still a slight effect on leela but a 12% gain in imagick.

This looks like without the parameter we lose constprop on MorphologyApply
resulting in a much bigger function.  Do you want me to make a new ticket for
this?

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-10 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #22 from Jan Hubicka  ---
We discussed with Richard the option to bias the IRA spilling heuristics to
spill into blocks containing calls (since they are slow anyway), bypassing some
of the guessed frequencies.

Thinking on it, perhaps I should note that I believe that even with profile
feedback we seem to have an IRA problem (it used to be possible to get faster
code without -fprofile-use than with it), so the default heuristics seem to be
misbehaving somehow. With the current guessed profile the frequency of the BB
containing the recursive call is about 0.9, while reality is about 0.99, which
is quite close.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-09 Thread pthaugen at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #21 from pthaugen at gcc dot gnu.org ---
(In reply to Jan Hubicka from comment #20)
> With g:r12-5872-gf157c5362b4844f7676cae2aba81a4cf75bd68d5 we should no
> longer need -fno-inline-functions-called-once

Yes, I see that now with an updated trunk.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-09 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #20 from Jan Hubicka  ---
With g:r12-5872-gf157c5362b4844f7676cae2aba81a4cf75bd68d5 we should no longer
need -fno-inline-functions-called-once

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-09 Thread pthaugen at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #19 from pthaugen at gcc dot gnu.org ---
I tried -fno-inline-functions-called-once and the patches on a Power9 system.
Following are the run times and spill counts (grep -c Spilling
exchange2.fppized.f90.298r.ira). Interesting that the spill counts increase
when -fno-inline-functions-called-once is added, but obviously that additional
spill is on cold paths since execution time improves. Compiler options used are
"-O3 -mcpu=power9 -fpeel-loops -funroll-loops -ffast-math".

                                    time(sec)  Spill
base                                   473      3284

no-inline-functions-called-once        370      5410

patches 1 & 2                          397       461

patches 1 & 2
+ no-inline-functions-called-once      299       870

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-08 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #18 from rsandifo at gcc dot gnu.org ---
(In reply to Tamar Christina from comment #17)
> > On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the
> > ALLOCNO_FREQ ought in principle to be fixed for a given loop iteration
> > count.  It shouldn't grow or shrink based on the value of SPILLED.
> > That's because every execution of the loop body involves exactly one
> > reference to r104.  SPILLED specifies the probability that that single
> > reference is the “call” use rather than the “non-call” use, but it doesn't
> > change the total number of references per iteration.
> > 
> > So I think the only reason we see the different ALLOCNO_FREQs in:
> > 
> >ALLOCNO_FREQ 989, …
> > 
> > vs:
> > 
> >ALLOCNO_FREQ 990, …
> > 
> > is round-off error.  If the values had more precision, I think we'd
> > have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ.
> 
> yeah, that's plausible; as far as I can tell the FREQs are always scaled by
> REG_FREQ_FROM_EDGE_FREQ into [0, BB_FREQ_MAX], and that indeed does an
> integer division.  The general problem is that the IPA frequencies don't
> really seem to have any bounded range, so it always needs to scale.
> 
> So I think you're always going to have this error one way or another
> which may or may not work to your advantage on any given program.
> 
> Maybe we need a way to be a bit more tolerant of this rounding error
> instead?
I guess the thing is: with the costs that the target is giving IRA,
there isn't much of a difference between the cost of the “0.5 way”
and the cost of the “0.51 way” for values of SPILLED around the
0.5 mark.

For spilling r104 we have:

  Factor            Cost type  What
  ================  =========  ==============================================
  1                 store      store outside the loop when R is set
  1024×SPILLED      load       load inside the loop when R is used (call)
  1024×(1-SPILLED)  load       load inside the loop when R is used (non-call)

For allocating a call-clobbered register to r104 we have:

  Factor            Cost type  What
  ================  =========  =====================
  1024×SPILLED      store      store before the call
  1024×SPILLED      load       load after the call

When loads and stores have equal weight, that weight cancels out,
giving:

1025  vs.  2048×SPILLED

So the costs are equal for SPILLED=1025/2048.  Above that spilling
is cheaper, below that allocating a call-clobbered register is
cheaper.
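
As a quick standalone check of that arithmetic (illustrative only; the cost
units and iteration count mirror the numbers above):

-
#include <initializer_list>
#include <iostream>

int main ()
{
  const double iters = 1024, load = 1, store = 1;
  for (double spilled : {0.45, 1025.0 / 2048, 0.55})
    {
      // Spill r104: one store outside the loop plus one load per iteration
      // (each iteration uses the value exactly once, in one branch or the
      // other).
      double spill = store + iters * spilled * load
                     + iters * (1 - spilled) * load;
      // Allocate a call-clobbered register: store + load around each call.
      double alloc = iters * spilled * (store + load);
      std::cout << "SPILLED=" << spilled << ": spill=" << spill
                << " alloc=" << alloc << "\n";
    }
  // The two costs cross exactly at SPILLED = 1025/2048, about 0.5005.
}
-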

IRA has done well to find the exact tipping point (based on the
assumptions it has been told to use), but for values of SPILLED
around the tipping point, the two costs are very close.

So with the current cost model, we're not falling off a cliff
cost-wise when we hit the tipping point or rounding error.
The costs coming in give the impression that there isn't much
to chose between the two approaches when SPILLED is roughly
half-and-half.

Now if SPILLED is completely off the mark, e.g. we say it has
a probability of 0.51 but the actual runtime probability is more
like 0.1, then clearly we're in trouble.  I'm not sure what we can
do about that though.  Perhaps if the edge frequencies (and derived
block frequencies) are pure guesses, we should weight based on
the approximate cost of the block instead.  I.e. the call block
gets a low frequency because calls are expensive, while the
non-call block gets a high frequency because it does simple
arithmetic.  Thus any new instructions added to the call block
are likely to have a proportionately lower effect than any new
instructions added to the non-call block.

I hope we don't have to do that though :-)

> > > > which is cheaper than both the current approaches.  We don't do that
> > > > optimisation yet though, so the current costing seems to reflect what we
> > > > currently generate.
> > > 
> > > In many (if not most) Arches stores are significantly cheaper than the 
> > > loads
> > > though. So the store before the call doesn't end up making that much of a
> > > difference, but yes it adds up if you have many of them.
> > Yeah.  Could we fix the problem that way instead?  The only reason IRA is
> > treating loads and stores as equal cost is because aarch64 asked it to :-)
> 
> I tried a quick check and it does fix the testcase but not the benchmark,
> which is not entirely unexpected thinking about it, because x86 does
> correctly model the store costs.
Yeah, was afraid that would be the case.

> I can try fixing the costs correctly and try reducing again.  It looks like
> it still thinks spilling to memory is cheaper than caller-save reloads.
Thanks.  I think that would help.  The current aarch64 costs are likely
to distort things if we try to reduce using them.

IMO the initial reduction has been useful because it gives a nice
crystallised example of why the load/store cost ratio matters.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #17 from Tamar Christina  ---
> On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the
> ALLOCNO_FREQ ought in principle to be fixed for a given loop iteration
> count.  It shouldn't grow or shrink based on the value of SPILLED.
> That's because every execution of the loop body involves exactly one
> reference to r104.  SPILLED specifies the probability that that single
> reference is the “call” use rather than the “non-call” use, but it doesn't
> change the total number of references per iteration.
> 
> So I think the only reason we see the different ALLOCNO_FREQs in:
> 
>ALLOCNO_FREQ 989, …
> 
> vs:
> 
>ALLOCNO_FREQ 990, …
> 
> is round-off error.  If the values had more precision, I think we'd
> have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ.

yeah, that's plausible; as far as I can tell the FREQs are always scaled by
REG_FREQ_FROM_EDGE_FREQ into [0, BB_FREQ_MAX], and that indeed does an
integer division.  The general problem is that the IPA frequencies don't
really seem to have any bounded range, so it always needs to scale.

So I think you're always going to have this error one way or another
which may or may not work to your advantage on any given program.

Maybe we need a way to be a bit more tolerant of this rounding error
instead?
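
To make the round-off concrete, a tiny standalone demo (the constants are
meant to mirror GCC's REG_FREQ_MAX and BB_FREQ_MAX, to the best of my
knowledge, but the input frequencies are made up):

-
#include <initializer_list>
#include <iostream>

int main ()
{
  const long bb_freq_max = 10000, reg_freq_max = 1000;
  // Two "true" frequencies that differ by less than 0.1%:
  for (long freq : {9895L, 9904L})
    {
      long scaled = freq * reg_freq_max / bb_freq_max;  // truncating division
      std::cout << freq << " -> " << scaled << "\n";    // prints 989 and 990
    }
  // A tiny change in the input flips the scaled value by one unit -- the
  // same off-by-one seen between the ALLOCNO_FREQ 989 and 990 dumps.
}
-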

> > Instead I've chosen a middle ground here (same as yours but done in
> > ira_tune_allocno_costs instead), which is to store and load only inside
> > the loop, but to do so only in the BB which contains the call.
> I don't think you were saying otherwise, but just FTR: I wasn't
> proposing a solution, I was just describing a hack.  It seemed
> to me like IRA was making the right decision for r104 in isolation,
> for the given SPILLED value and target costs.  My hack to force
> an allocation for r104 made things worse.

Ah ok, fair enough :)

> 
> > > which is cheaper than both the current approaches.  We don't do that
> > > optimisation yet though, so the current costing seems to reflect what we
> > > currently generate.
> > 
> > In many (if not most) Arches stores are significantly cheaper than the loads
> > though. So the store before the call doesn't end up making that much of a
> > difference, but yes it adds up if you have many of them.
> Yeah.  Could we fix the problem that way instead?  The only reason IRA is
> treating loads and stores as equal cost is because aarch64 asked it to :-)

I tried a quick check and it does fix the testcase but not the benchmark, which
is not entirely unexpected thinking about it, because x86 does correctly model
the store costs.

I can try fixing the costs correctly and try reducing again.  It looks like it
still thinks spilling to memory is cheaper than caller-save reloads.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-08 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #16 from rsandifo at gcc dot gnu.org ---
(In reply to Tamar Christina from comment #15)
> > From that point of view, it doesn't look like the memory and register
> > costs of R are too wrong here.  The things being costed are the store
> > and load around the call (which do exist if we allocate a call-clobbered
> > register) and the loads at each use site (which do exist if we spill R).
> 
> Indeed, I don't think the heuristics are wrong, but CALL_FREQ grows much
> more quickly than BB_FREQ, and at the smaller values they are a bit
> sensitive to any changes.  The edge probabilities can barely change while
> the BB_FREQ can change dramatically.
On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the
ALLOCNO_FREQ ought in principle to be fixed for a given loop iteration
count.  It shouldn't grow or shrink based on the value of SPILLED.
That's because every execution of the loop body involves exactly one
reference to r104.  SPILLED specifies the probability that that single
reference is the “call” use rather than the “non-call” use, but it doesn't
change the total number of references per iteration.

So I think the only reason we see the different ALLOCNO_FREQs in:

   ALLOCNO_FREQ 989, …

vs:

   ALLOCNO_FREQ 990, …

is round-off error.  If the values had more precision, I think we'd
have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ.

At one extreme, if SPILLED is very low (but if we nevertheless
continue to duplicate the common part of the loop body) then
storing and loading around the call becomes very cheap
(ALLOCNO_CALL_FREQ is much lower than the conceptually fixed
ALLOCNO_FREQ).  If SPILLED is very high (but again we continue
to duplicate the common part of the loop body) then storing and
loading around the call becomes very expensive.  So I guess the
question is: where is the cut-off?

Given that:

- the loop iterates 1024 times
- there is a single store outside the loop
- the target claims that loads and stores have equal cost

ISTM that the cut-off really is in the range [0.5, 0.51].  If the
value of SPILLED reflects the real probability at runtime then I think
IRA is doing the right thing, given the claimed costs of storing and
loading.

I guess the problems are:

- SPILLED is only a guess
- the aarch64 costs of stores and loads don't reflect reality

> > Like Feng Xue says in comment 1, I think the main missed optimisation
> > opportunity here is that foo + 1024 is invariant, so if we allocate
> > a call-clobbered register, we could save R once outside the loop
> > and reload it after each call.  That would give:
> > 
> > - a store of R outside the loop (low execution count)
> > - a load of R inside the loop (after the call) with freq 0.51 * loop iters
> > 
> 
> Yes, that is the ideal solution, but also requires more changes to RA.
True :-)

> Instead I've chosen a middle ground here (same as yours but done in
> ira_tune_allocno_costs instead), which is to store and load only inside
> the loop, but to do so only in the BB which contains the call.
I don't think you were saying otherwise, but just FTR: I wasn't
proposing a solution, I was just describing a hack.  It seemed
to me like IRA was making the right decision for r104 in isolation,
for the given SPILLED value and target costs.  My hack to force
an allocation for r104 made things worse.

> > which is cheaper than both the current approaches.  We don't do that
> > optimisation yet though, so the current costing seems to reflect what we
> > currently generate.
> 
> In many (if not most) Arches stores are significantly cheaper than the loads
> though. So the store before the call doesn't end up making that much of a
> difference, but yes it adds up if you have many of them.
Yeah.  Could we fix the problem that way instead?  The only reason IRA is
treating loads and stores as equal cost is because aarch64 asked it to :-)

At least for the reduced testcase, ISTM that IRA is making the right choice
for the “loads and stores are equally expensive” assumption.  It also seems
to make the right choice for a “loads are more expensive than stores”
assumption.  It's up to the target to choose which is right.

So I think the reduced testcase is showing a target bug.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #15 from Tamar Christina  ---
> That is, we're trading two memory accesses in the call branch
> (if we allocate R) against one memory access in both branches
> (if we spill R).  As the call branch gets more likely,
> the cost of doing two memory accesses there gets higher
> relative to the cost of doing one memory access in both branches.
> And that seems like the right behaviour in principle.
> 
> From that point of view, it doesn't look like the memory and register
> costs of R are too wrong here.  The things being costed are the store
> and load around the call (which do exist if we allocate a call-clobbered
> register) and the loads at each use site (which do exist if we spill R).

Indeed, I don't think the heuristics are wrong, but CALL_FREQ grows much more
quickly than BB_FREQ, and at the smaller values they are a bit sensitive to any
changes.  The edge probabilities can barely change while the BB_FREQ can change
dramatically.

> 
> Like Feng Xue says in comment 1, I think the main missed optimisation
> opportunity here is that foo + 1024 is invariant, so if we allocate
> a call-clobbered register, we could save R once outside the loop
> and reload it after each call.  That would give:
> 
> - a store of R outside the loop (low execution count)
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters
> 

Yes, that is the ideal solution, but also requires more changes to RA.
Instead I've chosen a middle ground here (same as yours but done in
ira_tune_allocno_costs instead), which is to store and load only inside
the loop, but to do so only in the BB which contains the call.

This is a major improvement over the current situation because when you
have several nested loops where the value is invariant across a number of them,
you run into problems when each of these BBs has naturally very high register
pressure.

As you say:

> - a store of R outside the loop (low execution count)
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters
> - a load of R inside the loop with freq 0.49 * loop iters

and if the loop has various BBs (like a long if/then/else-if/else chain) the
load has to happen in every BB in the loop.  That's why we get the large amount
of spills we currently do.

By forcing it to spill only in the BB with the call inside the loop the other
BBs are freed from all the loads.

> If we force R to be allocated a call-clobbered register instead
> of being spilled (and changing nothing else, via a hack to
> ira-color.c:improve_allocation) then we generate:
> 
> - a store of R inside the loop (before the call) with freq 0.51 * loop iters
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters

I essentially did the same thing, but I think in a more conservative way.  When
you just have a single call inside the entire loop I force it to assign the
call-clobbered register if it needs it.  This removed the loads with freq 0.49
but left the ones with freq 0.51.

I use call counts as the measure here because with 1 call and multiple BBs
inside the live range you know that at most 1 BB will have a call and so the
rest won't have any.

Essentially, if you have high register pressure, only make the part that
increases it more pay for the spills.  As you say it's not perfect, but
it's a conservative improvement over the current situation.

> which is cheaper than both the current approaches.  We don't do that
> optimisation yet though, so the current costing seems to reflect what we
> currently generate.

In many (if not most) Arches stores are significantly cheaper than the loads
though. So the store before the call doesn't end up making that much of a
difference, but yes it adds up if you have many of them.

So indeed removing it is optimal, but that seems like a very hard one to do.  I
would assume that the live range for the loop starts at the body of the loop.
So I would imagine it's very hard to tell reload to spill outside of the
allocnos it's currently allocating for?

> 
> I don't know how well the above translates to the original example
> though.  Are some of the spilled values in exchange loop-invariant
> as well?

Yes, I believe so. It's a bit hard for me to tell since the functions are huge
and have many nested loops... But in RTL the BBs are quite large as well after
the constprop and recursive inlining stuff.

But the behaviour is consistent with the minimal problem here.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-07 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #14 from rsandifo at gcc dot gnu.org ---
Thanks for the nice cut-down example.

In the original comment and the covering note to patch 1,
the highlighted spilled value is the final (terminating)
value of foo + 1024.  Let's start with the SPILLED=0.51
case and consider that register ("R") in isolation without
changing anything else.

If we spill R to memory (as for the current SPILLED=0.51 code),
we generate:

- a store of R outside the loop (low execution count)
- a load of R inside the loop (after the call) with freq 0.51 * loop iters
- a load of R inside the loop with freq 0.49 * loop iters

If we force R to be allocated a call-clobbered register instead
of being spilled (and changing nothing else, via a hack to
ira-color.c:improve_allocation) then we generate:

- a store of R inside the loop (before the call) with freq 0.51 * loop iters
- a load of R inside the loop (after the call) with freq 0.51 * loop iters

So the in-loop cost of the second (allocated) version is higher
than the in-loop cost of the first (spilled) version.  As the value
of SPILLED increases, the difference between the two also increases:
spilling R gets progressively “cheaper” than allocating a
call-clobbered register to R.

That is, we're trading two memory accesses in the call branch
(if we allocate R) against one memory access in both branches
(if we spill R).  As the call branch gets more likely,
the cost of doing two memory accesses there gets higher
relative to the cost of doing one memory access in both branches.
And that seems like the right behaviour in principle.

From that point of view, it doesn't look like the memory and register
costs of R are too wrong here.  The things being costed are the store
and load around the call (which do exist if we allocate a call-clobbered
register) and the loads at each use site (which do exist if we spill R).

Like Feng Xue says in comment 1, I think the main missed optimisation
opportunity here is that foo + 1024 is invariant, so if we allocate
a call-clobbered register, we could save R once outside the loop
and reload it after each call.  That would give:

- a store of R outside the loop (low execution count)
- a load of R inside the loop (after the call) with freq 0.51 * loop iters

which is cheaper than both the current approaches.  We don't do that
optimisation yet though, so the current costing seems to reflect what we
currently generate.

I don't know how well the above translates to the original example
though.  Are some of the spilled values in exchange loop-invariant
as well?

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #13 from Tamar Christina  ---
Created attachment 51943
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51943&action=edit
0002-PATCH-2-2-GCC-reload-Don-t-move-if-the-next-BB-has-m.patch

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #12 from Tamar Christina  ---
Created attachment 51942
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51942&action=edit
0001-PATCH-1-2-GCC-reload-Weigh-available-callee-saves-ba.patch

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #11 from Tamar Christina  ---
(In reply to Jan Hubicka from comment #10)
> Thanks for looking into this. I was planning to try to contact Vladimir
> about the IRA behaviour here, but there was always something else to work
> on with higher priority.  I wonder if you could possibly attach the WIP patch?
> 

I did exchange a few emails on the subject with him, and we agreed that the
best way forward would be to try to improve the heuristics instead of changing
the register allocator algorithm or handling live ranges differently.  So the
patches I've attached try to do just that.

They're git patches where the description contains what they're trying to do.
With -fno-inline-functions-called-once on AArch64 they add about a 13%
improvement on exchange and no regressions anywhere else. On x86 the
improvement last measured was 24%, and the patches reduce the number of spills
by ~90%.
The improvement on x86 is higher since AArch64 has more integer registers so we
had more leeway here.

> Teaching the inliner to not inline this function as called once is probably
> just a matter of adding a limit capping loop depth. I think that is a
> meaningful heuristic, since inlining a very large function at a large loop
> depth is probably not a good idea. I have a patch for testing.

That's great! It would be nice not needing the flag anymore.  I tried to see if
I could address this from a regalloc point of view but it didn't look
promising...

> 
> Do you know of other benchmarks where -fno-inline-functions-called-once helps
> (and ideally know why)?  We plan to look into roms and tramp3d, which also
> regress.

Hmm, no, I can't say I do; we only use it for the SPEC CPU intrate Fortran
benchmarks, so I haven't tracked the others with it.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-03 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #10 from Jan Hubicka  ---
Thanks for looking into this. I was planning to try to contact Vladimir about
the IRA behaviour here, but there was always something else to work on with
higher priority.  I wonder if you could possibly attach the WIP patch?

Teaching the inliner to not inline this function as called once is probably
just a matter of adding a limit capping loop depth. I think that is a
meaningful heuristic, since inlining a very large function at a large loop
depth is probably not a good idea. I have a patch for testing.

Do you know of other benchmarks where -fno-inline-functions-called-once helps
(and ideally know why)?  We plan to look into roms and tramp3d, which also
regress.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-03 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

Jan Hubicka  changed:

   What|Removed |Added

 CC||pthaugen at gcc dot gnu.org

--- Comment #9 from Jan Hubicka  ---
*** Bug 96825 has been marked as a duplicate of this bug. ***

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-11-28 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #8 from Tamar Christina  ---
> 
> I wonder how the situation looks on AArch64?

The situation didn't improve; up until the end of stage-1 we were seeing a 6%
perf uplift from somewhere, which seems to have gone away now (in a commit range
with a non-IPA-related patch).

The major problem is still the spills. Talking to Vlad, I took a look at
improving the Chaitin-Briggs heuristics for spilling in the presence of
calls and how IRA tries to improve the allocation by moving spills along the
call graph.

By improving on these heuristics I was able to reduce the number of spills and
saw improvements on both x86 and AArch64, which brought us back to the old
numbers.

However, this same information is used by other areas such as register
preferences, and so I had a regression in shrink wrapping.  There's also an
issue where x86 seems to assign negative values to register costs to indicate
it REALLY wants a given register.  This normally works because the penalty
applied is usually large and cancels out the negative cost.  But now the value
stays negative, causing the register to not be used at all.

To fix these I need to keep track of the penalties and the costs separately but
did not get time to finish that work before the end of stage-1.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-11-28 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #7 from Jiangning Liu  ---
Without reverting the commit g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92, we
still see the exchange2 performance issue on aarch64. BTW, we have been using
-fno-inline-functions-called-once to get the best performance numbers for
exchange2.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-11-28 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #6 from Jan Hubicka  ---
I am sorry for getting back to this so late.  This stage1 we spent some time
with Martin improving the ipa-cp profile updating and looked again into the
realism of the profile. The codegen has also recently improved somewhat due to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103227 and due to modref
propagation.

I still believe that for trunk GCC we should not have a patch that
intentionally makes the profile unrealistic just to make IRA work better by
accident, since it does not seem to help anything real-world except this
somewhat odd benchmark. So I wonder if we can make the profile work better for
IRA without actually making it unrealistic or tampering with the ipa-cp cloning
heuristics.

I added code that compares guessed profile with feedback
https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585599.html
and also fixed/improved code to dump stats about profile updates
https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585578.html

This gives a bit more of a handle on how realistic the profile is.  The answer
is: not very, in general, but at least for basic blocks containing calls it is
not bad (we guess 0.9 while reality is 0.999).
I am not sure how much better we can do statically, since this is such a
special case of backtracking.

Last week we also noticed that with -Ofast we inline the newly produced clones
together, which makes IRA's job a lot harder.  This is done by
-finline-functions-called-once, and we tend to inline blocks of 2 or 3 clones,
leading to 18 or 27 nested loops in each.  Simply disabling this optimization
gets another performance hit.

I filed PR https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103454 and I think
we could teach the inliner to not inline functions called once at large loop
depths and restrict the large function growth here, since there are multiple
benchmarks that now degrade on this.

Worse yet, the heuristic for inlining functions called once is not very smart,
and it depends on the order of cgraph_nodes in the linked list, which is a bit
random.

I wonder how the situation looks on AArch64?