Re: [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
on zen2 and zen3 with -flto the speedup seems to be circa 12% for both
-O2 and -Ofast -march=native, which is very nice! Zen1 for some reason
sees less improvement, about 6%. With PGO it is 3.8%. Overall it seems
a win, but there are a few noteworthy issues.

I also see a 6.69% regression on x64 with -Ofast -march=native -flto
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=475.377.0
and perhaps 3-5% on sphinx
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.280.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=227.280.0

For non-SPEC benchmarks there is a regression on nbench
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=26.645.1

There are also large changes in tsvc
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report
It may be noise since the kernels are tiny, but for example x293
reproduces both on kabylake and zen with an 80-90% regression, which
may be easy to track down (the kernel is included in the testsuite).
The same regression is not seen on zen3, so it may be ISA specific or
so.

Finally there seem to be relatively large code size savings on
polyhedron benchmarks today (8% on capacita). Thanks a lot!
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #40 from CVS Commits ---
The master branch has been updated by Richard Sandiford:

https://gcc.gnu.org/g:037cc0b4a6646cc86549247a3590215ebd5c4c43

commit r12-6416-g037cc0b4a6646cc86549247a3590215ebd5c4c43
Author: Richard Sandiford
Date:   Mon Jan 10 14:47:09 2022 +

    ira: Handle "soft" conflicts between cap and non-cap allocnos

    This patch looks for allocno conflicts of the following form:

    - One allocno (X) is a cap allocno for some non-cap allocno X2.
    - X2 belongs to some loop L2.
    - The other allocno (Y) is a non-cap allocno.
    - Y is an ancestor of some allocno Y2 in L2.
    - Y2 is not referenced in L2 (that is, ALLOCNO_NREFS (Y2) == 0).
    - Y can use a different allocation from Y2.

    In this case, Y's register is live across L2 but is not used
    within it, whereas X's register is used only within L2.  The
    conflict is therefore only "soft", in that it can easily be
    avoided by spilling Y2 inside L2 without affecting any insn
    references.

    In principle we could do this for ALLOCNO_NREFS (Y2) != 0 too,
    with the callers then taking Y2's ALLOCNO_MEMORY_COST into
    account.  There would then be no "cliff edge" between a Y2 that
    has no references and a Y2 that has (say) a single cold reference.
    However, doing that isn't necessary for the PR and seems to give
    variable results in practice.  (fotonik3d_r improves slightly but
    namd_r regresses slightly.)  It therefore seemed better to start
    with the higher-value zero-reference case and see how things go.

    On top of the previous patches in the series, this fixes the
    exchange2 regression seen in GCC 11.

    gcc/
            PR rtl-optimization/98782
            * ira-int.h (ira_soft_conflict): Declare.
            * ira-color.c (max_soft_conflict_loop_depth): New constant.
            (ira_soft_conflict): New function.
            (spill_soft_conflicts): Likewise.
            (assign_hard_reg): Use them to handle the case described by
            the comment above ira_soft_conflict.
            (improve_allocation): Likewise.
            * ira.c (check_allocation): Allow allocnos with "soft"
            conflicts to share the same register.

    gcc/testsuite/
            * gcc.target/aarch64/reg-alloc-4.c: New test.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #39 from CVS Commits ---
The master branch has been updated by Richard Sandiford:

https://gcc.gnu.org/g:01f3e6a40e7202310abbeb41c345d325bd69554f

commit r12-6415-g01f3e6a40e7202310abbeb41c345d325bd69554f
Author: Richard Sandiford
Date:   Mon Jan 10 14:47:08 2022 +

    ira: Consider modelling caller-save allocations as loop spills

    If an allocno A in an inner loop L spans a call, a parent allocno AP
    can choose to handle a call-clobbered/caller-saved hard register R
    in one of two ways:

    (1) save R before each call in L and restore R after each call
    (2) spill R to memory throughout L

    (2) can be cheaper than (1) in some cases, particularly if L does
    not reference A.

    Before the patch we always did (1).  The patch adds support for
    picking (2) instead, when it seems cheaper.  It builds on the
    earlier support for not propagating conflicts to parent allocnos.

    gcc/
            PR rtl-optimization/98782
            * ira-int.h (ira_caller_save_cost): New function.
            (ira_caller_save_loop_spill_p): Likewise.
            * ira-build.c (ira_propagate_hard_reg_costs): Test whether it
            is cheaper to spill a call-clobbered register throughout a
            loop rather than spill it around each individual call.  If
            so, treat all call-clobbered registers as conflicts and...
            (propagate_allocno_info): ...do not propagate call
            information from the child to the parent.
            * ira-color.c (move_spill_restore): Update accordingly.
            * ira-costs.c (ira_tune_allocno_costs): Use
            ira_caller_save_cost.

    gcc/testsuite/
            * gcc.target/aarch64/reg-alloc-3.c: New test.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #38 from CVS Commits ---
The master branch has been updated by Richard Sandiford:

https://gcc.gnu.org/g:d54565d87ff79b882208dfb29af50232033c233d

commit r12-6413-gd54565d87ff79b882208dfb29af50232033c233d
Author: Richard Sandiford
Date:   Mon Jan 10 14:47:07 2022 +

    ira: Add ira_subloop_allocnos_can_differ_p

    color_pass has two instances of the same code for propagating
    non-cap assignments from parent loops to subloops.  This patch adds
    a helper function for testing when such propagations are required
    for correctness and uses it to remove the duplicated code.

    A later patch will use this in ira-build.c too, which is why the
    function is exported to ira-int.h.

    No functional change intended.

    gcc/
            PR rtl-optimization/98782
            * ira-int.h (ira_subloop_allocnos_can_differ_p): New
            function, extracted from...
            * ira-color.c (color_pass): ...here.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #37 from CVS Commits ---
The master branch has been updated by Richard Sandiford:

https://gcc.gnu.org/g:909a4b4764c4f270f09ccb2a950c91b21ed7b33a

commit r12-6412-g909a4b4764c4f270f09ccb2a950c91b21ed7b33a
Author: Richard Sandiford
Date:   Mon Jan 10 14:47:07 2022 +

    ira: Add comments and fix move_spill_restore calculation

    This patch adds comments to describe each use of
    ira_loop_border_costs.  I think this highlights that
    move_spill_restore was using the wrong cost in one case, which came
    from transposing [0] and [1] in the original
    (pre-ira_loop_border_costs) ira_memory_move_cost expressions.  The
    difference would only be noticeable on targets that distinguish
    between load and store costs.

    gcc/
            PR rtl-optimization/98782
            * ira-color.c (color_pass): Add comments to describe the
            spill costs.
            (move_spill_restore): Likewise.  Fix reversed calculation.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #36 from Hongtao.liu ---
(In reply to pthaugen from comment #33)
> (In reply to rsand...@gcc.gnu.org from comment #32)
> > Created attachment 52102 [details]
> > Alternative patch
> >
> > This patch is a squash of several ira tweaks that together recover
> > the pre-GCC11 exchange2 performance on aarch64. It isn't ready for
> > trunk yet (hence lack of comments and changelog). It would be great
> > to hear whether/how it works on other targets though.
>
> I tried the patch on a Power9 system. Execution time went from 371
> seconds to 291.
The patch also recovers 548.exchange2_r performance on ICX, thanks.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

rsandifo at gcc dot gnu.org changed:

           What    |Removed  |Added
----------------------------------------------------------------------
           URL     |         |https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587761.html

--- Comment #35 from rsandifo at gcc dot gnu.org ---
(In reply to pthaugen from comment #33)
> I tried the patch on a Power9 system. Execution time went from 371
> seconds to 291.
Thanks Pat!  Given that it seems to work OK on aarch64 and Power, I've
submitted a fixed version of the series for review:
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587761.html
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #34 from pthaugen at gcc dot gnu.org ---
(In reply to pthaugen from comment #33)
>
> I tried the patch on a Power9 system. Execution time went from 371
> seconds to 291.
What I should have included: that is in line with, or even slightly
better than, the 2 patches posted by Tamar.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #33 from pthaugen at gcc dot gnu.org ---
(In reply to rsand...@gcc.gnu.org from comment #32)
> Created attachment 52102 [details]
> Alternative patch
>
> This patch is a squash of several ira tweaks that together recover the
> pre-GCC11 exchange2 performance on aarch64. It isn't ready for trunk
> yet (hence lack of comments and changelog). It would be great to hear
> whether/how it works on other targets though.

I tried the patch on a Power9 system. Execution time went from 371
seconds to 291.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

rsandifo at gcc dot gnu.org changed:

           What      |Removed                      |Added
----------------------------------------------------------------------
           Status    |NEW                          |ASSIGNED
           Assignee  |unassigned at gcc dot gnu.org|rsandifo at gcc dot gnu.org

--- Comment #32 from rsandifo at gcc dot gnu.org ---
Created attachment 52102
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52102&action=edit
Alternative patch

This patch is a squash of several ira tweaks that together recover the
pre-GCC11 exchange2 performance on aarch64.  It isn't ready for trunk
yet (hence lack of comments and changelog).  It would be great to hear
whether/how it works on other targets though.

The patch bootstraps on aarch64-linux-gnu and x86_64-linux-gnu, but
there are some new scan-assembler failures that need looking at.

Quoting from the covering message, the main changes are:

(1) Add ira_loop_border_costs to simplify spill cost calculations
    (NFC intended)

(2) Avoid freeing updated costs until the loop node has been fully
    allocated.  This in turn allows:

(3) Make improve_allocation work exclusively on updated costs, rather
    than using a mixture of updated and original costs.  One reason
    this matters is that the register costs only make sense relative
    to the memory costs, so in some cases, a common register cost is
    subtracted from the updated memory cost instead of being added to
    each individual updated register cost.

(4) If a child allocno has a hard register conflict, allow the parent
    allocno to handle the conflict by spilling to memory throughout
    the child allocno's loop.  This carries the child allocno's full
    memory cost plus the cost of spilling to memory on entry to the
    loop and restoring it on exit, but this can still be cheaper than
    spilling the entire parent allocno.  In particular, it helps for
    allocnos that are live across a loop but not referenced within it,
    since the child allocno's memory cost is 0 in that case.

(5) Extend (4) to cases in which the child allocno is live across a
    call.  The parent then has a free choice between spilling
    call-clobbered registers around each call (as normal) or spilling
    them on entry to the loop, keeping the allocno in memory
    throughout the loop, and restoring them on exit from the loop.

(6) Detect "soft conflicts" in which:

    - one allocno (A1) is a cap whose (transitive) "real" allocno is
      A1'
    - A1' occurs in loop L1'
    - the other allocno (A2) is a non-cap allocno
    - the equivalent of A2 is live across L1' (hence the conflict)
      but has no references in L1'

    In this case we can spill A2 around L1' (or perhaps some parent
    loop) and reuse the same register for A1'.  A1 and A2 can then use
    the same hard register, provided that we make sure not to
    propagate A1's allocation to A1'.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #31 from rsandifo at gcc dot gnu.org ---
I don't know how relevant this is to the exchange2 problem yet, but
for:

--------------------------------------------------------------------
#define PROB 0.95

struct L {
  int data;
  L *next;
  L *inner;
};

template<int N> struct S {
  static __attribute__((always_inline)) void f(L *head, int inc) {
    while (head) {
      asm volatile ("// Loop %0" :: "i" (N));
      int subinc = head->data + inc;
      if (__builtin_expect_with_probability (bool(head->inner), 0, PROB))
        S<N - 1>::f(head->inner, subinc);
      head->data = subinc;
      head = head->inner;
    }
  }
};

template<> struct S<0> {
  static void f(L *, int) {
    asm volatile ("// foo" ::: "x0", "x1", "x2", "x3",
                  "x4", "x5", "x6", "x7");
  }
};

void x(L *head) { S<10>::f(head, 1); }
--------------------------------------------------------------------

compiled on aarch64 with -O3, we always seem to spill in the outer
loops regardless of the value of PROB.  That is, the inner loops
always seem to get priority even if their bb frequencies are low.
I would have expected that moving PROB from 0.05 (inner loop very
likely to be executed) to 0.95 (inner loop very likely to be skipped)
would have moved the spills from the outer loops to the inner loops.

An unrelated issue on aarch64 is that IP0 and IP1 are explicitly
clobbered by call instructions, via CALL_INSN_FUNCTION_USAGE.  We have
to do this because the linker is allowed to insert code that clobbers
those registers even if we “know” that the target of the call doesn't
clobber them.  However, clobbering the registers that way prevents
them from being used for values that are live across the call, even if
the call is very unlikely to be executed.  Hacking a fix for that does
reduce the number of spills, but doesn't get rid of the ones that
matter in exchange2.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #30 from Tamar Christina ---
(In reply to Martin Jambor from comment #29)
> (In reply to Tamar Christina from comment #23)
> > I wonder if we can get rid of the final magic parameter too, we run
> > with --param ipa-cp-unit-growth=80 too which seems to have no more
> > effect on exchange, though still a slight effect on leela but a 12%
> > gain in imagick.
> >
> > This looks like without the parameter we lose constprop on
> > MorphologyApply resulting in a much bigger function. Do you want me
> > to make a new ticket for this?
>
> Indeed I did not know about this. I think tracking it in a (separate)
> PR would make sense. Thanks a lot for the pointer!

Thanks, I created a new ticket:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103734
The threshold seems to be a lot lower than 80%; 20% already seems to
be enough.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #29 from Martin Jambor ---
(In reply to Tamar Christina from comment #23)
> I wonder if we can get rid of the final magic parameter too, we run
> with --param ipa-cp-unit-growth=80 too which seems to have no more
> effect on exchange, though still a slight effect on leela but a 12%
> gain in imagick.
>
> This looks like without the parameter we lose constprop on
> MorphologyApply resulting in a much bigger function. Do you want me to
> make a new ticket for this?

Indeed I did not know about this.  I think tracking it in a (separate)
PR would make sense.  Thanks a lot for the pointer!
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #28 from Tamar Christina ---
(In reply to pthaugen from comment #19)
> I tried -fno-inline-functions-called-once and the patches on a Power9
> system. Following are the run times and spill counts (grep -c Spilling
> exchange2.fppized.f90.298r.ira). Interesting that the spill counts
> increase when -fno-inline-functions-called-once is added, but
> obviously that additional spill is on cold paths since execution time
> improves. Compiler options used are "-O3 -mcpu=power9 -fpeel-loops
> -funroll-loops -ffast-math".
>
>                                           time(sec)  Spill
> base                                        473       3284
> no-inline-functions-called-once             370       5410
> patches 1 & 2                               397        461
> patches 1 & 2
>  + no-inline-functions-called-once          299        870

Ah thanks for testing!  So at least we know that all architectures are
suffering from the same underlying issue :)
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #27 from Tamar Christina ---
(In reply to hubicka from comment #26)
> > It's with LTO, I'll see if non-LTO has the same benefit. In terms of
> > code-size it looks like it accounts for a 20% increase in binary
> > size, but the hot function shrinks approx 6x. But of course this
> > increase covers the entire binaries and there are other hot
> > functions like GetVirtualPixelsFromNexus that also shrink
> > significantly when specialized.
>
> Thanks. I will check this. We may need to tweak the cost model to
> first specialize the functions that reduce more (by doing multiple
> passes with decreasing threshold)

Thanks! It looks like it doesn't need LTO; just plain -Ofast is enough
to get the gain.

> It is really nice to see some real improvements from ipa-cp. For years
> the pass did almost nothing on the benchmarks I tested (spec, Firefox,
> clang) but it has changed. I also see measurable speedups on the clang
> binary with ipa-cp on.

Indeed! Some of the gains from these cp's are quite nice!
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #26 from hubicka at kam dot mff.cuni.cz ---
> It's with LTO, I'll see if non-LTO has the same benefit. In terms of
> code-size it looks like it accounts for a 20% increase in binary size,
> but the hot function shrinks approx 6x. But of course this increase
> covers the entire binaries and there are other hot functions like
> GetVirtualPixelsFromNexus that also shrink significantly when
> specialized.

Thanks.  I will check this.  We may need to tweak the cost model to
first specialize the functions that reduce more (by doing multiple
passes with decreasing threshold).

It is really nice to see some real improvements from ipa-cp.  For
years the pass did almost nothing on the benchmarks I tested (spec,
Firefox, clang) but it has changed.  I also see measurable speedups on
the clang binary with ipa-cp on.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #25 from Tamar Christina ---
(In reply to hubicka from comment #24)
> > Awesome! thanks!
> >
> > I wonder if we can get rid of the final magic parameter too, we run
> > with --param ipa-cp-unit-growth=80 too which seems to have no more
> > effect on exchange, though still a slight effect on leela but a 12%
> > gain in imagick.
>
> Interesting - this is Martin Jambor's area but I do not think we were
> aware of this. I wonder how big a growth is really needed - 80% is
> quite a lot, especially if imagemagick is already a big binary. Is
> this about an LTO or non-LTO build?

It's with LTO, I'll see if non-LTO has the same benefit.  In terms of
code-size it looks like it accounts for a 20% increase in binary size,
but the hot function shrinks approx 6x.  But of course this increase
covers the entire binaries and there are other hot functions like
GetVirtualPixelsFromNexus that also shrink significantly when
specialized.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #24 from hubicka at kam dot mff.cuni.cz ---
> Awesome! thanks!
>
> I wonder if we can get rid of the final magic parameter too, we run
> with --param ipa-cp-unit-growth=80 too which seems to have no more
> effect on exchange, though still a slight effect on leela but a 12%
> gain in imagick.

Interesting - this is Martin Jambor's area but I do not think we were
aware of this.  I wonder how big a growth is really needed - 80% is
quite a lot, especially if imagemagick is already a big binary.  Is
this about an LTO or non-LTO build?
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #23 from Tamar Christina ---
(In reply to Jan Hubicka from comment #20)
> With g:r12-5872-gf157c5362b4844f7676cae2aba81a4cf75bd68d5 we should no
> longer need -fno-inline-functions-called-once

Awesome! thanks!

I wonder if we can get rid of the final magic parameter too, we run
with --param ipa-cp-unit-growth=80 too which seems to have no more
effect on exchange, though still a slight effect on leela but a 12%
gain in imagick.

This looks like without the parameter we lose constprop on
MorphologyApply resulting in a much bigger function.  Do you want me to
make a new ticket for this?
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #22 from Jan Hubicka ---
We discussed with Richard the option to bias the IRA spilling
heuristics to spill into blocks containing calls (since they are slow
anyway), bypassing some of the guessed frequencies.

Thinking about it, perhaps I should note that I believe we have an IRA
problem even with profile feedback (it used to be possible to get
faster code without -fprofile-use than with it), so the default
heuristics seem to be misbehaving somehow.  With the current guessed
profile the frequency of the BB containing the recursive call is about
0.9, while in reality it is about 0.99, which is quite close.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #21 from pthaugen at gcc dot gnu.org ---
(In reply to Jan Hubicka from comment #20)
> With g:r12-5872-gf157c5362b4844f7676cae2aba81a4cf75bd68d5 we should no
> longer need -fno-inline-functions-called-once

Yes, I see that now with an updated trunk.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #20 from Jan Hubicka ---
With g:r12-5872-gf157c5362b4844f7676cae2aba81a4cf75bd68d5 we should no
longer need -fno-inline-functions-called-once
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #19 from pthaugen at gcc dot gnu.org ---
I tried -fno-inline-functions-called-once and the patches on a Power9
system.  Following are the run times and spill counts (grep -c Spilling
exchange2.fppized.f90.298r.ira).  Interesting that the spill counts
increase when -fno-inline-functions-called-once is added, but obviously
that additional spill is on cold paths since execution time improves.
Compiler options used are "-O3 -mcpu=power9 -fpeel-loops -funroll-loops
-ffast-math".

                                          time(sec)  Spill
base                                        473       3284
no-inline-functions-called-once             370       5410
patches 1 & 2                               397        461
patches 1 & 2
 + no-inline-functions-called-once          299        870
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #18 from rsandifo at gcc dot gnu.org ---
(In reply to Tamar Christina from comment #17)
> > On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the
> > ALLOCNO_FREQ ought in principle to be fixed for a given loop
> > iteration count. It shouldn't grow or shrink based on the value of
> > SPILLED. That's because every execution of the loop body involves
> > exactly one reference to r104. SPILLED specifies the probability
> > that that single reference is the “call” use rather than the
> > “non-call” use, but it doesn't change the total number of references
> > per iteration.
> >
> > So I think the only reason we see the different ALLOCNO_FREQs in:
> >
> >    ALLOCNO_FREQ 989, …
> >
> > vs:
> >
> >    ALLOCNO_FREQ 990, …
> >
> > is round-off error. If the values had more precision, I think we'd
> > have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ.
>
> yeah, that's plausible; as far as I can tell the FREQs are always
> scaled by REG_FREQ_FROM_EDGE_FREQ into [0, BB_FREQ_MAX] and that
> indeed does an integer division. The general problem is that the IPA
> frequencies don't really seem to have any bounded range and so it
> always needs to scale.
>
> So I think you're always going to have this error one way or another,
> which may or may not work to your advantage on any given program.
>
> Maybe we need a way to be a bit more tolerant of this rounding error
> instead?
I guess the thing is: with the costs that the target is giving IRA,
there isn't much of a difference between the cost of the “0.5 way” and
the cost of the “0.51 way” for values of SPILLED around the 0.5 mark.
For spilling r104 we have:

  Factor            Cost type  What
  ======            =========  ====
  1                 store      store outside the loop when R is set
  1024×SPILLED      load       load inside the loop when R is used (call)
  1024×(1-SPILLED)  load       load inside the loop when R is used (non-call)

For allocating a call-clobbered register to r104 we have:

  Factor            Cost type  What
  ======            =========  ====
  1024×SPILLED      store      store before the call
  1024×SPILLED      load       load after the call

When loads and stores have equal weight, that weight cancels out,
giving:

  1025 vs. 2048×SPILLED

So the costs are equal for SPILLED=1025/2048.  Above that, spilling is
cheaper; below that, allocating a call-clobbered register is cheaper.
IRA has done well to find the exact tipping point (based on the
assumptions it has been told to use), but for values of SPILLED around
the tipping point, the two costs are very close.

So with the current cost model, we're not falling off a cliff cost-wise
when we hit the tipping point or rounding error.  The costs coming in
give the impression that there isn't much to choose between the two
approaches when SPILLED is roughly half-and-half.

Now if SPILLED is completely off the mark, e.g. we say it has a
probability of 0.51 but the actual runtime probability is more like
0.1, then clearly we're in trouble.  I'm not sure what we can do about
that though.  Perhaps if the edge frequencies (and derived block
frequencies) are pure guesses, we should weight based on the
approximate cost of the block instead.  I.e. the call block gets a low
frequency because calls are expensive, while the non-call block gets a
high frequency because it does simple arithmetic.  Thus any new
instructions added to the call block are likely to have a
proportionately lower effect than any new instructions added to the
non-call block.  I hope we don't have to do that though :-)

> > > > which is cheaper than both the current approaches. We don't do
> > > > that optimisation yet though, so the current costing seems to
> > > > reflect what we currently generate.
> > > In many (if not most) Arches stores are significantly cheaper than
> > > the loads though. So the store before the call doesn't end up
> > > making that much of a difference, but yes it adds up if you have
> > > many of them.
> > Yeah. Could we fix the problem that way instead? The only reason IRA
> > is treating loads and stores as equal cost is because aarch64 asked
> > it to :-)
> I tried a quick check and it does fix the testcase but not the
> benchmark, which is not entirely unexpected thinking about it, because
> x86 does correctly model the store costs.

Yeah, was afraid that would be the case.

> I can try fixing the costs correctly and try reducing again. It looks
> like it still thinks spilling to memory is cheaper than caller saves
> reloads.
Thanks.  I think that would help.  The current aarch64 costs are likely
to distort things if we try to reduce using them.  IMO the initial
reduction has been useful because it gives a nice crystallised example
of why the load/store cost ratio matters.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #17 from Tamar Christina ---
> On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the
> ALLOCNO_FREQ ought in principle to be fixed for a given loop iteration
> count. It shouldn't grow or shrink based on the value of SPILLED.
> That's because every execution of the loop body involves exactly one
> reference to r104. SPILLED specifies the probability that that single
> reference is the “call” use rather than the “non-call” use, but it
> doesn't change the total number of references per iteration.
>
> So I think the only reason we see the different ALLOCNO_FREQs in:
>
>    ALLOCNO_FREQ 989, …
>
> vs:
>
>    ALLOCNO_FREQ 990, …
>
> is round-off error. If the values had more precision, I think we'd
> have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ.

yeah, that's plausible; as far as I can tell the FREQs are always
scaled by REG_FREQ_FROM_EDGE_FREQ into [0, BB_FREQ_MAX] and that indeed
does an integer division.  The general problem is that the IPA
frequencies don't really seem to have any bounded range and so it
always needs to scale.

So I think you're always going to have this error one way or another,
which may or may not work to your advantage on any given program.

Maybe we need a way to be a bit more tolerant of this rounding error
instead?

> > Instead I've chosen a middle ground here (same as yours but done in
> > ira_tune_allocno_costs instead), which is to store and load only
> > inside the loop, but to do so only in the BB which contains the
> > call.
> I don't think you were saying otherwise, but just FTR: I wasn't
> proposing a solution, I was just describing a hack. It seemed
> to me like IRA was making the right decision for r104 in isolation,
> for the given SPILLED value and target costs. My hack to force
> an allocation for r104 made things worse.

Ah ok, fair enough :)

> > > which is cheaper than both the current approaches. We don't do
> > > that optimisation yet though, so the current costing seems to
> > > reflect what we currently generate.
> > In many (if not most) Arches stores are significantly cheaper than
> > the loads though. So the store before the call doesn't end up making
> > that much of a difference, but yes it adds up if you have many of
> > them.
> Yeah. Could we fix the problem that way instead? The only reason IRA
> is treating loads and stores as equal cost is because aarch64 asked it
> to :-)

I tried a quick check and it does fix the testcase but not the
benchmark, which is not entirely unexpected thinking about it, because
x86 does correctly model the store costs.

I can try fixing the costs correctly and try reducing again.  It looks
like it still thinks spilling to memory is cheaper than caller saves
reloads.
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #16 from rsandifo at gcc dot gnu.org ---
(In reply to Tamar Christina from comment #15)
> > From that point of view, it doesn't look like the memory and register
> > costs of R are too wrong here. The things being costed are the store
> > and load around the call (which do exist if we allocate a call-clobbered
> > register) and the loads at each use site (which do exist if we spill R).
>
> Indeed, I don't think the heuristics are wrong, but because one frequency
> (CALL_FREQ) grows much quicker than BB_FREQ, at the smaller values they are
> a bit sensitive to any changes. The edge probabilities can barely change
> while the BB_FREQ can change dramatically.

On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the ALLOCNO_FREQ ought in principle to be fixed for a given loop iteration count. It shouldn't grow or shrink based on the value of SPILLED. That's because every execution of the loop body involves exactly one reference to r104. SPILLED specifies the probability that that single reference is the “call” use rather than the “non-call” use, but it doesn't change the total number of references per iteration.

So I think the only reason we see the different ALLOCNO_FREQs in:

    ALLOCNO_FREQ 989, …

vs:

    ALLOCNO_FREQ 990, …

is round-off error. If the values had more precision, I think we'd have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ.

At one extreme, if SPILLED is very low (but if we nevertheless continue to duplicate the common part of the loop body) then storing and loading around the call becomes very cheap (ALLOCNO_CALL_FREQ is much lower than the conceptually fixed ALLOCNO_FREQ). If SPILLED is very high (but again we continue to duplicate the common part of the loop body) then storing and loading around the call becomes very expensive. So I guess the question is: where is the cut-off?
Given that:

- the loop iterates 1024 times
- there is a single store outside the loop
- the target claims that loads and stores have equal cost

ISTM that the cut-off really is in the range [0.5, 0.51]. If the value of SPILLED reflects the real probability at runtime then I think IRA is doing the right thing, given the claimed costs of storing and loading. I guess the problems are:

- SPILLED is only a guess
- the aarch64 costs of stores and loads don't reflect reality

> > Like Feng Xue says in comment 1, I think the main missed optimisation
> > opportunity here is that foo + 1024 is invariant, so if we allocate
> > a call-clobbered register, we could save R once outside the loop
> > and reload it after each call. That would give:
> >
> > - a store of R outside the loop (low execution count)
> > - a load of R inside the loop (after the call) with freq 0.51 * loop iters
>
> Yes, that is the ideal solution, but also requires more changes to RA.

True :-)

> Instead I've chosen a middle ground here (same as yours but done in
> ira_tune_allocno_costs instead), which is to store and load only inside
> the loop, but to do so only in the BB which contains the call.

I don't think you were saying otherwise, but just FTR: I wasn't proposing a solution, I was just describing a hack. It seemed to me like IRA was making the right decision for r104 in isolation, for the given SPILLED value and target costs. My hack to force an allocation for r104 made things worse.

> > which is cheaper than both the current approaches. We don't do that
> > optimisation yet though, so the current costing seems to reflect what we
> > currently generate.
>
> In many (if not most) Arches stores are significantly cheaper than the loads
> though. So the store before the call doesn't end up making that much of a
> difference, but yes it adds up if you have many of them.

Yeah. Could we fix the problem that way instead?
The only reason IRA is treating loads and stores as equal cost is because aarch64 asked it to :-) At least for the reduced testcase, ISTM that IRA is making the right choice for the “loads and stores are equally expensive” assumption. It also seems to make the right choice for a “loads are more expensive than stores” assumption. It's up to the target to choose which is right. So I think the reduced testcase is showing a target bug.
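To see why this points at the target costs, here is a back-of-the-envelope sketch of the same spill-vs-allocate trade-off from the reduced testcase. The unit costs are invented for illustration and this is not IRA's actual cost code; it only shows that the decision flips once loads are dearer than stores.

```python
# Spill-vs-allocate trade-off for R in the reduced testcase, as toy
# arithmetic (unit costs are illustrative, not real target numbers).

LOOP_ITERS = 1024
SPILLED = 0.51  # probability the loop body takes the call branch

def spill_cost(load, store):
    # spill R: one store outside the loop, plus a load in whichever
    # branch runs -- i.e. one load on every iteration
    return store + load * LOOP_ITERS

def allocate_cost(load, store):
    # give R a call-clobbered register: store before and load after
    # the call, each executed SPILLED * LOOP_ITERS times
    return (store + load) * SPILLED * LOOP_ITERS

# Equal costs, as aarch64 claimed: spilling looks cheaper, so IRA spills.
print(spill_cost(1, 1) < allocate_cost(1, 1))   # True
# Loads twice as dear as stores: allocating wins even at SPILLED = 0.51.
print(allocate_cost(2, 1) < spill_cost(2, 1))   # True
```

Under the equal-cost claim IRA's choice is internally consistent; change the load/store ratio and the choice flips too, which is why the reduced testcase reads as a target-costing bug rather than an IRA one.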
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #15 from Tamar Christina ---
> That is, we're trading two memory accesses in the call branch
> (if we allocate R) against one memory access in both branches
> (if we spill R). As the call branch gets more likely,
> the cost of doing two memory accesses there gets higher
> relative to the cost of doing one memory access in both branches.
> And that seems like the right behaviour in principle.
>
> From that point of view, it doesn't look like the memory and register
> costs of R are too wrong here. The things being costed are the store
> and load around the call (which do exist if we allocate a call-clobbered
> register) and the loads at each use site (which do exist if we spill R).

Indeed, I don't think the heuristics are wrong, but because one frequency (CALL_FREQ) grows much quicker than BB_FREQ, at the smaller values they are a bit sensitive to any changes. The edge probabilities can barely change while the BB_FREQ can change dramatically.

> Like Feng Xue says in comment 1, I think the main missed optimisation
> opportunity here is that foo + 1024 is invariant, so if we allocate
> a call-clobbered register, we could save R once outside the loop
> and reload it after each call. That would give:
>
> - a store of R outside the loop (low execution count)
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters

Yes, that is the ideal solution, but it also requires more changes to RA. Instead I've chosen a middle ground here (same as yours but done in ira_tune_allocno_costs instead), which is to store and load only inside the loop, but to do so only in the BB which contains the call. This is a major improvement over the current situation, because when you have several nested loops where the value is invariant across a number of them, you run into problems when each of these BBs naturally has very high register pressure.
As you say:

> - a store of R outside the loop (low execution count)
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters
> - a load of R inside the loop with freq 0.49 * loop iters

and if the loop has various BBs (like a long if/then/elseif/else chain) the load has to happen in every BB in the loop. That's why we get the large number of spills we currently do. By forcing it to spill only in the BB with the call inside the loop, the other BBs are freed from all the loads.

> If we force R to be allocated a call-clobbered register instead
> of being spilled (and changing nothing else, via a hack to
> ira-color.c:improve_allocation) then we generate:
>
> - a store of R inside the loop (before the call) with freq 0.51 * loop iters
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters

I essentially did the same thing, but I think in a more conservative way. When you have just a single call inside the entire loop, I force it to assign the call-clobbered register if it needs it. This removed the loads with freq 0.49 but left the ones with 0.51. I use call counts as the measure here because with 1 call and multiple BBs inside the live range you know that at most 1 BB will have a call, and so the rest won't have any. Essentially, if you have high register pressure, make only the part that increases it further pay for the spills. As you say it's not perfect, but it's a conservative improvement over the current situation.

> which is cheaper than both the current approaches. We don't do that
> optimisation yet though, so the current costing seems to reflect what we
> currently generate.

In many (if not most) Arches stores are significantly cheaper than the loads though. So the store before the call doesn't end up making that much of a difference, but yes, it adds up if you have many of them. So indeed removing it is optimal, but that seems like a very hard one to do.
I would assume that the live range for the loop starts at the body of the loop. So I would imagine it's very hard to tell reload to spill outside of the current allocnos it's currently allocating for?

> I don't know how well the above translates to the original example
> though. Are some of the spilled values in exchange loop-invariant
> as well?

Yes, I believe so. It's a bit hard for me to tell since the functions are huge and have many nested loops... But in RTL the BBs are quite large as well after the constprop and recursive inlining stuff. The behaviour is consistent with the minimal problem here, though.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #14 from rsandifo at gcc dot gnu.org ---
Thanks for the nice cut-down example.

In the original comment and the covering note to patch 1, the highlighted spilled value is the final (terminating) value of foo + 1024. Let's start with the SPILLED=0.51 case and consider that register ("R") in isolation, without changing anything else.

If we spill R to memory (as for the current SPILLED=0.51 code), we generate:

- a store of R outside the loop (low execution count)
- a load of R inside the loop (after the call) with freq 0.51 * loop iters
- a load of R inside the loop with freq 0.49 * loop iters

If we force R to be allocated a call-clobbered register instead of being spilled (and changing nothing else, via a hack to ira-color.c:improve_allocation) then we generate:

- a store of R inside the loop (before the call) with freq 0.51 * loop iters
- a load of R inside the loop (after the call) with freq 0.51 * loop iters

So the in-loop cost of the second (allocated) version is higher than the in-loop cost of the first (spilled) version. As the value of SPILLED increases, the difference between the two also increases: spilling R gets progressively “cheaper” than allocating a call-clobbered register to R.

That is, we're trading two memory accesses in the call branch (if we allocate R) against one memory access in both branches (if we spill R). As the call branch gets more likely, the cost of doing two memory accesses there gets higher relative to the cost of doing one memory access in both branches. And that seems like the right behaviour in principle.

From that point of view, it doesn't look like the memory and register costs of R are too wrong here. The things being costed are the store and load around the call (which do exist if we allocate a call-clobbered register) and the loads at each use site (which do exist if we spill R).
Like Feng Xue says in comment 1, I think the main missed optimisation opportunity here is that foo + 1024 is invariant, so if we allocate a call-clobbered register, we could save R once outside the loop and reload it after each call. That would give:

- a store of R outside the loop (low execution count)
- a load of R inside the loop (after the call) with freq 0.51 * loop iters

which is cheaper than both the current approaches. We don't do that optimisation yet though, so the current costing seems to reflect what we currently generate.

I don't know how well the above translates to the original example though. Are some of the spilled values in exchange loop-invariant as well?
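The three alternatives discussed in this comment can be compared with a small cost sketch. The unit costs are illustrative (load == store == 1, per the target's claim) and this is not IRA's real costing code; it just reproduces the trade-off arithmetic.

```python
# Toy costs for the three strategies for R discussed above, with
# illustrative unit costs (load == store == 1, as the target claims).

LOOP_ITERS = 1024
LOAD = STORE = 1

def spill(spilled):
    # spill R: store outside the loop; the call and non-call branches
    # each reload, so together a load runs on every iteration
    return STORE + LOAD * LOOP_ITERS

def allocate(spilled):
    # call-clobbered register: store before / load after the call,
    # each at freq `spilled`
    return (STORE + LOAD) * spilled * LOOP_ITERS

def allocate_invariant(spilled):
    # the missed optimisation: save R once outside the loop and only
    # reload it after each call
    return STORE + LOAD * spilled * LOOP_ITERS

for s in (0.50, 0.51):
    print(s, spill(s), allocate(s), allocate_invariant(s))
```

This reproduces the cut-off in [0.5, 0.51] noted earlier in the thread (allocating wins at 0.50, spilling at 0.51), and shows why the invariant-hoisting version is cheaper than both of the current approaches.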
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #13 from Tamar Christina ---
Created attachment 51943
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51943&action=edit
0002-PATCH-2-2-GCC-reload-Don-t-move-if-the-next-BB-has-m.patch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #12 from Tamar Christina ---
Created attachment 51942
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51942&action=edit
0001-PATCH-1-2-GCC-reload-Weigh-available-callee-saves-ba.patch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #11 from Tamar Christina ---
(In reply to Jan Hubicka from comment #10)
> Thanks for looking into this. I was planning to try to contact Vladimir
> about the IRA behaviour here, but there was always something else to work
> on with higher priority. I wonder if you could possibly attach the WIP patch?

I did exchange a few emails on the subject with him, and we agreed that the best way forward would be to try to improve the heuristics instead of changing the register allocator's algorithm or handling live ranges differently.

So the patches I've attached try to do just that. They're git patches whose descriptions explain what they're trying to do. With -fno-inline-functions-called-once on AArch64 they add about a 13% improvement on exchange and no regressions anywhere else. On x86 the improvement last measured was 24%, with the number of spills reduced by ~90%. The improvement on x86 is higher since AArch64 has more integer registers, so we had more leeway there.

> Teaching the inliner not to inline this function as called once is probably
> just a matter of adding a limit capping loop depth. I think that is a
> meaningful heuristic, since inlining a very large function at a large loop
> depth is probably not a good idea. I have a patch for testing.

That's great! It would be nice not to need the flag anymore. I tried to see if I could address this from a regalloc point of view but it didn't look promising...

> Do you know of other benchmarks where -fno-inline-functions-called-once helps
> (and ideally why)? We plan to look into roms and tramp3d, which also regress.

Hmm, no, I can't say I do; we only use it for the SPECCPU intrate Fortran benchmarks, so I haven't tracked the others with it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #10 from Jan Hubicka ---
Thanks for looking into this. I was planning to try to contact Vladimir about the IRA behaviour here, but there was always something else to work on with higher priority. I wonder if you could possibly attach the WIP patch?

Teaching the inliner not to inline this function as called once is probably just a matter of adding a limit capping loop depth. I think that is a meaningful heuristic, since inlining a very large function at a large loop depth is probably not a good idea. I have a patch for testing.

Do you know of other benchmarks where -fno-inline-functions-called-once helps (and ideally why)? We plan to look into roms and tramp3d, which also regress.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

Jan Hubicka changed:

           What    |Removed                     |Added
------------------------------------------------------------------------
                 CC|                            |pthaugen at gcc dot gnu.org

--- Comment #9 from Jan Hubicka ---
*** Bug 96825 has been marked as a duplicate of this bug. ***
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #8 from Tamar Christina ---
> I wonder how the situation looks on AArch64?

The situation didn't improve; up until the end of stage 1 we were seeing a 6% perf uplift from somewhere, which seems to have gone away now (in a commit range with a non-IPA-related patch). The major problem is still the spills.

Talking to Vlad, I took a look at improving the Chaitin-Briggs heuristics for spilling in the presence of calls and how IRA tries to improve the allocation by moving spills along the call graph. By improving these heuristics I was able to reduce the number of spills and saw improvements on both x86 and AArch64, which brought us back to the old numbers.

However, this same information is used by other areas such as register preferences, and so I had a regression in shrink wrapping. There's also an issue where x86 seems to assign negative values to register costs to indicate they REALLY want this register. This seems to work because the penalty applied is usually large and cancels out the negative cost. But now the value stays negative, causing the register not to be used instead. To fix these I need to track the penalties and the costs separately, but I did not get time to finish that work before the end of stage 1.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #7 from Jiangning Liu ---
Without reverting commit g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92, we still see the exchange2 performance issue for aarch64. BTW, we have been using -fno-inline-functions-called-once to get the best performance number for exchange2.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #6 from Jan Hubicka ---
I am sorry for getting back to this again late. This stage 1 we spent some time with Martin improving the ipa-cp profile updating and looked again into the realism of the profile. Also, the codegen has recently improved somewhat due to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103227 and due to modref propagation.

I still believe that for trunk GCC we should not have a patch that intentionally makes the profile unrealistic just to make IRA work better by accident, since it does not seem to help anything real-world except this somewhat odd benchmark. So I wonder if we can make the profile work better for IRA without actually making it unrealistic and without tampering with the ipa-cp cloning heuristics.

I added code that compares the guessed profile with feedback
https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585599.html
and also fixed/improved the code to dump stats about profile updates
https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585578.html
This gives a bit more of a handle on how realistic the profile is. The answer is: not very, in general, but at least for basic blocks containing calls it is not bad (we guess 0.9 while reality is 0.999). I am not sure how much better we can do statically, since this is such a special case of backtracking.

Last week we also noticed that with -Ofast we inline the newly produced clones together, which makes IRA's job a lot harder. This is done by -finline-functions-called-once, and we tend to inline blocks of 2 or 3 clones, leading to 18 or 27 nested loops in each. Simply disabling this optimization causes another performance hit. I filed PR https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103454 and I think we could teach the inliner to not inline functions called once at large loop depths and restrict the large function growth here, since there are multiple benchmarks that now degrade on this.
Worse yet, the heuristic for inlining functions called once is not very smart: it depends on the order of cgraph_nodes in the linked list, which is a bit random. I wonder how the situation looks on AArch64?