[Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782 --- Comment #43 from rsandifo at gcc dot gnu.org --- (In reply to hubicka from comment #42) > I also see a 6.69% regression on x64 with -Ofast -march=native -flto > https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=475.377.0 I can reproduce this with -Ofast -flto -march=znver3 (but not running on a Zen 3). It looks like it's due to g:d3ff7420e94 instead though (sorry Andre!). With a 3-iteration run, I see a 6.2% regression after that revision compared with before it. It would be great if someone more familiar than me with x86 could confirm the bisection though. > and perhaps 3-5% on sphinx > https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.280.0 > https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=227.280.0 I'll look at these next. > For non-spec benchmarks spec there is a regression on nbench > https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=26.645.1 > There are also large changes in tsvc > https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report > it may be noise since kernels are tiny, but for example x293 reproduces > both on kabylake and zen by about 80-90% regression that may be easy to > track (the kernel is included in the testsuite). Same regression is not > seen on zen3, so may be an ISA specific or so. To summarise what we discussed on irc (for the record): it looks like the s293 regression is in the noise, like you say. I can't convince GCC to generate different code before and after the IRA patches for that. I haven't looked at the other tsvc tests yet.
[Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782 --- Comment #42 from hubicka at kam dot mff.cuni.cz --- on zen2 and 3 with -flto the speedup seems to be cca 12% for both -O2 and -Ofast -march=native which is both very nice! Zen1 for some reason sees less improvement, about 6%. With PGO it is 3.8% Overall it seems a win, but there are few noteworthy issues. I also see a 6.69% regression on x64 with -Ofast -march=native -flto https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=475.377.0 and perhaps 3-5% on sphinx https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.280.0 https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=227.280.0 For non-spec benchmarks spec there is a regression on nbench https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=26.645.1 There are also large changes in tsvc https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report it may be noise since kernels are tiny, but for example x293 reproduces both on kabylake and zen by about 80-90% regression that may be easy to track (the kernel is included in the testsuite). Same regression is not seen on zen3, so may be an ISA specific or so. FInally there seems relatively large code size savings on polyhedron benchmarks today (8% on capacita, Thanks a lot!
[Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782 rsandifo at gcc dot gnu.org changed: What|Removed |Added Resolution|--- |FIXED Summary|[11/12 Regression] Bad |[11 Regression] Bad |interaction between IPA |interaction between IPA |frequences and IRA |frequences and IRA |resulting in spills due to |resulting in spills due to |changes in BB frequencies |changes in BB frequencies Status|ASSIGNED|RESOLVED --- Comment #41 from rsandifo at gcc dot gnu.org --- Hopefully fixed on trunk, much too invasive to backport. Thanks Vlad for the reviews.
[Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782 Richard Biener changed: What|Removed |Added Target Milestone|11.0|12.0 Priority|P3 |P2 --- Comment #5 from Richard Biener --- Re-targeting to GCC 12.
[Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782 Richard Biener changed: What|Removed |Added Target Milestone|--- |11.0
[Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782 --- Comment #4 from Jiangning Liu --- Hi Honza, Do you see any other real case problems if the patch g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92 is not applied? If exchange2 is the only one affected by this patch so far, and because we have observed big performance regression, it sounds we need to provide an IRA fix along with this patch to avoid unexpected performance degradation for gcc11 release vs. gcc10. Thanks, -Jiangning
[Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782 Tamar Christina changed: What|Removed |Added CC||hubicka at gcc dot gnu.org Summary|IRA artificially creating |[11 Regression] Bad |spills due to BB|interaction between IPA |frequencies |frequences and IRA ||resulting in spills due to ||changes in BB frequencies --- Comment #3 from Tamar Christina --- Hi, Since we are in stage-4 I'd like to put all our ducks in a row and see what the options are at this point. IRA as you can imagine is huge and quite complex, the more I investigate the problem the more I realize that there isn't a spot fix for this issue. It will require a lot more work in IRA and understanding parts of it that I don't fully understand yet. But one thing is clear, there is a severe interaction between IPA predictions and IRA under conditions where there is high register pressure *and* a function call. The problem is that the changes introduced in g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92 make local changes. i.e. they effect only some BB and not others. The problem is any spot fix in IRA would be a globally scoped. I was investigating whether the issue could be solved by having IRA treat the recursive inlined function in exchange2 as one region instead of going live range splitting. And yes using -fira-region=one does make a difference, but only a small difference of about 33% of the regression. However doing this has some disadvantage in that regions that before would not count in the live range of the call are now counted, so you regress spilling in those cases. This is why this flag can only recover 33% of the regression, it introduces some of it's own. The second alternative I tried as a spot fix is to be able to specify a weight for the CALL_FREQ for use during situations of high reg pressure and call live ranges. The "hack" looks like this: index 4fe019b2367..674e6ca7a48 100644 --- a/gcc/caller-save.c +++ b/gcc/caller-save.c @@ -425,6 +425,7 @@ setup_save_areas (void) || find_reg_note (insn, REG_NORETURN, NULL)) continue; freq = REG_FREQ_FROM_BB (BLOCK_FOR_INSN (insn)); + freq = freq * (param_ira_call_freq_weight / 100.f); REG_SET_TO_HARD_REG_SET (hard_regs_to_save, >live_throughout); used_regs = insn_callee_abi (insn).full_reg_clobbers (); diff --git a/gcc/ira-lives.c b/gcc/ira-lives.c index 4ba29dcadf4..6e2699e5a7d 100644 --- a/gcc/ira-lives.c +++ b/gcc/ira-lives.c @@ -1392,7 +1392,7 @@ process_bb_node_lives (ira_loop_tree_node_t loop_tree_node) it was saved on previous call in the same basic block and the hard register was not mentioned between the two calls. */ - ALLOCNO_CALL_FREQ (a) += freq / 3; + ALLOCNO_CALL_FREQ (a) += (freq * (param_ira_call_freq_weight / 100.0f)); diff --git a/gcc/params.opt b/gcc/params.opt index cfed980a4d2..39d5cae9f31 100644 --- a/gcc/params.opt +++ b/gcc/params.opt @@ -321,6 +321,11 @@ Max size of conflict table in MB. Common Joined UInteger Var(param_ira_max_loops_num) Init(100) Param Optimization Max loops number for regional RA. +-param=ira-call-freq-weight= +Common Joined UInteger Var(param_ira_call_freq_weight) Init(100) Param Optimization +Scale to be applied to the weighting of the frequencies of allocations live across +a call. + -param=iv-always-prune-cand-set-bound= Common Joined UInteger Var(param_iv_always_prune_cand_set_bound) Init(10) Param Optimization If number of candidates in the set is smaller, we always try to remove unused ivs during its optimization. And if we look at the changes in the frequency between the good and bad case the prediction changes approx 40%. So using the value of --param ira-call-freq-weight=40 recovers about 60% of the regression. The issue this global change introduce is however that IRA seems to start preferring callee-saves. Which is in itself not an issue, but at the boundary of a region it will then emit moves from temp to callee-saves to carry live values to the next region. This is completely unneeded, enabling the late register renaming pass (-frename-registers) removes these superfluous moves and we recover 66% of the regression. But this is just a big hack. The obvious disadvantage here, since again it's a global change is that it pushes all caller saves to be spilled before the function call. And indeed, before the recursive call there now is a massive amount of spilling happening. But it is something that would be "safe" to do at this point in the GCC development cycle. The last and preferred approach, if you agree