[Bug rtl-optimization/110587] [14 regression] 96% pr28071.c compile time regression since r14-2337-g37a231cc7594d1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #22 from CVS Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:b66e613a1a8d5b8fc9d8b03f7b60260700acf833

commit r14-3095-gb66e613a1a8d5b8fc9d8b03f7b60260700acf833
Author: Richard Biener
Date:   Tue Jul 25 15:36:30 2023 +0200

    rtl-optimization/110587 - speedup find_hard_regno_for_1

    The following applies a micro-optimization to find_hard_regno_for_1,
    re-ordering the check so we can easily jump-thread by using an else.
    This reduces the time spent in this function by 15% for the testcase
    in the PR.

            PR rtl-optimization/110587
            * lra-assigns.cc (find_hard_regno_for_1): Re-order checks.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #21 from Richard Biener ---
(In reply to Uroš Bizjak from comment #20)
> Can we revert the Comment #13 kludge now?

When we revert it we get

 integrated RA              :   0.42 ( 17%)   0.00 (  0%)   0.43 ( 17%)    19M ( 16%)
 LRA non-specific           :   0.39 ( 16%)   0.00 (  0%)   0.39 ( 15%)  6304k (  5%)
 LRA virtuals elimination   :   0.03 (  1%)   0.00 (  0%)   0.02 (  1%)  3729k (  3%)
 LRA reload inheritance     :   0.17 (  7%)   0.01 ( 10%)   0.18 (  7%)  5109k (  4%)
 LRA create live ranges     :   0.27 ( 11%)   0.00 (  0%)   0.28 ( 11%)   984k (  1%)
 LRA hard reg assignment    :   0.72 ( 30%)   0.01 ( 10%)   0.74 ( 29%)     0  (  0%)
 TOTAL                      :   2.43          0.10          2.54          123M

so the regression is back and also code size increases significantly.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #20 from Uroš Bizjak ---
Can we revert the Comment #13 kludge now?

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #19 from Richard Biener ---
The tester shows the issue is fixed now (we're faster than before the
regression).  At -O0 compile-time is still dominated by RA
(r14-2920-g07b7cd70399d22, release checking):

 integrated RA              :   0.29 ( 32%)
 LRA non-specific           :   0.15 ( 16%)
 TOTAL                      :   0.91

Samples: 3K of event 'cycles:u', Event count (approx.): 5038659855
Overhead  Samples  Command  Shared Object  Symbol
   6.15%      233  cc1      cc1            [.] process_alt_operands
   4.29%      163  cc1      cc1            [.] process_bb_node_lives
   3.72%      142  cc1      cc1            [.] record_reg_classes
   3.01%      114  cc1      cc1            [.] mark_ref_dead
   2.87%      109  cc1      cc1            [.] constrain_operands
   2.71%      114  cc1      cc1            [.] df_ref_create_structure
   2.47%       94  cc1      cc1            [.] ira_setup_alts

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #18 from CVS Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:07b7cd70399d22c113ad8bb1eff5cc2d12973d33

commit r14-2920-g07b7cd70399d22c113ad8bb1eff5cc2d12973d33
Author: Richard Biener
Date:   Tue Jul 25 15:32:11 2023 +0200

    rtl-optimization/110587 - remove quadratic regno_in_use_p

    The following removes the code checking whether a noop copy
    is between something involved in the return sequence composed
    of a SET and USE.  Instead of checking for this special-case
    the following makes us only ever remove noop copies between
    pseudos - which is the case that is necessary for IRA/LRA
    interfacing to function according to the comment.  That makes
    looking for the return reg special case unnecessary, reducing
    the compile-time in LRA non-specific to zero for the testcase.

            PR rtl-optimization/110587
            * lra-spills.cc (return_regno_p): Remove.
            (regno_in_use_p): Likewise.
            (lra_final_code_change): Do not remove noop moves
            between hard registers.
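
The effect of this commit can be illustrated with a toy model. The sketch below is not the code in lra-spills.cc: `struct toy_move`, its `*_was_pseudo` flags and both functions are hypothetical names made up for illustration, and how GCC itself recognizes at this point that an operand came from a pseudo is not shown. It only demonstrates that, once the rule is "remove a self-copy only when both operands were pseudos", the decision is local to the insn and no scan of later insns is required.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy move "dest = src".  The *_was_pseudo flags are a hypothetical way
   to remember that an operand started life as a pseudo register; this
   is an assumption of the sketch, not GCC's representation.  */
struct toy_move
{
  int src_regno;
  int dest_regno;
  bool src_was_pseudo;
  bool dest_was_pseudo;
};

/* A move is a removable no-op only if it copies a register to itself
   and both operands came from pseudos.  The test looks at this one
   move only, so it costs constant time per insn.  */
static bool
toy_removable_noop_p (const struct toy_move *m)
{
  return m->src_regno == m->dest_regno
         && m->src_was_pseudo && m->dest_was_pseudo;
}

/* Compact MOVES in place, dropping removable no-ops.  Returns the
   number of moves kept.  */
static size_t
toy_remove_noop_moves (struct toy_move *moves, size_t n)
{
  size_t kept = 0;
  for (size_t i = 0; i < n; i++)
    if (!toy_removable_noop_p (&moves[i]))
      moves[kept++] = moves[i];
  return kept;
}
```

In this model a hard-register self-copy (such as the one feeding a return-value USE) is simply never a candidate, which is why the special-case lookup becomes unnecessary.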

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #17 from CVS Commits ---
The master branch has been updated by Roger Sayle:

https://gcc.gnu.org/g:095eb138f736d94dabf9a07a6671bd351be0e66a

commit r14-2851-g095eb138f736d94dabf9a07a6671bd351be0e66a
Author: Roger Sayle
Date:   Fri Jul 28 09:39:46 2023 +0100

    PR rtl-optimization/110587: Reduce useless moves in compile-time hog.

    This patch is one of a series of fixes for PR rtl-optimization/110587,
    a compile-time regression with -O0, that attempts to address the
    underlying cause.  As noted previously, the pathological test case
    pr28071.c contains a large number of useless register-to-register
    moves that can produce quadratic behaviour (in LRA).  These moves are
    generated during RTL expansion in emit_group_load_1, where the
    middle-end attempts to simplify the source before calling
    extract_bit_field.  This is reasonable if the source is a complex
    expression (from before the tree-ssa optimizers), or a SUBREG, or a
    hard register, but it's not particularly useful to copy a pseudo
    register into a new pseudo register.  This patch eliminates that
    redundancy.

    The -fdump-tree-expand for pr28071.c compiled with -O0 currently
    contains 777K lines, with this patch it contains 717K lines, i.e.
    saving about 60K lines (admittedly of debugging text output, but it
    makes the point).

    2023-07-28  Roger Sayle
                Richard Biener

    gcc/ChangeLog
            PR middle-end/28071
            PR rtl-optimization/110587
            * expr.cc (emit_group_load_1): Simplify logic for calling
            force_reg on ORIG_SRC, to avoid making a copy if the source
            is already in a pseudo register.
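
The decision described in the commit message can be summarized as a small predicate. This is a sketch under assumptions, not the logic in expr.cc: the enum `toy_src_kind` and the function `toy_needs_copy_to_pseudo` are hypothetical, and the four kinds are only the cases the commit message lists.

```c
#include <stdbool.h>

/* The source kinds named in the commit message.  This enum is a
   hypothetical classification for illustration; GCC inspects the rtx
   directly rather than using such an enum.  */
enum toy_src_kind
{
  TOY_SRC_COMPLEX_EXPR,
  TOY_SRC_SUBREG,
  TOY_SRC_HARD_REG,
  TOY_SRC_PSEUDO_REG
};

/* Return true if copying the source into a fresh pseudo is worthwhile
   before extracting bit-fields from it.  A source that already is a
   pseudo gains nothing from being copied into another pseudo.  */
static bool
toy_needs_copy_to_pseudo (enum toy_src_kind kind)
{
  return kind != TOY_SRC_PSEUDO_REG;
}
```

Each copy skipped this way is one register-to-register move fewer for the later passes to process.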

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

Roger Sayle changed:

           What    |Removed                           |Added
----------------------------------------------------------------------------
           Assignee|roger at nextmovesoftware dot com |unassigned at gcc dot gnu.org

--- Comment #16 from Roger Sayle ---
My patch (in comment #15) is obsoleted by Richard Biener's much better
solution(s):
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625416.html
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625417.html

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #15 from Roger Sayle ---
Hi Richard,
There's another patch awaiting review at
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625282.html
and I've another follow-up after that currently regression testing...

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #14 from Richard Biener ---
Compile-time is back to the first jump caused by r14-2337-g37a231cc7594d1,
thanks Roger.  We still have

 LRA non-specific           :   3.53 ( 75%)

at -O0 here, which Roger's follow-up patch will improve (but not generally
solve the issue).  At -O1 combine dominates, at -O2 we see other parts of
RA being slow:

 integrated RA              :   7.10 ( 23%)
 LRA non-specific           :   1.56 (  5%)
 LRA virtuals elimination   :   0.07 (  0%)
 LRA reload inheritance     :   1.02 (  3%)
 LRA create live ranges     :   0.88 (  3%)
 LRA hard reg assignment    :   8.22 ( 27%)
 LRA coalesce pseudo regs   :   0.00 (  0%)
 LRA rematerialization      :   0.18 (  1%)

Samples: 124K of event 'cycles:u', Event count (approx.): 164730867020
Overhead  Samples  Command  Shared Object  Symbol
  16.60%    20660  cc1      cc1            [.] find_hard_regno_for_1
  11.90%    14742  cc1      cc1            [.] bitmap_set_bit
   6.47%     7973  cc1      cc1            [.] color_allocnos
   3.31%     4023  cc1      cc1            [.] bitmap_bit_p
   3.07%     3791  cc1      cc1            [.] remove_allocno_from_bucket_and_push
   2.77%     3435  cc1      cc1            [.] assign_hard_reg
   2.54%     3138  cc1      cc1            [.] ira_build_conflicts

In find_hard_regno_for_1 the loop over live ranges is what's costly,
esp. because it seems the conditionals in the loops depend on (indirect)
memory and that no longer fits nicely into caches.  Maybe
regno_allocno_class_array can be shrunk from 'enum reg_class' (unsigned
int) to something smaller.  It looks like this array is a memory
optimization since reg_allocno_class would perform a much sparser access.
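
The suggestion to shrink the array element type can be sketched as follows. This is only an illustration under assumptions: `enum toy_reg_class` and the helper names are hypothetical stand-ins, not GCC's `enum reg_class` or any existing function, and the size of an enum is implementation-defined (commonly the size of `unsigned int`). The point is that a one-byte element packs more entries into each cache line, provided every class value fits in a byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a target's register class enumeration.  */
enum toy_reg_class
{
  TOY_NO_REGS,
  TOY_GENERAL_REGS,
  TOY_SSE_REGS,
  TOY_ALL_REGS,
  TOY_LIM_REG_CLASSES
};

/* The narrow encoding is only valid if every class fits in one byte.  */
_Static_assert (TOY_LIM_REG_CLASSES <= 256,
                "register classes must fit in uint8_t");

/* Bytes needed for a per-pseudo table storing the enum directly.  */
static size_t
toy_wide_table_bytes (size_t n_pseudos)
{
  return n_pseudos * sizeof (enum toy_reg_class);
}

/* Bytes needed for the same table with one byte per entry.  */
static size_t
toy_narrow_table_bytes (size_t n_pseudos)
{
  return n_pseudos * sizeof (uint8_t);
}

/* Store a class in the narrow table element type.  */
static uint8_t
toy_pack_class (enum toy_reg_class c)
{
  return (uint8_t) c;
}

/* Recover the class from a narrow table element.  */
static enum toy_reg_class
toy_unpack_class (uint8_t b)
{
  return (enum toy_reg_class) b;
}
```

Whether this helps in practice depends on how the table is accessed in the hot loop; the comment above only raises it as a possibility.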

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #13 from CVS Commits ---
The master branch has been updated by Roger Sayle:

https://gcc.gnu.org/g:8125b12f846b41f26e58c0fe3b218d654f65d1c8

commit r14-2730-g8125b12f846b41f26e58c0fe3b218d654f65d1c8
Author: Roger Sayle
Date:   Sat Jul 22 21:52:55 2023 +0100

    i386: Don't use insvti_{high,low}part with -O0 (for compile-time).

    This patch attempts to help with PR rtl-optimization/110587, a
    regression of -O0 compile time for the pathological pr28071.c.  My
    recent patch helps a bit, but hasn't returned -O0 compile-time to
    where it was before my ix86_expand_move changes.  The obvious
    solution/workaround is to guard these new TImode parameter passing
    optimizations with "&& optimize", so they don't trigger when
    compiling with -O0.  The very minor complication is that "&& optimize"
    alone leads to the regression of pr110533.c, where our improved TImode
    parameter passing fixes a wrong-code issue with naked functions,
    importantly, when compiling with -O0.  This should explain the one
    line fix below "&& (optimize || ix86_function_naked (cfun))".

    I've an additional fix/tweak or two for this compile-time issue, but
    this change eliminates the part of the regression that I've caused.

    2023-07-22  Roger Sayle

    gcc/ChangeLog
            * config/i386/i386-expand.cc (ix86_expand_move): Disable the
            64-bit insertions into TImode optimizations with -O0, unless
            the function has the "naked" attribute (for PR target/110533).
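
The guard quoted in the commit message, "&& (optimize || ix86_function_naked (cfun))", amounts to a small truth table. The helper below is a hypothetical standalone model of that condition (the name and parameters are made up for illustration and it is not part of i386-expand.cc):

```c
#include <stdbool.h>

/* Model of the guard "optimize || ix86_function_naked (cfun)".
   OPTIMIZE_LEVEL stands in for GCC's global 'optimize' and
   FUNCTION_IS_NAKED for the result of ix86_function_naked; both
   parameters are assumptions of this sketch.  */
static bool
toy_use_timode_insertion (int optimize_level, bool function_is_naked)
{
  return optimize_level != 0 || function_is_naked;
}
```

With this condition the transformation is skipped only for ordinary functions at -O0, which is the case that caused the compile-time regression, while naked functions keep it at every optimization level so that pr110533.c stays fixed.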

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #12 from Richard Biener ---
This code block has a rich history with many fixes for many issues :/
(I thought of just scrapping it ...), still regno_in_use_p is badly
engineered in this context.  Of course we're quite unlucky that the
return REG is in use that much for this large BB.

In the end the reason why this code exists and also some of the fallout
observed in the history point at issues that might be worth fixing
elsewhere as well.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

Roger Sayle changed:

           What    |Removed                      |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com

--- Comment #11 from Roger Sayle ---
My (upcoming) patch for PR88873 dramatically reduces the compile-time
(with -O0) for this test case (by reducing the number of pseudos and
reducing the number of reloads).  But don't let that stop anyone from
speeding up lra_final_code_change.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #10 from Richard Biener ---
I wonder what the following does anyway.  We delete the noop move only
when either the reg isn't used for return or it isn't in use in later
insns between 'insn' and the next set of it.  That seems to detect the

  hardreg = X;
  USE (hardreg);
  return

sequence and wants to protect that despite X being the same as 'hardreg'.

          /* IRA can generate move insns involving pseudos.  It is
             better remove them earlier to speed up compiler a bit.
             It is also better to do it here as they might not pass
             final RTL check in LRA, (e.g. insn moving a control
             register into itself).  So remove an useless move insn
             unless next insn is USE marking the return reg (we should
             save this as some subsequent optimizations assume that
             such original insns are saved).  */
          if (NONJUMP_INSN_P (insn) && GET_CODE (pat) == SET
              && REG_P (SET_SRC (pat)) && REG_P (SET_DEST (pat))
              && REGNO (SET_SRC (pat)) == REGNO (SET_DEST (pat))
              && (! return_regno_p (REGNO (SET_SRC (pat)))
                  || ! regno_in_use_p (insn, REGNO (SET_SRC (pat)))))

What's odd is of course that return_regno_p returns true so much for
this testcase.  The return sequence to protect should be easily
discoverable by walking from the function exit and thus could be marked
instead of trying to match it to each insn like above.

But I don't understand why we want to preserve this noop copy anyway ...

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

Roger Sayle changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873

--- Comment #9 from Roger Sayle ---
I'll check whether turning off the insvti_{low,high}part transformations
during lra_in_progress helps compile-time.  I believe every time reload
encounters a TI<->SSE SUBREG, the spill/reload generates two or three
additional instructions.  I'm thinking that perhaps this should ideally
be an UNSPEC, that we can split after reload.  As shown in PR 88873, we'd
like SSE->TI->SSE to avoid going via memory [where currently this happens
twice].  It looks like "interval" in pr28071.c suffers from the same x86
ABI issues [i.e. is placed in scalar TImode, where ideally we'd like
V2DI].

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at gcc dot gnu.org

--- Comment #8 from Richard Biener ---
Btw, with GCC 13.1 this is already a LRA hog:

 LRA non-specific           :   3.31 ( 73%)   0.01 (  9%)   3.33 ( 72%)  3876k (  3%)
 TOTAL                      :   4.53          0.11          4.65          126M

GCC 8 and before were worse.  On trunk:

 LRA non-specific           :   6.22 ( 69%)   0.02 ( 20%)   6.22 ( 69%)  8922k (  6%)
 LRA hard reg assignment    :   1.00 ( 11%)   0.02 ( 20%)   1.02 ( 11%)     0  (  0%)
 TOTAL                      :   8.97          0.10          9.08          149M

The above is with just -O0.  Profile:

Samples: 37K of event 'cycles:u', Event count (approx.): 49984847870
Overhead  Samples  Command  Shared Object  Symbol
  51.58%    19087  cc1      cc1            [.] lra_final_code_change
  11.10%     4106  cc1      cc1            [.] next_nondebug_insn
   7.61%     2879  cc1      cc1            [.] bitmap_set_bit
   6.42%     2425  cc1      cc1            [.] find_hard_regno_for_1
   2.28%      842  cc1      cc1            [.] bitmap_bit_p
   0.99%      365  cc1      cc1            [.] lra_create_live_ranges_1

It possibly means we now spill more, at -O0 at least.  We have a 10%
regression in assembly line count between 13 and trunk.

The main hog in lra_final_code_change is calls to regno_in_use_p and the
loop within that.  The BB in this function is _huge_ so the whole process
quickly becomes quadratic.  Maybe the whole thing should work backwards
on a BB and this info collected on-the-fly as some "liveness" problem?
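
The "work backwards and collect the info on the fly" idea can be sketched on a toy instruction model. Everything here is an assumption made for illustration: `struct toy_insn`, `TOY_NUM_REGS` and both functions are hypothetical and much simpler than GCC's insns (one register read and one register written per insn at most). The forward query mimics the per-insn scan that makes a huge basic block quadratic; the backward walk computes the same answers for all insns in one pass.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_REGS 8   /* hypothetical register count for the toy model */

/* Toy instruction: at most one register read and one register written.
   A value of -1 means "none".  Register numbers must be below
   TOY_NUM_REGS.  This is not GCC's rtx_insn.  */
struct toy_insn
{
  int use_regno;
  int set_regno;
};

/* Per-insn query: scan forward from insn I until REGNO (which must be
   >= 0) is read (in use) or overwritten (dead).  Calling this for every
   insn of a block with N insns costs O(N^2) in the worst case.  */
static bool
toy_in_use_forward (const struct toy_insn *insns, size_t n,
                    size_t i, int regno)
{
  for (size_t j = i + 1; j < n; j++)
    {
      if (insns[j].use_regno == regno)
        return true;
      if (insns[j].set_regno == regno)
        return false;
    }
  return false;
}

/* Single backward walk: OUT[i] says whether the register written by
   insn I is read later before being overwritten.  LIVE is updated on
   the fly, so the whole block costs O(N).  */
static void
toy_in_use_backward (const struct toy_insn *insns, size_t n, bool *out)
{
  bool live[TOY_NUM_REGS] = { false };
  for (size_t i = n; i-- > 0;)
    {
      int set = insns[i].set_regno;
      int use = insns[i].use_regno;
      /* LIVE currently describes the insns after I.  */
      out[i] = set >= 0 && live[set];
      /* Kill the written register first, then mark the read one, so an
         insn that reads and writes the same register keeps it live.  */
      if (set >= 0)
        live[set] = false;
      if (use >= 0)
        live[use] = true;
    }
}
```

For a block whose insns all write a register, `out[i]` from the backward walk matches `toy_in_use_forward (insns, n, i, insns[i].set_regno)` for every `i`, but is obtained without rescanning the tail of the block for each insn.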

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

Martin Jambor changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sayle at gcc dot gnu.org

--- Comment #7 from Martin Jambor ---
Oops sorry, indeed, the much bigger regression is because of:

commit 8911879415d6c2a7baad88235554a912887a1c5c
Author: Roger Sayle
Date:   Fri Jul 14 18:10:05 2023 +0100

    i386: Improved insv of DImode/DFmode {high,low}parts into TImode.

    This is the next piece towards a fix for (the x86_64 ABI issues
    affecting) PR 88873.  This patch generalizes the recent tweak to
    ix86_expand_move for setting the highpart of a TImode reg from a
    DImode source using *insvti_highpart_1, to handle both DImode and
    DFmode sources, and also use the recently added *insvti_lowpart_1
    for setting the lowpart.

    Although this is another intermediate step (not yet a fix), towards
    enabling *insvti and *concat* patterns to be candidates for TImode
    STV (by using V2DI/V2DF instructions), it already improves things a
    little.

    [...]

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #6 from Richard Biener ---
That doesn't seem to be the larger jump at Jul 16/17?  Can we bisect that
as well?

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #5 from Martin Jambor ---
(In reply to Hongtao.liu from comment #3)
> I can't find pr28071.c in GCC testsuite, but find an attached source file in
> the PR #c1, is that pr28071.c you means?

Yes.

(In reply to Hongtao.liu from comment #4)
> (In reply to Jan Hubicka from comment #0)
> > Seen here:
> > https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=288.597.8
> > https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=468.597.8
> > https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=172.597.8
>
> Also is O0_g means compile flag is -O0 -g?

That is what I used to bisect, although I *think* that -g is not
necessary.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #4 from Hongtao.liu ---
(In reply to Jan Hubicka from comment #0)
> Seen here:
> https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=288.597.8
> https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=468.597.8
> https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=172.597.8

Also is O0_g means compile flag is -O0 -g?

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

Hongtao.liu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #3 from Hongtao.liu ---
I can't find pr28071.c in GCC testsuite, but find an attached source file
in the PR #c1, is that pr28071.c you means?

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-linux-gnu

--- Comment #2 from Andrew Pinski ---
Would be interesting to see if it is the register allocator and where
(which function) in GCC the compile time slow down happens.