[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #47 from CVS Commits --- The master branch has been updated by Jeff Law : https://gcc.gnu.org/g:14162197fd4cf0e3a848d32bd9385876e1b1f249 commit r10-7609-g14162197fd4cf0e3a848d32bd9385876e1b1f249 Author: Jeff Law Date: Tue Apr 7 17:55:00 2020 -0600 Fix a variety of testsuite failures on the H8 after recent cselib changes PR rtl-optimization/92264 * config/h8300/h8300.md (mov;add peephole2): Avoid applying when the destination is the stack pointer.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #46 from CVS Commits --- The master branch has been updated by Jeff Law : https://gcc.gnu.org/g:7f26e60c2600f0a676fc9370390ebf0a3c78f31c commit r10-7548-g7f26e60c2600f0a676fc9370390ebf0a3c78f31c Author: Jeff Law Date: Fri Apr 3 17:47:18 2020 -0600 Fix stdarg-3 regression on xstormy16 port PR rtl-optimization/92264 * config/stormy16/stormy16.c (xstormy16_preferred_reload_class): Handle reloading of auto-increment addressing modes.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #45 from CVS Commits --- The master branch has been updated by Jeff Law : https://gcc.gnu.org/g:b949f8e2acb49273b2f08ecaa3bc7128baaad850 commit r10-7544-gb949f8e2acb49273b2f08ecaa3bc7128baaad850 Author: Jeff Law Date: Fri Apr 3 12:46:13 2020 -0600 Fix va-arg-22.c at -O1 on m32r. PR rtl-optimization/92264 * config/m32r/m32r.c (m32r_output_block_move): Properly account for post-increment addressing of source operands as well as residuals when computing any adjustments to the input pointer.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 Jakub Jelinek changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #44 from Jakub Jelinek --- Fixed then.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #43 from Martin Liška --- (In reply to Jakub Jelinek from comment #42) > Is that good enough to mark this PR as resolved? In #c0 you said before > Richard's change it took ~200s, which is more than 2m21s, though it is > unclear if those 141s are with checking compiler or not. These 140 are with default checking for current master which is checking enabled. That said, I would close this.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #42 from Jakub Jelinek --- Is that good enough to mark this PR as resolved? In #c0 you said before Richard's change it took ~200s, which is more than 2m21s, though it is unclear if those 141s are with checking compiler or not.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #41 from Martin Liška --- The current master does: $ time gfortran module_configure.fppized.f90 -c -march=znver2 -std=legacy -fconvert=big-endian -fno-openmp -Ofast -march=znver2 -g ... real2m21.190s user2m20.487s sys 0m0.647s
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #40 from CVS Commits --- The master branch has been updated by Jakub Jelinek : https://gcc.gnu.org/g:86c924113208f58fdda24078c9cc9285ee8000cd commit r10-7516-g86c924113208f58fdda24078c9cc9285ee8000cd Author: Jakub Jelinek Date: Thu Apr 2 14:34:42 2020 +0200 params: Decrease -param=max-find-base-term-values= default [PR92264] For the PR in question, my proposal would be to also lower -param=max-find-base-term-values= default from 2000 to 200 after this, at least in the above 4 bootstraps/regtests there is nothing that would ever result in find_base_term returning non-NULL with more than 200 VALUEs being processed. 2020-04-02 Jakub Jelinek PR rtl-optimization/92264 * params.opt (-param=max-find-base-term-values=): Decrease default from 2000 to 200.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #39 from CVS Commits --- The master branch has been updated by Jakub Jelinek : https://gcc.gnu.org/g:2c0fa3ecf70d199af18785702e9e0548fd3ab793 commit r10-7515-g2c0fa3ecf70d199af18785702e9e0548fd3ab793 Author: Jakub Jelinek Date: Thu Apr 2 14:28:14 2020 +0200 cselib: Reuse VALUEs on sp adjustments [PR92264] As discussed in the PR, if !ACCUMULATE_OUTGOING_ARGS on large functions we can have hundreds of thousands of stack pointer adjustments and cselib creates a new VALUE after each sp adjustment, which form extremely deep VALUE chains, which is very harmful e.g. for find_base_term. E.g. if we have sp -= 4 sp -= 4 sp += 4 sp += 4 sp -= 4 sp += 4 that means 7 VALUEs, one for the sp at beginning (val1), than val2 = val1 - 4, then val3 = val2 - 4, then val4 = val3 + 4, then val5 = val4 + 4, then val6 = val5 - 4, then val7 = val6 + 4. This patch tweaks cselib, so that it is smarter about sp adjustments. When cselib_lookup (stack_pointer_rtx, Pmode, 1, VOIDmode) and we know nothing about sp yet (this happens at the start of the function, for non-var-tracking also after cselib_reset_table and for var-tracking after processing fp_setter insn where we forget about former sp values because that is now hfp related while everything after it is sp related), we look it up normally, but in addition to what we have been doing before we mark the VALUE as SP_DERIVED_VALUE_P. Further lookups of sp + offset are then special cased, so that it is canonicalized to that SP_DERIVED_VALUE_P VALUE + CONST_INT (if possible). So, for the above, we get val1 with SP_DERIVED_VALUE_P set, then val2 = val1 - 4, val3 = val1 - 8 (note, no longer val2 - 4!), then we get val2 again, val1 again, val2 again, val1 again. In the find_base_term visited_vals.length () > 100 find_base_term statistics during combined x86_64-linux and i686-linux bootstrap+regtest cycle, without the patch I see: find_base_term > 100 returning NULL returning non-NULL 32-bit compilations 4229178 407 64-bit compilations 217523 0 with largest visited_vals.length () when returning non-NULL being 206. With the patch the same numbers are: 32-bit compilations 1249588 135 64-bit compilations 35100 with largest visited_vals.length () when returning non-NULL being 173. This shows significant reduction of the deep VALUE chains. On powerpc64{,le}-linux, these stats didn't change at all, we have 10080 for all of -m32, -m64 and little-endian -m64, just the gcc.dg/pr85180.c and gcc.dg/pr87985.c testcases which are unrelated to sp. My earlier version of the patch, which contained just the rtl.h and cselib.c changes, regressed some tests: gcc.dg/guality/{pr36728-{1,3},pr68860-{1,2}}.c gcc.target/i386/{pr88416,sse-{13,23,24,25,26}}.c The problem with the former tests was worse debug info, where with -m32 where arg7 was passed in a stack slot we though a push later on might have invalidated it, when it couldn't. This is something I've solved with the var-tracking.c (vt_initialize) changes. In those problematic functions, we create a cfa_base VALUE (argp) and want to record that at the start of the function the argp VALUE is sp + off and also record that current sp VALUE is argp's VALUE - off. The second permanent equivalence didn't make it after the patch though, because cselib_add_permanent_equiv will cselib_lookup the value of the expression it wants to add as the equivalence and if it is the same VALUE as we are calling it on, it doesn't do anything; and due to the cselib changes for sp based accesses that is exactly what happened. By reversing the order of the cselib_add_permanent_equiv calls we get both equivalences though and thus are able to canonicalize the sp based accesses in var-tracking to the cfa_base value + offset. The i386 FAILs were all ICEs, where we had pushf instruction pushing flags and then pop pseudo reading that value again. With the cselib changes, cselib during RTL DSE is able to see through the sp adjustment and wanted to replace_read what was done pushf, by moving the flags register into a pseudo and replace the memory read in the pop with that pseudo. That is wrong for two reasons: one is that the backend doesn't have an instruction to move the flags hard register into some other register, but replace_read has been validating just the mem -> pseudo replacement and not the insns emitted by copy_to_mode_reg. And the second issue is that it is obviously wrong to replace a stack pop which contains stack post-increment by a copy of pseudo into destination.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #38 from Jakub Jelinek --- No, far from it.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #37 from Martin Liška --- @Jakub: Can we close it? Or do you plan any other patch for it?
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #36 from CVS Commits --- The master branch has been updated by Jakub Jelinek : https://gcc.gnu.org/g:9708ca2be40399d6266bc85c99e085e3fe27a809 commit r10-7392-g9708ca2be40399d6266bc85c99e085e3fe27a809 Author: Jakub Jelinek Date: Thu Mar 26 09:15:39 2020 +0100 var-tracking: Mark as sp based more VALUEs [PR92264] With this simple patch, on i686-linux and x86_64-linux with -m32 (no change for -m64), the find_base_term visited_vals.length () > 100 find_base_term statistics changed (fbt is before this patch, fbt2 with this patch): for k in '' '1$'; do for i in 32 64; do for j in fbt fbt2; do \ echo -n "$j $i $k "; LC_ALL=C grep ^$i.*"$k" $j | wc -l; done; done; done fbt 32 5313355 fbt2 32 4229854 fbt 64 217523 fbt2 64 217523 fbt 32 1$ 1296 fbt2 32 1$ 407 fbt 64 1$ 0 fbt2 64 1$ 0 For frame_pointer_needed functions, we need to wait until we see the fp_setter insn in the prologue at which point we disassociate the fp based VALUEs from sp based VALUEs, but for !frame_pointer_needed functions, we IMHO don't need to wait anything. For ACCUMULATE_OUTGOING_ARGS it isn't IMHO worth doing anything, as there is a single sp adjustment and so there is no risk of many thousands deep VALUE chains, but for !ACCUMULATE_OUTGOING_ARGS the sp keeps changing constantly. 2020-03-26 Jakub Jelinek PR rtl-optimization/92264 * var-tracking.c (add_stores): Call cselib_set_value_sp_based even for sp based values in !frame_pointer_needed && !ACCUMULATE_OUTGOING_ARGS functions.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #35 from Martin Liška --- The suggested patch helped to reduce compilation to: 2m54s
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #34 from Jakub Jelinek --- Created attachment 48081 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48081=edit gcc10-pr92264-wip.patch Further updated patch, this one passes bootstrap on both x86_64-linux and i686-linux and according to the gathered statistics results in significantly fewer cases of visited_vals.length () > 100 in find_base_term, together with the var-tracking.c small change posted to gcc-patches from 5530878 to 1253101 cases. But, so far it causes a few regressions: gcc.dg/guality/pr36728-{1,3}.c and gcc.dg/guality/pr68860-{1,2}.c at various optimization levels and ICEs on gcc.target/i386/pr88416.c and gcc.target/i386/sse-{13,23,24,25,26}.c. So far I've looked at pr88416.c and the problem is that DSE now on: (insn 10 9 11 2 (set (mem:DI (pre_dec:DI (reg/f:DI 7 sp)) [0 S8 A8]) (reg:DI 17 flags)) "include/ia32intrin.h":262:10 56 {*pushfldi2} (expr_list:REG_DEAD (reg:DI 17 flags) (nil))) (insn 11 10 29 2 (set (reg:DI 85) (mem:DI (post_inc:DI (reg/f:DI 7 sp)) [0 S8 A8])) "include/ia32intrin.h":262:10 52 {*popdi1} (nil)) "optimizes" it into: (insn 31 9 10 2 (set (reg:DI 89) (reg:DI 17 flags)) "include/ia32intrin.h":262:10 -1 (nil)) (insn 10 31 11 2 (set (mem:DI (pre_dec:DI (reg/f:DI 7 sp)) [0 S8 A8]) (reg:DI 17 flags)) "include/ia32intrin.h":262:10 56 {*pushfldi2} (expr_list:REG_DEAD (reg:DI 17 flags) (nil))) (insn 11 10 29 2 (set (reg:DI 85) (reg:DI 89)) "include/ia32intrin.h":262:10 66 {*movdi_internal} (nil)) This is wrong for 2 reasons. The newly added insn 31 isn't recognized and DSE doesn't bother trying to recognize it, and when the pop is optimized away into just reg to reg assignment, the side-effect (post-increment of sp) is lost. Wonder if DSE shouldn't punt for autoinc (don't remove stores that have them) or say try to replace the autoinc with register update, and whether it shouldn't try to recog new insns it creates. As for the guality failures, I've so far briefly looked only at the first one, where we end up with an extra +(note 84 53 54 2 (var_location arg7 (nil)) NOTE_INSN_VAR_LOCATION) that wasn't there before. Will try to understand this one.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #33 from Jakub Jelinek --- Created attachment 48075 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48075=edit gcc10-pr92264-wip.patch Updated patch, which doesn't ICE anymore, and creates 10500 instead of 12000 VALUEs during var-tracking on diagnostic-show-location.c, which should be a good thing. But the .s file is smaller and in the *.vartrack dump I see some NOTE_INSN_VAR_LOCATIONs missing, so apparently it does have bad impact on debug info quality.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #32 from Jakub Jelinek --- Some incremental progress, but still ICEs... --- gcc/cselib.c2020-03-20 17:42:02.333023994 +0100 +++ gcc/cselib.c2020-03-20 19:23:33.506622424 +0100 @@ -58,6 +58,16 @@ static void cselib_invalidate_mem (rtx); static void cselib_record_set (rtx, cselib_val *, cselib_val *); static void cselib_record_sets (rtx_insn *); +static rtx autoinc_split (rtx, rtx *, machine_mode); + +#define PRESERVED_VALUE_P(RTX) \ + (RTL_FLAG_CHECK1 ("PRESERVED_VALUE_P", (RTX), VALUE)->unchanging) + +#define SP_BASED_VALUE_P(RTX) \ + (RTL_FLAG_CHECK1 ("SP_BASED_VALUE_P", (RTX), VALUE)->jump) + +#define SP_DERIVED_VALUE_P(RTX) \ + (RTL_FLAG_CHECK1 ("SP_DERIVED_VALUE_P", (RTX), VALUE)->call) struct expand_value_data { @@ -122,6 +132,13 @@ if (GET_CODE (x) == VALUE) return x == v->val_rtx; + if (SP_DERIVED_VALUE_P (v->val_rtx) && GET_MODE (x) == Pmode) +{ + rtx xoff = NULL; + if (autoinc_split (x, , memmode) == v->val_rtx && xoff == NULL_RTX) + return true; +} + /* We don't guarantee that distinct rtx's have different hash values, so we need to do a comparison. */ for (l = v->locs; l; l = l->next) @@ -256,15 +273,6 @@ void (*cselib_record_sets_hook) (rtx_insn *insn, struct cselib_set *sets, int n_sets); -#define PRESERVED_VALUE_P(RTX) \ - (RTL_FLAG_CHECK1 ("PRESERVED_VALUE_P", (RTX), VALUE)->unchanging) - -#define SP_BASED_VALUE_P(RTX) \ - (RTL_FLAG_CHECK1 ("SP_BASED_VALUE_P", (RTX), VALUE)->jump) - -#define SP_DERIVED_VALUE_P(RTX) \ - (RTL_FLAG_CHECK1 ("SP_DERIVED_VALUE_P", (RTX), VALUE)->call) - /* Allocate a struct elt_list and fill in its two elements with the @@ -497,7 +505,7 @@ }; cselib_val **slot = cselib_preserved_hash_table->find_slot_with_hash (, - v->hash, INSERT); + v->hash, INSERT); gcc_assert (!*slot); *slot = v; } @@ -535,6 +543,28 @@ max_value_regs = hard_regno_nregs (regno, GET_MODE (cfa_base_preserved_val->locs->loc)); + + /* If cfa_base is sp + const_int, need to preserve also the +SP_DERIVED_VALUE_P value. */ + for (struct elt_loc_list *l = cfa_base_preserved_val->locs; + l; l = l->next) + if (GET_CODE (l->loc) == PLUS + && GET_CODE (XEXP (l->loc, 0)) == VALUE + && SP_DERIVED_VALUE_P (XEXP (l->loc, 0)) + && CONST_INT_P (XEXP (l->loc, 1))) + { + if (! invariant_or_equiv_p (CSELIB_VAL_PTR (XEXP (l->loc, 0 + { + rtx val = cfa_base_preserved_val->val_rtx; + rtx_insn *save_cselib_current_insn = cselib_current_insn; + cselib_current_insn = l->setting_insn; + new_elt_loc_list (CSELIB_VAL_PTR (XEXP (l->loc, 0)), + plus_constant (Pmode, val, +-UINTVAL (XEXP (l->loc, 1; + cselib_current_insn = save_cselib_current_insn; + } + break; + } } else { @@ -544,8 +574,7 @@ } if (cselib_preserve_constants) -cselib_hash_table->traverse - (NULL); +cselib_hash_table->traverse (NULL); else { cselib_hash_table->empty ();
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #31 from Jakub Jelinek --- Created attachment 48073 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48073=edit gcc10-pr92264-wip.patch WIP patch to try special casing cselib handling of stack pointer VALUEs. The intent is that when we first cselib_lookup the stack pointer (e.g. at the start of a bb), we mark the created VALUE as SP_DERIVED_VALUE_P and when we lookup sp + CONST_INT, we try to canonicalize that lookup into lookup of that VALUE + adjusted offset. Grepping e.g. -fdump-rtl-postreload-cselib or -fdump-rtl-dse2-cselib for ^cselib.value shows that with -O2 -m32 we create significantly fewer new VALUEs as we can share the VALUEs that are that SP_DERIVED_VALUE_P VALUE + offset. Unfortunately, this ICEs during var-tracking due to VALUE canonicalization and PRESERVED_VALUE_P.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #30 from Jakub Jelinek --- Or perhaps just do: --- gcc/var-tracking.c.jj 2020-01-12 11:54:38.532381439 +0100 +++ gcc/var-tracking.c 2020-03-19 15:49:19.457340470 +0100 @@ -6112,7 +6112,8 @@ add_stores (rtx loc, const_rtx expr, voi } if (loc == stack_pointer_rtx - && maybe_ne (hard_frame_pointer_adjustment, -1) + && (maybe_ne (hard_frame_pointer_adjustment, -1) + || (!frame_pointer_needed && !ACCUMULATE_OUTGOING_ARGS)) && preserve) cselib_set_value_sp_based (v); and with that, retry the statistics gathering to see if now everything > 100 is NULL?
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #29 from Jakub Jelinek --- So, the reason why the values aren't being marked as sp based is given in PR54796 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54796#c2 In these functions, there is no fp_setter insn, so hard_frame_pointer_adjustment is -1, all we have there is (insn/f:TI 2695 2874 2696 2 (set (mem:SI (pre_dec:SI (reg/f:SI 7 sp)) [0 S4 A8]) (reg:SI 6 bp)) "../../gcc/diagnostic-show-locus.c":3861:1 43 {*pushsi2} (expr_list:REG_DEAD (reg:SI 6 bp) (nil))) some pushes to preserve call saved regs (insn/f:TI 2699 2698 2700 2 (parallel [ (set (reg/f:SI 7 sp) (plus:SI (reg/f:SI 7 sp) (const_int -508 [0xfe04]))) (clobber (reg:CC 17 flags)) (clobber (mem:BLK (scratch) [0 A8])) ]) "../../gcc/diagnostic-show-locus.c":3861:1 1058 {pro_epilogue_adjust_stack_add_si} (expr_list:REG_UNUSED (reg:CC 17 flags) (expr_list:REG_CFA_ADJUST_CFA (set (reg/f:SI 7 sp) (plus:SI (reg/f:SI 7 sp) (const_int -508 [0xfe04]))) (nil (note 2700 2699 17 2 NOTE_INSN_PROLOGUE_END) Now, the reason we don't mark preserved values sp based unconditionally if they are sp based is the fear about the prologue and pre-prologue spots, as we need to disambiguate between sp based and hfp based VALUEs and they normally start the same as sp based. Not sure what we can do, perhaps also remember if we've seen in the prologue some important hfp related instruction, at that point somehow force using different VALUEs for hfp vs. sp based accesses (dunno if by cselib invalidation of hfp or of sp) and from that point on mark the clearly sp based VALUEs as sp_based.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #28 from Martin Liška --- (In reply to Richard Biener from comment #25) > Created attachment 48064 [details] > more localized caching > > Updated and simplified patch. Maybe it does help depending on how we have > shared locs for multiple values ... > > The update is to not bother updating cache values with non-NULL found ones. So this patch is running for the problematic wrf testcase for more that 2h30m..
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #27 from Jakub Jelinek --- So, I've looked at two cases from those where visited_vals.length () > 100 returned non-NULL, in particular Wsizeof-pointer-memaccess1.c and diagnostic-show-locus.c. In both all the cases where it returned non-NULL and length () > 100 were those where it returns static_reg_base_value[STACK_POINTER_REGNUM]. Which seems exactly like the case for which cselib_sp_based_value_p has been invented, except it doesn't seem to work in this case the way it should. Will try to have a look why. Anyway, if that is fixed, perhaps we can just lower the param to 100?
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #26 from Richard Biener --- (In reply to Martin Liška from comment #24) > (In reply to Richard Biener from comment #23) > > (In reply to Richard Biener from comment #22) > > > Created attachment 48063 [details] > > > more localized caching > > > > > > Like this. Martin, can you also check the effect on this one? > > > > We can actually simplify since we won't ever use non-NULL cached vals which > > also means this patch is a no-op and should be worse than the existing > > state :/ > > So no testing is needed? would be still interesting (but surprising) if it helps (doesn't matter whether the original or the updated patch, the update is just constant time improvement)
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 Richard Biener changed: What|Removed |Added Attachment #48063|0 |1 is obsolete|| --- Comment #25 from Richard Biener --- Created attachment 48064 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48064=edit more localized caching Updated and simplified patch. Maybe it does help depending on how we have shared locs for multiple values ... The update is to not bother updating cache values with non-NULL found ones.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #24 from Martin Liška --- (In reply to Richard Biener from comment #23) > (In reply to Richard Biener from comment #22) > > Created attachment 48063 [details] > > more localized caching > > > > Like this. Martin, can you also check the effect on this one? > > We can actually simplify since we won't ever use non-NULL cached vals which > also means this patch is a no-op and should be worse than the existing > state :/ So no testing is needed?
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #23 from Richard Biener --- (In reply to Richard Biener from comment #22) > Created attachment 48063 [details] > more localized caching > > Like this. Martin, can you also check the effect on this one? We can actually simplify since we won't ever use non-NULL cached vals which also means this patch is a no-op and should be worse than the existing state :/
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #22 from Richard Biener --- Created attachment 48063 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48063=edit more localized caching Like this. Martin, can you also check the effect on this one?
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #21 from Richard Biener --- (In reply to Jakub Jelinek from comment #19) > I think caching is problematic, for a couple of reasons: > 1) for non-cselib_preserved_value_p, the loc list is dynamic and keeps > changing, locs are added and removed as we go through the basic blocks Sure, but if you look at my patch I'm caching on individual locs, not on the whole list. That then still doesn't avoid traversing the whole list if we don't find a base but we'd at least elide further recursion. The base we pick also depends on the ordering of the list currently if there ever are two possible choices. > 2) because of the recursion prevention, doesn't it matter from exactly what > VALUE we start walking (in case we have a cycle or cycles)? I think the outcome is essentially random anyways since we pick the first base we find. I'm quite sure that if we collected "all" bases we'd find they are not equivalent in some cases. > 3) plus the new param on visited_vals, if we reach it, caching is unreliable Sure, but does anybody care? Note what I'd really propose would be find_base_term-local caching, eliding both the recursion prevention and the param.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #20 from Martin Liška --- (In reply to Richard Biener from comment #17) > Created attachment 48061 [details] > cache base term > > I wonder if we could simply cache the base terms in elt_loc_list? Does that > make a difference? Yes: $ time gfortran module_configure.fppized.f90 -c -march=znver2 -std=legacy -fconvert=big-endian -fno-openmp -Ofast -march=znver2 -g real1m43.067s user1m42.410s $ perf report --stdio | head -n20 ... 6.26% f951 f951 [.] get_ref_base_and_extent 5.85% f951 f951 [.] bitmap_set_bit 5.72% f951 f951 [.] find_base_term 3.73% f951 f951 [.] pre_expr_reaches_here_p_work 2.77% f951 f951 [.] fsm_find_thread_path 2.62% f951 f951 [.] rtx_equal_for_memref_p 2.32% f951 f951 [.] build_object_conflicts 2.17% f951 f951 [.] memrefs_conflict_p 1.88% f951 f951 [.] ira_build_conflicts
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #19 from Jakub Jelinek --- I think caching is problematic, for a couple of reasons: 1) for non-cselib_preserved_value_p, the loc list is dynamic and keeps changing, locs are added and removed as we go through the basic blocks 2) because of the recursion prevention, doesn't it matter from exactly what VALUE we start walking (in case we have a cycle or cycles)? 3) plus the new param on visited_vals, if we reach it, caching is unreliable I've performed {x86_64,i686}-linux bootstraps+regtests overnight with the above mentioned statistics gathering and the end result is 5530878 cases where find_base_term had visited_vals.length () > 100 on return, but only 1296 cases out of that where with visited_vals.length () > 100 we actually returned non-NULL, and no cases of visited_vals.length () > 500 where we returned non-NULL. I'm listing just the cases where it returned non-NULL, as can be seen, all are for -m32 and guess it would be nice to have a look at them what we actually manage to return and how the VALUE chains look like. $ grep -v '0$' fbt | sort | uniq -c | sort -r -k +5 -n 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 484 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 482 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 481 1 3 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 480 1 2 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 479 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 477 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 476 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 473 1 3 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 472 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 469 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 468 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 467 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 464 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 463 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 462 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 458 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 456 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 455 1 3 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 454 1 2 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 453 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 451 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 450 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 447 1 3 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 446 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 443 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 442 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 439 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 438 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 437 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 433 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 431 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 430 1 3 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 429 1 2 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 428 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 426 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 425 1 1 32 ../../gcc/diagnostic-show-locus.c selftest::test_diagnostic_show_locus_one_liner_utf8 422 1 3 32
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #18 from Richard Biener --- Note also that param_max_find_base_term_values limits recursion depth but not width (the loc list traversals). The original visited_vals thing was to prevent infinite recursion only. If the global caching works a safer approach would be to turn that visited_vals things into a local cache and see if that's enough as well.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #17 from Richard Biener --- Created attachment 48061 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48061=edit cache base term I wonder if we could simply cache the base terms in elt_loc_list? Does that make a difference?
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #16 from Martin Liška --- PR88440 is also slightly related where enablement of -ftree-loop-distribute-patterns caused longer compilation: 521.wrf_r: 310 -> 346s
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #15 from Richard Biener --- WRF initial_config has very very very many (nested) loops to initialize globals. IIRC there's a related bug running into the very same issue when prefetching is enabled.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #14 from Martin Liška --- For firefox with LTO we get: $ wc -l /tmp/fbt 1645 /tmp/fbt $ sort /tmp/fbt | uniq -c | sort -n | tac | head -n20 10 64 /tmp/libxul.so.J1HwqB.ltrans17.o CollectReports 103 0 10 64 /tmp/libxul.so.J1HwqB.ltrans17.o CollectReports 101 0 6 64 /tmp/libxul.so.J1HwqB.ltrans17.o CollectReports 106 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 997 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 996 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 995 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 994 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 993 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 992 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 991 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 988 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 987 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 986 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 985 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 984 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 983 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 982 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 979 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 978 0 2 64 /tmp/libxul.so.J1HwqB.ltrans55.o GetFormat 977 0 there are no bigger values than 1000.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #13 from Martin Liška --- So for the problematic wrf file we get: $ gfortran module_configure.fppized.f90 -c -march=znver2 -std=legacy -fconvert=big-endian -fno-openmp -Ofast -march=znver2 -g ... $ wc -l /tmp/fbt 26273112 /tmp/fbt $ sort /tmp/fbt | uniq -c | sort -n | tac | head -n10 24666359 64 module_configure.fppized.f90 initial_config 2001 0 482996 64 module_configure.fppized.f90 initial_config 757 0 336377 64 module_configure.fppized.f90 initial_config 620 0 321717 64 module_configure.fppized.f90 initial_config 607 0 288735 64 module_configure.fppized.f90 initial_config 571 0 101785 64 module_configure.fppized.f90 initial_config 317 0 23376 64 module_configure.fppized.f90 initial_config 127 0 22014 64 module_configure.fppized.f90 initial_config 120 0 18135 64 module_configure.fppized.f90 initial_config 107 0 41 64 module_configure.fppized.f90 initial_config 115 0
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #12 from Jakub Jelinek --- I guess we can consider lowering the default value of the param and/or having separate param for var-tracking vs. for normal RTL optimizations. But before doing that I think we want to gather some statistics to help us decide on which value we want, like: --- gcc/alias.c 2020-03-09 13:38:04.534284063 +0100 +++ gcc/alias.c 2020-03-18 14:20:03.669278109 +0100 @@ -2116,6 +2116,14 @@ find_base_term (rtx x) rtx res = find_base_term (x, visited_vals); for (unsigned i = 0; i < visited_vals.length (); ++i) visited_vals[i].first->locs = visited_vals[i].second; + if (visited_vals.length () > 100) +{ + FILE *f = fopen ("/tmp/fbt", "a"); + fprintf (f, "%d %s %s %d %d\n", (int) BITS_PER_WORD, + main_input_filename ? main_input_filename : "-", + current_function_name (), visited_vals.length (), res != NULL_RTX); + fclose (f); +} return res; } and use it on both x86_64-linux and i686-linux bootstrap/regtest (perhaps other targets too) and perhaps on some other packages too (firefox, libreoffice, ...).
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #11 from Martin Liška --- So it finishes in 50m16s.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #10 from Jakub Jelinek --- (In reply to Martin Liška from comment #9) > With --param max-find-base-term-values=100 it takes 4m24s. > With --param max-find-base-term-values=1 it takes 2m22s. So, if you e.g. bump the timeout to 30 minutes, does it complete even with the default setting and how long does it take?
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #9 from Martin Liška --- With --param max-find-base-term-values=100 it takes 4m24s. With --param max-find-base-term-values=1 it takes 2m22s.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #8 from Martin Liška --- With --param max-find-base-term-values=10 it takes 2m34s.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #7 from Martin Liška --- (In reply to Jakub Jelinek from comment #6) > In r10-7086-g2e94d3ee47be0742df843d95e3d1bf1da11e4796 I've added a param > controlled cap on the number of VALUEs walked by a single toplevel > find_base_term. > Does it make a difference on this? The current master still fails with a timeout of 5 minutes. Perf looks a bit lower for find_base_term: 46.64% f951 f951 [.] find_base_term 8.91% f951 f951 [.] ix86_find_base_term 2.19% f951 libc-2.31.so [.] _int_malloc 2.15% f951 f951 [.] get_ref_base_and_extent 2.11% f951 f951 [.] bitmap_set_bit 2.03% f951 f951 [.] find_base_term 1.75% f951 f951 [.] cselib_sp_based_value_p 1.52% f951 f951 [.] pre_expr_reaches_here_p_work 1.51% f951 f951 [.] fsm_find_thread_path 1.20% f951 f951 [.] build_object_conflicts
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 Jakub Jelinek changed: What|Removed |Added CC||jakub at gcc dot gnu.org --- Comment #6 from Jakub Jelinek --- In r10-7086-g2e94d3ee47be0742df843d95e3d1bf1da11e4796 I've added a param controlled cap on the number of VALUEs walked by a single toplevel find_base_term. Does it make a difference on this?
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 Richard Biener changed: What|Removed |Added Priority|P3 |P1
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 rsandifo at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2019-11-19 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #5 from rsandifo at gcc dot gnu.org --- Mine.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #4 from Martin Liška --- > If you have handy access to the reproducer, is it -g that makes > the difference? Yes.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #3 from rsandifo at gcc dot gnu.org --- Similar to what Richard says, this sounds like a latent bug. One of the effects of that rev was to prevent unnecessary invalidation of equivalences based on the stack pointer and frame pointer (which stay the same after a call). It sounds like this has triggered something that could have happened anyway if there weren't any calls in the way. If you have handy access to the reproducer, is it -g that makes the difference? var-tracking can generate very large cselib tables and I'm wondering if it comes from there.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 --- Comment #2 from Martin Liška --- Just for the record, the compilation takes now ~2:30 hours.
[Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92264 Richard Biener changed: What|Removed |Added Target||x86_64-linux-gnu Host|x86_64-linux-gnu| Target Milestone|--- |10.0 --- Comment #1 from Richard Biener --- Note find_base_term is known to be quadratic ... possibly we were able to short-cut this much more often before the rev.