Richard Sandiford <[email protected]> writes:
> Richard Biener <[email protected]> writes:
>> On Wed, 2 Apr 2025, Richard Sandiford wrote:
>>
>>> Richard Biener <[email protected]> writes:
>>> > On Tue, 1 Apr 2025, Richard Sandiford wrote:
>>> >
>>> >> The problem in PR101523 was that, after each successful 2->2
>>> >> combination attempt, distribute_links would search further and
>>> >> further for the next combinable use of the i2 destination.
>>> >> The original patch for the PR dealt with that by disallowing such
>>> >> combinations. However, that led to various optimisation regressions,
>>> >> so there was interest in allowing the combinations again, at least
>>> >> until an alternative way of getting the same results is in place.
>>> >>
>>> >> This patch does that, but limits distribute_links to a certain number
>>> >> of instructions when i2 is unchanged. As Segher said in the PR trail,
>>> >> it would make more conceptual sense to apply the limit unconditionally,
>>> >> but I thought it would be better to change as little as possible at
>>> >> this development stage. Logically, in stage 1, the --param should
>>> >> be applied directly by distribute_links with no input from callers.
>>> >>
>>> >> As I mentioned in:
>>> >>
>>> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c28
>>> >>
>>> >> I think it's safe to drop log links even if a use exists. All
>>> >> processing of log links seems to handle the absence of a link
>>> >> for a particular register in a conservative way.
>>> >>
>>> >> The initial set-up errs on the side of dropping links, since for example
>>> >> create_log_links has:
>>> >>
>>> >> /* flow.c claimed:
>>> >>
>>> >> We don't build a LOG_LINK for hard registers contained
>>> >> in ASM_OPERANDs. If these registers get replaced,
>>> >> we might wind up changing the semantics of the insn,
>>> >> even if reload can make what appear to be valid
>>> >> assignments later. */
>>> >> if (regno < FIRST_PSEUDO_REGISTER
>>> >> && asm_noperands (PATTERN (use_insn)) >= 0)
>>> >> continue;
>>> >>
>>> >> which excludes combinations by dropping log links, rather than during
>>> >> try_combine. And:
>>> >>
>>> >> /* If this register is being initialized using itself, and the
>>> >> register is uninitialized in this basic block, and there are
>>> >> no LOG_LINKS which set the register, then part of the
>>> >> register is uninitialized. In that case we can't assume
>>> >> anything about the number of nonzero bits.
>>> >>
>>> >> ??? We could do better if we checked this in
>>> >> reg_{nonzero_bits,num_sign_bit_copies}_for_combine. Then we
>>> >> could avoid making assumptions about the insn which initially
>>> >> sets the register, while still using the information in other
>>> >> insns. We would have to be careful to check every insn
>>> >> involved in the combination. */
>>> >>
>>> >> if (insn
>>> >> && reg_referenced_p (x, PATTERN (insn))
>>> >> && !REGNO_REG_SET_P (DF_LR_IN (BLOCK_FOR_INSN (insn)),
>>> >> REGNO (x)))
>>> >> {
>>> >> struct insn_link *link;
>>> >>
>>> >> FOR_EACH_LOG_LINK (link, insn)
>>> >> if (dead_or_set_p (link->insn, x))
>>> >> break;
>>> >> if (!link)
>>> >> {
>>> >> rsp->nonzero_bits = GET_MODE_MASK (mode);
>>> >> rsp->sign_bit_copies = 1;
>>> >> return;
>>> >> }
>>> >> }
>>> >>
>>> >> treats the lack of a log link as a possible sign of uninitialised data,
>>> >> but that would be a missed optimisation rather than a correctness issue.
>>> >>
>>> >> One question is what the default --param value should be. I went with
>>> >> Jakub's suggestion of 3000 from:
>>> >>
>>> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c25
>>> >>
>>> >> Also, to answer Jakub's question in that comment, I tried bisecting:
>>> >>
>>> >> int limit = atoi (getenv ("BISECT"));
>>> >>
>>> >> (so applying the limit for all calls from try_combine) with an
>>> >> abort in distribute_links if the limit caused a link to be skipped.
>>> >> The minimum BISECT value that allowed an aarch64-linux-gnu bootstrap
>>> >> to succeed with --enable-languages=all --enable-checking=yes,rtl,extra
>>> >> was 142, so much lower than the parameter value. I realised too late
>>> >> that --enable-checking=release would probably have been a more
>>> >> interesting test.
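>>> >>
>>> >> [For reference, the BISECT search amounts to a standard binary
>>> >> search for the smallest passing limit.  A minimal stand-alone
>>> >> sketch, where `check` is a hypothetical stand-in for the real
>>> >> "BISECT=$1 make bootstrap" run (modelled here as passing from
>>> >> 142 upwards):]

```shell
# Binary-search the smallest BISECT value for which the build succeeds.
# `check` stands in for the real bootstrap-and-test run; here it simply
# models a build that starts passing at limit 142.
check () {
  test "$1" -ge 142
}

lo=0
hi=3000
while [ "$lo" -lt "$hi" ]; do
  mid=$(( (lo + hi) / 2 ))
  if check "$mid"; then
    hi=$mid            # limit works; try a smaller one
  else
    lo=$(( mid + 1 ))  # limit too low; try a larger one
  fi
done
echo "minimum passing BISECT value: $lo"
```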
>>> >>
>>> >> Some of the try_combine changes come from Richi's patch in:
>>> >>
>>> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
>>> >>
>>> >> Bootstrapped & regression-tested on aarch64-linux-gnu,
>>> >> powerpc64le-linux-gnu and x86_64-linux-gnu. OK to install?
>>> >>
>>> >> Richard
>>> >>
>>> >>
>>> >> gcc/
>>> >> PR testsuite/116398
>>> >> * params.opt (-param=max-combine-search-insns=): New param.
>>> >> * doc/invoke.texi: Document it.
>>> >> * combine.cc (distribute_links): Add a limit parameter.
>>> >> (try_combine): Use the new param to limit distribute_links
>>> >> when i2 is unchanged. Return i3 rather than i2 if i2 is unchanged.
>>> >>
>>> >> gcc/testsuite/
>>> >> * gcc.target/aarch64/popcnt-le-1.c: Account for commutativity of TST.
>>> >> * gcc.target/aarch64/popcnt-le-3.c: Likewise AND.
>>> >> * gcc.target/aarch64/sve/pred-not-gen-1.c: Revert previous patch.
>>> >> * gcc.target/aarch64/sve/pred-not-gen-4.c: Likewise.
>>> >> * gcc.target/aarch64/sve/var_stride_2.c: Likewise.
>>> >> * gcc.target/aarch64/sve/var_stride_4.c: Likewise.
>>> >>
>>> >> Co-authored-by: Richard Biener <[email protected]>
>>> >> ---
>>> >> gcc/combine.cc | 30 +++++++++----------
>>> >> gcc/doc/invoke.texi | 9 ++++++
>>> >> gcc/params.opt | 4 +++
>>> >> .../gcc.target/aarch64/popcnt-le-1.c | 4 +--
>>> >> .../gcc.target/aarch64/popcnt-le-3.c | 4 +--
>>> >> gcc/testsuite/gcc.target/aarch64/pr100056.c | 4 ++-
>>> >> .../gcc.target/aarch64/sve/pred-not-gen-1.c | 4 +--
>>> >> .../gcc.target/aarch64/sve/pred-not-gen-4.c | 4 +--
>>> >> .../gcc.target/aarch64/sve/var_stride_2.c | 3 +-
>>> >> .../gcc.target/aarch64/sve/var_stride_4.c | 3 +-
>>> >> 10 files changed, 42 insertions(+), 27 deletions(-)
>>> >>
>>> >> diff --git a/gcc/combine.cc b/gcc/combine.cc
>>> >> index ef13f5d5e90..28403d1f9e0 100644
>>> >> --- a/gcc/combine.cc
>>> >> +++ b/gcc/combine.cc
>>> >> @@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
>>> >> static bool reg_bitfield_target_p (rtx, rtx);
>>> >> static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
>>> >> rtx, rtx, rtx);
>>> >> -static void distribute_links (struct insn_link *);
>>> >> +static void distribute_links (struct insn_link *, int limit = INT_MAX);
>>> >> static void mark_used_regs_combine (rtx);
>>> >> static void record_promoted_value (rtx_insn *, rtx);
>>> >> static bool unmentioned_reg_p (rtx, rtx);
>>> >> @@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>>> >> adjust_for_new_dest (i3);
>>> >> }
>>> >>
>>> >> -  /* If I2 didn't change, this is not a combination (but a simplification or
>>> >> -     canonicalisation with context), which should not be done here.  Doing
>>> >> -     it here explodes the algorithm.  Don't.  */
>>> >> - if (rtx_equal_p (newi2pat, PATTERN (i2)))
>>> >> - {
>>> >> - if (dump_file)
>>> >> - fprintf (dump_file, "i2 didn't change, not doing this\n");
>>> >> - undo_all ();
>>> >> - return 0;
>>> >> - }
>>> >> + bool i2_unchanged = rtx_equal_p (newi2pat, PATTERN (i2));
>>> >>
>>> >> /* We now know that we can do this combination. Merge the insns and
>>> >> update the status of registers and LOG_LINKS. */
>>> >> @@ -4593,10 +4584,11 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>>> >> NULL_RTX, NULL_RTX, NULL_RTX);
>>> >> }
>>> >>
>>> >> - distribute_links (i3links);
>>> >> - distribute_links (i2links);
>>> >> - distribute_links (i1links);
>>> >> - distribute_links (i0links);
>>> >> + int limit = i2_unchanged ? param_max_combine_search_insns : INT_MAX;
>>> >> + distribute_links (i3links, limit);
>>> >> + distribute_links (i2links, limit);
>>> >> + distribute_links (i1links, limit);
>>> >> + distribute_links (i0links, limit);
>>> >
>>> > I'm not sure I understand how the distribute_links mixes in with
>>> > the change below.
>>> >
>>> > The original issue reported in the PR was that returning i2
>>> > caused evern insn between i2 and i3 to be re-processed by the
>>> > main loop, re-trying previously failed combine attempts and
>>> > producing garbage.
>>> >
>>> > The disadvantage of directly returning i3 is that if added_links_insn
>>> > was in between i2 and i3, we fail to re-try insns with changed
>>> > links.
>>>
>>> Can that happen though? log links are supposed to be attached to the
>>> next use of the register (only), so I wouldn't expect the next use to be
>>> before i3 (putatively the previous first use).
>>
>> But log-links on i2 are for uses in i2 of defs before i2; given that
>> i0/i1/i2 have changed or gone away, we adjust those to eventually end
>> up on insns between i2 and i3 (or indeed after i3).  Combine then
>> wants to try combining the insns with changed log-links.
>
> Right. But I meant in the case where i2 hasn't changed, which
> I thought was the case in contention. If i2 hasn't changed then
> its uses are the same, and so there is no need to distribute its
> log links.
>
>>> > But what you do now also limits the insn walk in distribute_links
>>> > (IMO a good thing on its own), without taking advantage of that
>>> > when i2_unchanged (distributing the link to insns between i2 and i3
>>> > is useless in that case).
>>>
>>> Yeah, it's true that distribute_links could potentially start the
>>> search at a better place. But that's IMO a third, somewhat independent
>>> optimisation. And it seems more dangerous than the proposed patch,
>>> since if it turns out that the next use is before i3 in some current
>>> or future circumstances, we could end up generating wrong code.
>>>
>>> > Ideally we'd distribute links to insns between i2 and i3 in a backward
>>> > walk from i3, but pick up the first use if within the limit, as we
>>> > want to limit the number of insns we reprocess in the try_combine
>>> > caller's loop.
>>> >
>>> > Really ideally we'd want to remember all added_links_insn and only
>>> > re-process those, not all other unaffected insns between i2 and i3
>>> > (or earlier).
>>> >
>>> > That said, your change looks like two independent changes? IIRC Jakub
>>> > at some point proposed to limit the 'return i3' below to cases where
>>> > i2 is too far away or sth like that.
>>>
>>> Yes, it's two independent changes. The return i3 one is taken directly
>>> from your comment in:
>>>
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
>>>
>>> I think that's the right thing to do, for the reason you explained there
>>> and above, so I didn't want to lose it. But like you say in the PR,
>>> combine did still take a large slice of time with that change in isolation.
>>>
>>> Jakub discussed the idea of augmenting that change with one that makes
>>> try_combine check luids. The distribute_links part of the patch is
>>> supposed to be an alternative to that.
>>
>> I see. I understood Jakub thinks my change is too aggressive since
>> it does indeed miss combine attempts from changed log-links.
>>
>> I remember that when allowing added_links_insn to override the
>> i3 return for i2_unchanged_p, the bad compile time resurfaced.  So
>> now I wonder whether, if i2_unchanged, distribute_links (i2links)
>> does (should do?) anything? I don't see anything that would avoid
>> adjusting links when there's still a use in newi2pat (IIRC
>> i0 and i1 are always gone?).
>
> Yeah, that's also what I meant. But, once the "return i3" is in place,
> the remaining large compile time partly comes from distributing i3's link
> for i2's destination, and I suppose from the general "walk all instructions
> between A and B" that try_combine uses to check dataflow, with A and B
> getting gradually further apart.
>
> FWIW, with:
>
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index ef13f5d5e90..e12a78c8d89 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
> adjust_for_new_dest (i3);
> }
>
> -  /* If I2 didn't change, this is not a combination (but a simplification or
> -     canonicalisation with context), which should not be done here.  Doing
> -     it here explodes the algorithm.  Don't.  */
> - if (rtx_equal_p (newi2pat, PATTERN (i2)))
> - {
> - if (dump_file)
> - fprintf (dump_file, "i2 didn't change, not doing this\n");
> - undo_all ();
> - return 0;
> - }
> + bool only_i3_changed = !i0 && !i1 && rtx_equal_p (newi2pat, PATTERN (i2));
>
> /* We now know that we can do this combination. Merge the insns and
> update the status of registers and LOG_LINKS. */
> @@ -4785,6 +4776,9 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
> combine_successes++;
> undo_commit ();
>
> + if (only_i3_changed)
> + return i3;
> +
> rtx_insn *ret = newi2pat ? i2 : i3;
> if (added_links_insn && DF_INSN_LUID (added_links_insn) < DF_INSN_LUID (ret))
> ret = added_links_insn;
>
> so essentially your patch, I see an 11.9x improvement in combine's
> compile time in the original testcase for PR101523, going from 88%
> to 41% of total compile time. Memory usage improves 110x, from
> 97% to 25%. The output is unchanged. Adding:
>
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index e12a78c8d89..becfd0a65ff 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
> static bool reg_bitfield_target_p (rtx, rtx);
> static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
> rtx, rtx, rtx);
> -static void distribute_links (struct insn_link *);
> +static void distribute_links (struct insn_link *, rtx_insn * = nullptr);
> static void mark_used_regs_combine (rtx);
> static void record_promoted_value (rtx_insn *, rtx);
> static bool unmentioned_reg_p (rtx, rtx);
> @@ -4209,6 +4209,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
> }
>
> bool only_i3_changed = !i0 && !i1 && rtx_equal_p (newi2pat, PATTERN (i2));
> + if (only_i3_changed)
> + swap_i2i3 = false;
Oops, this was supposed to be split_i2i3. With that fixed...
>
> /* We now know that we can do this combination. Merge the insns and
> update the status of registers and LOG_LINKS. */
> @@ -4584,8 +4586,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
> NULL_RTX, NULL_RTX, NULL_RTX);
> }
>
> - distribute_links (i3links);
> - distribute_links (i2links);
> + distribute_links (i3links, only_i3_changed ? i3 : nullptr);
> + distribute_links (i2links, !i0 && !i1 ? i2 : nullptr);
> distribute_links (i1links);
> distribute_links (i0links);
>
> @@ -14978,10 +14980,13 @@ distribute_notes (rtx notes, rtx_insn *from_insn, rtx_insn *i3, rtx_insn *i2,
>
> /* Similarly to above, distribute the LOG_LINKS that used to be present on
> I3, I2, and I1 to new locations. This is also called to add a link
> - pointing at I3 when I3's destination is changed. */
> + pointing at I3 when I3's destination is changed.
> +
> + If START is nonnull and an insn, we know that the next location for
> + each link is no earlier than START. */
>
> static void
> -distribute_links (struct insn_link *links)
> +distribute_links (struct insn_link *links, rtx_insn *start)
> {
> struct insn_link *link, *next_link;
>
> @@ -15047,7 +15052,10 @@ distribute_links (struct insn_link *links)
> I3 to I2. Also note that not much searching is typically done here
> since most links don't point very far away. */
>
> - for (insn = NEXT_INSN (link->insn);
> + insn = start;
> + if (!insn || NOTE_P (insn))
> + insn = NEXT_INSN (link->insn);
> + for (;
> (insn && (this_basic_block->next_bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
> || BB_HEAD (this_basic_block->next_bb) != insn));
> insn = NEXT_INSN (insn))
>
> improves compile time a further 17.7%, going from 41% to 33% of
[I meant 37%, gah]
> compile time.
...the improvement becomes 34.6%, going from 41% to 34%.
Thanks,
Richard