Re: [PATCH] PR tree-optimization/112508: Intelligently throttle loop store motion with -Os

Richard Biener Tue, 19 May 2026 23:14:14 -0700

On Sat, May 16, 2026 at 2:32 PM Roger Sayle <[email protected]> wrote:
>
>
> Hi Richard,
> Many thanks for the review.
>
> > Please also cache ref_always_accessed_p, it's a quite expensive walk over 
> > all
> > accesses.
>
> Can you be a bit more specific about why you think it's expensive to walk over
> accesses (which are already indexed by ref in vectors), and where/how you'd
> like to see caching implemented?  The reason I ask, is that after much 
> profiling
> and instrumentation, I don't see a significant amount of time being in spent 
> in
> this code.
>
> In a stage2 and stage3 of a GCC bootstrap, for_all_locs_in_loop is called
> 74433 times, calls the functor FN 154251 times, of which 125977 (81.7%)
> are ref_always_accessed.  53260 (42%) of these calls fail the store check,
> 62904 (50%) calls are not always executed, 9125 (7%) calls are always
> executed in this loop or a subloop, and 688 (0.5%) calls always executed
> but outside of this loop.
>
> Investigating the use of bsearch in for_all_locs_in_loop, the comparison
> function is executed 150818 times (averaging 2 times per bsearch), with
> the largest ref->accesses_in_loop vector being 244 elements.  The
> distribution in access_in_loop lengths is very skewed, with 61% of the
> calls to bsearch being on vectors of length 4 or less.
>
> len=1 8260/74433 (11%)
> len=2 23054/74433 (31%)
> len=3 5750/74433 (8%)
> len=4 8331/74433 (11%)
>
> Running the testsuite with "make -k check", for_all_locs_in_loop is called
> 171306 times, calls the functor FN 247866 times.  The comparator is called
> 204502 times (1.2 times per bsearch), the largest accesses_in_loop vector
> being 76 elements, and 90% of the bsearch calls are for length() <= 4.
>
> The executive summary is that there's probably a saving from handling
> short vectors in for_all_locs_in_loop, but the milliseconds it would shave
> off of a bootstrap probably wouldn't be worth the added complexity.
> Please let me know if you disagree, and I'll submit a follow-up patch for
> that change.


Heh, so I was merely asking because ref_always_accessed_p walks over
the set of locations which in principle doesn't have any bounded size.
I didn't consider the complexity of caching, a "simple" way might be

int always_stored = -1;
  if (...
      && !(always_stored = ref_always_stored_p (loop, ref)))

...

    if (always_stored == -1 ? ref_always_stored_p (loop, ref) : always_stored)

I suppose C++ now might have "promises" to make this less ugly?  Aka,
compute-on-use.


>
> There is however a minor code clean-up that saves a few microseconds,
> but as this reduces code complexity there's no reason not to propose it.
> ref_always_accessed_p is (currently) only ever called with stored_p being
> true, so specializing for this case, renaming ref_always_accessed{,_p} to
> ref_always_stored{,_p} saves storage and some redundant checks at run-time.

The patch below is OK.  For the original patch I'm still asking to
check loop-nest_optimized_for_size_p () instead of function-for-size.

Richard.

>
> 2026-05-16  Roger Sayle  <[email protected]>
>
> gcc/ChangeLog
>         * tree-ssa-loop-im.cc (ref_always_accessed_p): Rename to...
>         (ref_always_stored_p): New function specialized to determine if
>         REF is a store that is always executed in LOOP.
>         (execute_sm): Use ref_always_stored_p instead of
>         ref_always_accessed_p.
>         (class ref_always_accessed): Rename to..
>         (class ref_always_stored): Remove (always true) stored_p field.
>         (ref_always_stored::operator ()): Always check for a store.
>         Move hash table lookup , get_lim_data, after store test.
>         (can_sm_ref_p): Use ref_always_stored_p insead of
>         ref_always_accessed_p.
>
>
> Please let me know what you think.
> If you could approve this patch, and the previous one (as written) that
> would be much appreciated.  If there's a compile-hog test case that can
> be reduced and added to the testsuite, that would be great, but I doubt
> that -Os is used significantly on the types of scientific applications that
> have hugely complicated loop nests.
>
> [Perhaps this was a test; an AI coding assistant would blindly add
> caching whether it was needed or not].
>
> Cheers,
> Roger
> --
>
> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: 11 May 2026 14:01
> > To: Roger Sayle <[email protected]>
> > Cc: [email protected]
> > Subject: Re: [PATCH] PR tree-optimization/112508: Intelligently throttle 
> > loop store
> > motion with -Os
> >
> > On Mon, May 11, 2026 at 10:46 AM Roger Sayle <[email protected]>
> > wrote:
> > >
> > >
> > > This patch is my (initial) solution to PR tree-optimization/112508,
> > > the observation that tree-ssa's loop store motion frequently increases
> > > the size of a function and therefore plays poorly with -Os and -Oz.
> > > There are two challenges that complicate things, and prevent simply
> > > disabling this pass (equivalent to
> > > -fno-move-loop-stores)
> > > for being an ideal solution.  There first is that store motion is not
> > > universally bad, sometimes it helps, but often it doesn't.  The second
> > > challenge is that this pass also performs analyses and other invariant
> > > motion that helps later passes.
> > >
> > > To demonstrate this delicate balance consider the following two (tiny)
> > > loops inspired by CSiBE's linux kernel benchmark (which aren't 
> > > equivalent).
> > >
> > > void loop1 (short i) {
> > >   for (;i;i--)
> > >     _wdtc.bit.WTE = 0;
> > > }
> > >
> > > void loop2 (short i) {
> > >   do {
> > >     _wdtc.bit.WTE = 0;
> > >   } while (--i);
> > > }
> > >
> > > Currently on x86_64, with -Os loop1 is 25 bytes and loop2 is 8 bytes.
> > > Adding the
> > > -fno-move-loop-stores flag decreases loop1 to 17 bytes, but increases
> > > loop2 to 13 bytes.  Without store motion we fail to eliminate the
> > > loop.
> > >
> > > The correct solution is to intelligently determine whether a
> > > particular loop store motion is space saving or not.  This patch adds
> > > an extra clause to the predicate can_sm_ref_p when optimizing the
> > > function for size, to restrict store motion to unconditional stores in
> > > single exit loops.  Moving a store by duplicating it on multiple loop
> > > exit edges can obviously increase size.  Likewise, the additional
> > > logic (and flag variable) for when a store is conditionally executed (i.e.
> > > only
> > > executed sometimes) requires extra instructions not present in the
> > > original code.
> > >
> > > With this patch, using just -Os, loop1 above is 17 bytes and loop2 is
> > > 8 bytes (i.e. the best of both worlds).
> > >
> > > Importantly, this change gives the store motion pass some logic that
> > > can be tweaked and refined in future, if examples requiring more
> > > complex decision making are discovered.  For example, when
> > > optimize_loop_for_size returns -Oz (and the enclosing function or loop
> > > is less aggressively optimized for size) could potentially be
> > > interpreted as a hint to always perform store motion, minimizing the
> > > size of the loop body, even at the expense of larger total code size.
> > > Some comments in the PR mention hot vs. cold basic blocks, but this
> > > only affects performance, and isn't relevant for -Os, i.e. (total)
> > > code size.
> > >
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with no new failures.  Ok for mainline?
> >
> > +  /* Store motion decreases the size of the loop, but often increases
> > +     the size of the function.  If optimizing the function for size,
> > +     be careful about which REFs to move.  */  if
> > + (optimize_function_for_size_p (cfun))
> > +    {
> >
> > can you use optimize_loop_nest_for_size_p (loop) here?
> >
> > Please also cache ref_always_accessed_p, it's a quite expensive walk over 
> > all
> > accesses.
> >
> > I think the patch is OK with those two adjustments.
> >
> > I suppose one should weight the number of exits against the number of 
> > accesses
> > in the loop - those get replaced by reg-reg copies.  For conditional 
> > accesses in the
> > loop there's the opportunity to if-convert some blocks - unsure if that 
> > would save
> > code-size though.
> >
> > One of the usual complaints with store motion is the effect on register 
> > pressure
> > and spilling that's eventually caused.  A first step to address this would 
> > be to rank
> > store motion candidates based on (weighted?) number of loads/stores 
> > eliminated,
> > so one can still move the first N important candidates.
> >
> > Richard.
> >
> > >
> > > 2026-05-11  Roger Sayle  <[email protected]>
> > >
> > > gcc/ChangeLog
> > >         PR tree-optimization/112508
> > >         * tree-ssa-loop-im.cc (can_sm_ref_p): When optimizing for
> > > size, only move
> > >         unconditional stores from loops with a single exit.
> > >
> > > gcc/testsuite/ChangeLog
> > >         * gcc.target/i386/pr112508-1.c: New test case.
> > >         * gcc.target/i386/pr112508-2.c: Likewise.
> > >
> > >
> > > Thanks in advance,
> > > Roger
> > > --
> > >

Re: [PATCH] PR tree-optimization/112508: Intelligently throttle loop store motion with -Os

Reply via email to