RE: [PATCH] PR tree-optimization/112508: Intelligently throttle loop store motion with -Os

Roger Sayle Sat, 16 May 2026 05:32:57 -0700

Hi Richard,
Many thanks for the review.

> Please also cache ref_always_accessed_p, it's a quite expensive walk over all
> accesses.

Can you be a bit more specific about why you think it's expensive to walk over
accesses (which are already indexed by ref in vectors), and where/how you'd
like to see caching implemented?  The reason I ask, is that after much profiling
and instrumentation, I don't see a significant amount of time being in spent in
this code.

In a stage2 and stage3 of a GCC bootstrap, for_all_locs_in_loop is called
74433 times, calls the functor FN 154251 times, of which 125977 (81.7%)
are ref_always_accessed.  53260 (42%) of these calls fail the store check,
62904 (50%) calls are not always executed, 9125 (7%) calls are always
executed in this loop or a subloop, and 688 (0.5%) calls always executed
but outside of this loop.

Investigating the use of bsearch in for_all_locs_in_loop, the comparison
function is executed 150818 times (averaging 2 times per bsearch), with
the largest ref->accesses_in_loop vector being 244 elements.  The
distribution in access_in_loop lengths is very skewed, with 61% of the
calls to bsearch being on vectors of length 4 or less.

len=1 8260/74433 (11%)
len=2 23054/74433 (31%)
len=3 5750/74433 (8%)
len=4 8331/74433 (11%)

Running the testsuite with "make -k check", for_all_locs_in_loop is called
171306 times, calls the functor FN 247866 times.  The comparator is called
204502 times (1.2 times per bsearch), the largest accesses_in_loop vector
being 76 elements, and 90% of the bsearch calls are for length() <= 4.

The executive summary is that there's probably a saving from handling
short vectors in for_all_locs_in_loop, but the milliseconds it would shave
off of a bootstrap probably wouldn't be worth the added complexity.
Please let me know if you disagree, and I'll submit a follow-up patch for
that change.

There is however a minor code clean-up that saves a few microseconds,
but as this reduces code complexity there's no reason not to propose it.
ref_always_accessed_p is (currently) only ever called with stored_p being
true, so specializing for this case, renaming ref_always_accessed{,_p} to
ref_always_stored{,_p} saves storage and some redundant checks at run-time.

2026-05-16  Roger Sayle  <[email protected]>

gcc/ChangeLog
        * tree-ssa-loop-im.cc (ref_always_accessed_p): Rename to...
        (ref_always_stored_p): New function specialized to determine if
        REF is a store that is always executed in LOOP.
        (execute_sm): Use ref_always_stored_p instead of
        ref_always_accessed_p.
        (class ref_always_accessed): Rename to..
        (class ref_always_stored): Remove (always true) stored_p field.
        (ref_always_stored::operator ()): Always check for a store.
        Move hash table lookup , get_lim_data, after store test.
        (can_sm_ref_p): Use ref_always_stored_p insead of
        ref_always_accessed_p.

Please let me know what you think.
If you could approve this patch, and the previous one (as written) that
would be much appreciated.  If there's a compile-hog test case that can
be reduced and added to the testsuite, that would be great, but I doubt
that -Os is used significantly on the types of scientific applications that
have hugely complicated loop nests.

[Perhaps this was a test; an AI coding assistant would blindly add
caching whether it was needed or not].

Cheers,
Roger
--

> -----Original Message-----
> From: Richard Biener <[email protected]>
> Sent: 11 May 2026 14:01
> To: Roger Sayle <[email protected]>
> Cc: [email protected]
> Subject: Re: [PATCH] PR tree-optimization/112508: Intelligently throttle loop 
> store
> motion with -Os
> 
> On Mon, May 11, 2026 at 10:46 AM Roger Sayle <[email protected]>
> wrote:
> >
> >
> > This patch is my (initial) solution to PR tree-optimization/112508,
> > the observation that tree-ssa's loop store motion frequently increases
> > the size of a function and therefore plays poorly with -Os and -Oz.
> > There are two challenges that complicate things, and prevent simply
> > disabling this pass (equivalent to
> > -fno-move-loop-stores)
> > for being an ideal solution.  There first is that store motion is not
> > universally bad, sometimes it helps, but often it doesn't.  The second
> > challenge is that this pass also performs analyses and other invariant
> > motion that helps later passes.
> >
> > To demonstrate this delicate balance consider the following two (tiny)
> > loops inspired by CSiBE's linux kernel benchmark (which aren't equivalent).
> >
> > void loop1 (short i) {
> >   for (;i;i--)
> >     _wdtc.bit.WTE = 0;
> > }
> >
> > void loop2 (short i) {
> >   do {
> >     _wdtc.bit.WTE = 0;
> >   } while (--i);
> > }
> >
> > Currently on x86_64, with -Os loop1 is 25 bytes and loop2 is 8 bytes.
> > Adding the
> > -fno-move-loop-stores flag decreases loop1 to 17 bytes, but increases
> > loop2 to 13 bytes.  Without store motion we fail to eliminate the
> > loop.
> >
> > The correct solution is to intelligently determine whether a
> > particular loop store motion is space saving or not.  This patch adds
> > an extra clause to the predicate can_sm_ref_p when optimizing the
> > function for size, to restrict store motion to unconditional stores in
> > single exit loops.  Moving a store by duplicating it on multiple loop
> > exit edges can obviously increase size.  Likewise, the additional
> > logic (and flag variable) for when a store is conditionally executed (i.e.
> > only
> > executed sometimes) requires extra instructions not present in the
> > original code.
> >
> > With this patch, using just -Os, loop1 above is 17 bytes and loop2 is
> > 8 bytes (i.e. the best of both worlds).
> >
> > Importantly, this change gives the store motion pass some logic that
> > can be tweaked and refined in future, if examples requiring more
> > complex decision making are discovered.  For example, when
> > optimize_loop_for_size returns -Oz (and the enclosing function or loop
> > is less aggressively optimized for size) could potentially be
> > interpreted as a hint to always perform store motion, minimizing the
> > size of the loop body, even at the expense of larger total code size.
> > Some comments in the PR mention hot vs. cold basic blocks, but this
> > only affects performance, and isn't relevant for -Os, i.e. (total)
> > code size.
> >
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> 
> +  /* Store motion decreases the size of the loop, but often increases
> +     the size of the function.  If optimizing the function for size,
> +     be careful about which REFs to move.  */  if
> + (optimize_function_for_size_p (cfun))
> +    {
> 
> can you use optimize_loop_nest_for_size_p (loop) here?
> 
> Please also cache ref_always_accessed_p, it's a quite expensive walk over all
> accesses.
> 
> I think the patch is OK with those two adjustments.
> 
> I suppose one should weight the number of exits against the number of accesses
> in the loop - those get replaced by reg-reg copies.  For conditional accesses 
> in the
> loop there's the opportunity to if-convert some blocks - unsure if that would 
> save
> code-size though.
> 
> One of the usual complaints with store motion is the effect on register 
> pressure
> and spilling that's eventually caused.  A first step to address this would be 
> to rank
> store motion candidates based on (weighted?) number of loads/stores 
> eliminated,
> so one can still move the first N important candidates.
> 
> Richard.
> 
> >
> > 2026-05-11  Roger Sayle  <[email protected]>
> >
> > gcc/ChangeLog
> >         PR tree-optimization/112508
> >         * tree-ssa-loop-im.cc (can_sm_ref_p): When optimizing for
> > size, only move
> >         unconditional stores from loops with a single exit.
> >
> > gcc/testsuite/ChangeLog
> >         * gcc.target/i386/pr112508-1.c: New test case.
> >         * gcc.target/i386/pr112508-2.c: Likewise.
> >
> >
> > Thanks in advance,
> > Roger
> > --
> >

diff --git a/gcc/tree-ssa-loop-im.cc b/gcc/tree-ssa-loop-im.cc
index 72e19981698..b021329c605 100644
--- a/gcc/tree-ssa-loop-im.cc
+++ b/gcc/tree-ssa-loop-im.cc
@@ -262,7 +262,7 @@ static bitmap_obstack lim_bitmap_obstack;
 static obstack mem_ref_obstack;
 
 static bool ref_indep_loop_p (class loop *, im_mem_ref *, dep_kind);
-static bool ref_always_accessed_p (class loop *, im_mem_ref *, bool);
+static bool ref_always_stored_p (class loop *, im_mem_ref *);
 static bool refs_independent_p (im_mem_ref *, im_mem_ref *, bool = true);
 
 /* Minimum cost of an expensive expression.  */
@@ -2320,7 +2320,7 @@ execute_sm (class loop *loop, im_mem_ref *ref,
   fmt_data.orig_loop = loop;
   for_each_index (&ref->mem.ref, force_move_till, &fmt_data);
 
-  bool always_stored = ref_always_accessed_p (loop, ref, true);
+  bool always_stored = ref_always_stored_p (loop, ref);
   if (maybe_mt
       && (bb_in_transaction (loop_preheader_edge (loop)->src)
          || (ref_can_have_store_data_races (ref->mem.ref) && ! always_stored)))
@@ -3098,36 +3098,29 @@ hoist_memory_references (class loop *loop, bitmap 
mem_refs,
     delete (*iter).second;
 }
 
-class ref_always_accessed
+class ref_always_stored
 {
 public:
-  ref_always_accessed (class loop *loop_, bool stored_p_)
-      : loop (loop_), stored_p (stored_p_) {}
+  ref_always_stored (class loop *loop_)
+      : loop (loop_) {}
   bool operator () (mem_ref_loc *loc);
   class loop *loop;
-  bool stored_p;
 };
 
 bool
-ref_always_accessed::operator () (mem_ref_loc *loc)
+ref_always_stored::operator () (mem_ref_loc *loc)
 {
-  class loop *must_exec;
+  /* Make sure the statement is a store.  */
+  tree lhs = gimple_get_lhs (loc->stmt);
+  if (!lhs
+      || !(DECL_P (lhs) || REFERENCE_CLASS_P (lhs)))
+    return false;
 
   struct lim_aux_data *lim_data = get_lim_data (loc->stmt);
   if (!lim_data)
     return false;
 
-  /* If we require an always executed store make sure the statement
-     is a store.  */
-  if (stored_p)
-    {
-      tree lhs = gimple_get_lhs (loc->stmt);
-      if (!lhs
-         || !(DECL_P (lhs) || REFERENCE_CLASS_P (lhs)))
-       return false;
-    }
-
-  must_exec = lim_data->always_executed_in;
+  class loop *must_exec = lim_data->always_executed_in;
   if (!must_exec)
     return false;
 
@@ -3138,14 +3131,12 @@ ref_always_accessed::operator () (mem_ref_loc *loc)
   return false;
 }
 
-/* Returns true if REF is always accessed in LOOP.  If STORED_P is true
-   make sure REF is always stored to in LOOP.  */
+/* Returns true if REF is always stored in LOOP.  */
 
 static bool
-ref_always_accessed_p (class loop *loop, im_mem_ref *ref, bool stored_p)
+ref_always_stored_p (class loop *loop, im_mem_ref *ref)
 {
-  return for_all_locs_in_loop (loop, ref,
-                              ref_always_accessed (loop, stored_p));
+  return for_all_locs_in_loop (loop, ref, ref_always_stored (loop));
 }
 
 /* Returns true if LOAD_REF and STORE_REF form a "self write" pattern
@@ -3378,7 +3369,7 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
         stored to later anyway.  So this would only guard
         the load we need to emit.  Thus when the ref is not
         loaded we can elide this completely?  */
-      && !ref_always_accessed_p (loop, ref, true))
+      && !ref_always_stored_p (loop, ref))
     return false;
 
   /* Verify all loads of ref can be hoisted.  */

RE: [PATCH] PR tree-optimization/112508: Intelligently throttle loop store motion with -Os

Reply via email to