Hi Richard,
Many thanks for the review.
> Please also cache ref_always_accessed_p, it's a quite expensive walk over all
> accesses.
Can you be a bit more specific about why you think it's expensive to walk over
accesses (which are already indexed by ref in vectors), and where/how you'd
like to see caching implemented? The reason I ask, is that after much profiling
and instrumentation, I don't see a significant amount of time being in spent in
this code.
In a stage2 and stage3 of a GCC bootstrap, for_all_locs_in_loop is called
74433 times, calls the functor FN 154251 times, of which 125977 (81.7%)
are ref_always_accessed. 53260 (42%) of these calls fail the store check,
62904 (50%) calls are not always executed, 9125 (7%) calls are always
executed in this loop or a subloop, and 688 (0.5%) calls always executed
but outside of this loop.
Investigating the use of bsearch in for_all_locs_in_loop, the comparison
function is executed 150818 times (averaging 2 times per bsearch), with
the largest ref->accesses_in_loop vector being 244 elements. The
distribution in access_in_loop lengths is very skewed, with 61% of the
calls to bsearch being on vectors of length 4 or less.
len=1 8260/74433 (11%)
len=2 23054/74433 (31%)
len=3 5750/74433 (8%)
len=4 8331/74433 (11%)
Running the testsuite with "make -k check", for_all_locs_in_loop is called
171306 times, calls the functor FN 247866 times. The comparator is called
204502 times (1.2 times per bsearch), the largest accesses_in_loop vector
being 76 elements, and 90% of the bsearch calls are for length() <= 4.
The executive summary is that there's probably a saving from handling
short vectors in for_all_locs_in_loop, but the milliseconds it would shave
off of a bootstrap probably wouldn't be worth the added complexity.
Please let me know if you disagree, and I'll submit a follow-up patch for
that change.
There is however a minor code clean-up that saves a few microseconds,
but as this reduces code complexity there's no reason not to propose it.
ref_always_accessed_p is (currently) only ever called with stored_p being
true, so specializing for this case, renaming ref_always_accessed{,_p} to
ref_always_stored{,_p} saves storage and some redundant checks at run-time.
2026-05-16 Roger Sayle <[email protected]>
gcc/ChangeLog
* tree-ssa-loop-im.cc (ref_always_accessed_p): Rename to...
(ref_always_stored_p): New function specialized to determine if
REF is a store that is always executed in LOOP.
(execute_sm): Use ref_always_stored_p instead of
ref_always_accessed_p.
(class ref_always_accessed): Rename to..
(class ref_always_stored): Remove (always true) stored_p field.
(ref_always_stored::operator ()): Always check for a store.
Move hash table lookup , get_lim_data, after store test.
(can_sm_ref_p): Use ref_always_stored_p insead of
ref_always_accessed_p.
Please let me know what you think.
If you could approve this patch, and the previous one (as written) that
would be much appreciated. If there's a compile-hog test case that can
be reduced and added to the testsuite, that would be great, but I doubt
that -Os is used significantly on the types of scientific applications that
have hugely complicated loop nests.
[Perhaps this was a test; an AI coding assistant would blindly add
caching whether it was needed or not].
Cheers,
Roger
--
> -----Original Message-----
> From: Richard Biener <[email protected]>
> Sent: 11 May 2026 14:01
> To: Roger Sayle <[email protected]>
> Cc: [email protected]
> Subject: Re: [PATCH] PR tree-optimization/112508: Intelligently throttle loop
> store
> motion with -Os
>
> On Mon, May 11, 2026 at 10:46 AM Roger Sayle <[email protected]>
> wrote:
> >
> >
> > This patch is my (initial) solution to PR tree-optimization/112508,
> > the observation that tree-ssa's loop store motion frequently increases
> > the size of a function and therefore plays poorly with -Os and -Oz.
> > There are two challenges that complicate things, and prevent simply
> > disabling this pass (equivalent to
> > -fno-move-loop-stores)
> > for being an ideal solution. There first is that store motion is not
> > universally bad, sometimes it helps, but often it doesn't. The second
> > challenge is that this pass also performs analyses and other invariant
> > motion that helps later passes.
> >
> > To demonstrate this delicate balance consider the following two (tiny)
> > loops inspired by CSiBE's linux kernel benchmark (which aren't equivalent).
> >
> > void loop1 (short i) {
> > for (;i;i--)
> > _wdtc.bit.WTE = 0;
> > }
> >
> > void loop2 (short i) {
> > do {
> > _wdtc.bit.WTE = 0;
> > } while (--i);
> > }
> >
> > Currently on x86_64, with -Os loop1 is 25 bytes and loop2 is 8 bytes.
> > Adding the
> > -fno-move-loop-stores flag decreases loop1 to 17 bytes, but increases
> > loop2 to 13 bytes. Without store motion we fail to eliminate the
> > loop.
> >
> > The correct solution is to intelligently determine whether a
> > particular loop store motion is space saving or not. This patch adds
> > an extra clause to the predicate can_sm_ref_p when optimizing the
> > function for size, to restrict store motion to unconditional stores in
> > single exit loops. Moving a store by duplicating it on multiple loop
> > exit edges can obviously increase size. Likewise, the additional
> > logic (and flag variable) for when a store is conditionally executed (i.e.
> > only
> > executed sometimes) requires extra instructions not present in the
> > original code.
> >
> > With this patch, using just -Os, loop1 above is 17 bytes and loop2 is
> > 8 bytes (i.e. the best of both worlds).
> >
> > Importantly, this change gives the store motion pass some logic that
> > can be tweaked and refined in future, if examples requiring more
> > complex decision making are discovered. For example, when
> > optimize_loop_for_size returns -Oz (and the enclosing function or loop
> > is less aggressively optimized for size) could potentially be
> > interpreted as a hint to always perform store motion, minimizing the
> > size of the loop body, even at the expense of larger total code size.
> > Some comments in the PR mention hot vs. cold basic blocks, but this
> > only affects performance, and isn't relevant for -Os, i.e. (total)
> > code size.
> >
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures. Ok for mainline?
>
> + /* Store motion decreases the size of the loop, but often increases
> + the size of the function. If optimizing the function for size,
> + be careful about which REFs to move. */ if
> + (optimize_function_for_size_p (cfun))
> + {
>
> can you use optimize_loop_nest_for_size_p (loop) here?
>
> Please also cache ref_always_accessed_p, it's a quite expensive walk over all
> accesses.
>
> I think the patch is OK with those two adjustments.
>
> I suppose one should weight the number of exits against the number of accesses
> in the loop - those get replaced by reg-reg copies. For conditional accesses
> in the
> loop there's the opportunity to if-convert some blocks - unsure if that would
> save
> code-size though.
>
> One of the usual complaints with store motion is the effect on register
> pressure
> and spilling that's eventually caused. A first step to address this would be
> to rank
> store motion candidates based on (weighted?) number of loads/stores
> eliminated,
> so one can still move the first N important candidates.
>
> Richard.
>
> >
> > 2026-05-11 Roger Sayle <[email protected]>
> >
> > gcc/ChangeLog
> > PR tree-optimization/112508
> > * tree-ssa-loop-im.cc (can_sm_ref_p): When optimizing for
> > size, only move
> > unconditional stores from loops with a single exit.
> >
> > gcc/testsuite/ChangeLog
> > * gcc.target/i386/pr112508-1.c: New test case.
> > * gcc.target/i386/pr112508-2.c: Likewise.
> >
> >
> > Thanks in advance,
> > Roger
> > --
> >
diff --git a/gcc/tree-ssa-loop-im.cc b/gcc/tree-ssa-loop-im.cc
index 72e19981698..b021329c605 100644
--- a/gcc/tree-ssa-loop-im.cc
+++ b/gcc/tree-ssa-loop-im.cc
@@ -262,7 +262,7 @@ static bitmap_obstack lim_bitmap_obstack;
static obstack mem_ref_obstack;
static bool ref_indep_loop_p (class loop *, im_mem_ref *, dep_kind);
-static bool ref_always_accessed_p (class loop *, im_mem_ref *, bool);
+static bool ref_always_stored_p (class loop *, im_mem_ref *);
static bool refs_independent_p (im_mem_ref *, im_mem_ref *, bool = true);
/* Minimum cost of an expensive expression. */
@@ -2320,7 +2320,7 @@ execute_sm (class loop *loop, im_mem_ref *ref,
fmt_data.orig_loop = loop;
for_each_index (&ref->mem.ref, force_move_till, &fmt_data);
- bool always_stored = ref_always_accessed_p (loop, ref, true);
+ bool always_stored = ref_always_stored_p (loop, ref);
if (maybe_mt
&& (bb_in_transaction (loop_preheader_edge (loop)->src)
|| (ref_can_have_store_data_races (ref->mem.ref) && ! always_stored)))
@@ -3098,36 +3098,29 @@ hoist_memory_references (class loop *loop, bitmap
mem_refs,
delete (*iter).second;
}
-class ref_always_accessed
+class ref_always_stored
{
public:
- ref_always_accessed (class loop *loop_, bool stored_p_)
- : loop (loop_), stored_p (stored_p_) {}
+ ref_always_stored (class loop *loop_)
+ : loop (loop_) {}
bool operator () (mem_ref_loc *loc);
class loop *loop;
- bool stored_p;
};
bool
-ref_always_accessed::operator () (mem_ref_loc *loc)
+ref_always_stored::operator () (mem_ref_loc *loc)
{
- class loop *must_exec;
+ /* Make sure the statement is a store. */
+ tree lhs = gimple_get_lhs (loc->stmt);
+ if (!lhs
+ || !(DECL_P (lhs) || REFERENCE_CLASS_P (lhs)))
+ return false;
struct lim_aux_data *lim_data = get_lim_data (loc->stmt);
if (!lim_data)
return false;
- /* If we require an always executed store make sure the statement
- is a store. */
- if (stored_p)
- {
- tree lhs = gimple_get_lhs (loc->stmt);
- if (!lhs
- || !(DECL_P (lhs) || REFERENCE_CLASS_P (lhs)))
- return false;
- }
-
- must_exec = lim_data->always_executed_in;
+ class loop *must_exec = lim_data->always_executed_in;
if (!must_exec)
return false;
@@ -3138,14 +3131,12 @@ ref_always_accessed::operator () (mem_ref_loc *loc)
return false;
}
-/* Returns true if REF is always accessed in LOOP. If STORED_P is true
- make sure REF is always stored to in LOOP. */
+/* Returns true if REF is always stored in LOOP. */
static bool
-ref_always_accessed_p (class loop *loop, im_mem_ref *ref, bool stored_p)
+ref_always_stored_p (class loop *loop, im_mem_ref *ref)
{
- return for_all_locs_in_loop (loop, ref,
- ref_always_accessed (loop, stored_p));
+ return for_all_locs_in_loop (loop, ref, ref_always_stored (loop));
}
/* Returns true if LOAD_REF and STORE_REF form a "self write" pattern
@@ -3378,7 +3369,7 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
stored to later anyway. So this would only guard
the load we need to emit. Thus when the ref is not
loaded we can elide this completely? */
- && !ref_always_accessed_p (loop, ref, true))
+ && !ref_always_stored_p (loop, ref))
return false;
/* Verify all loads of ref can be hoisted. */