Re: [PATCH] PR tree-optimization/112508: Intelligently throttle loop store motion with -Os

Richard Biener Mon, 11 May 2026 06:02:21 -0700

On Mon, May 11, 2026 at 10:46 AM Roger Sayle <[email protected]> wrote:
>
>
> This patch is my (initial) solution to PR tree-optimization/112508, the
> observation that tree-ssa's loop store motion frequently increases the size
> of a function and therefore plays poorly with -Os and -Oz.  There are two
> challenges that complicate things, and prevent simply disabling this pass
> (equivalent to -fno-move-loop-stores) from being an ideal solution.  The
> first is that store motion is not universally bad: sometimes it helps, but
> often it doesn't.  The second challenge is that this pass also performs
> analyses and other invariant motion that help later passes.
>
> To demonstrate this delicate balance, consider the following two (tiny) loops
> inspired by CSiBE's Linux kernel benchmark (note that the two loops aren't
> equivalent to each other).
>
> void loop1 (short i) {
>   for (;i;i--)
>     _wdtc.bit.WTE = 0;
> }
>
> void loop2 (short i) {
>   do {
>     _wdtc.bit.WTE = 0;
>   } while (--i);
> }
>
> Currently on x86_64, with -Os loop1 is 25 bytes and loop2 is 8 bytes.  Adding
> the -fno-move-loop-stores flag decreases loop1 to 17 bytes, but increases
> loop2 to 13 bytes.  Without store motion we fail to eliminate the loop in
> loop2.
>
> The correct solution is to intelligently determine whether a particular loop
> store motion is space saving or not.  This patch adds an extra clause to the
> predicate can_sm_ref_p when optimizing the function for size, to restrict
> store motion to unconditional stores in single exit loops.  Moving a store by
> duplicating it on multiple loop exit edges can obviously increase size.
> Likewise, the additional logic (and flag variable) for when a store is
> conditionally executed (i.e. only executed sometimes) requires extra
> instructions not present in the original code.
>
> With this patch, using just -Os, loop1 above is 17 bytes and loop2 is 8 bytes
> (i.e. the best of both worlds).
>
> Importantly, this change gives the store motion pass some logic that can be
> tweaked and refined in future, if examples requiring more complex decision
> making are discovered.  For example, when optimize_loop_for_size returns -Oz
> (and the enclosing function or loop is less aggressively optimized for size),
> this could potentially be interpreted as a hint to always perform store
> motion, minimizing the size of the loop body, even at the expense of larger
> total code size.  Some comments in the PR mention hot vs. cold basic blocks,
> but this only affects performance, and isn't relevant for -Os, i.e. (total)
> code size.
>
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32} with
> no new failures.  Ok for mainline?
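
For readers following along, here is a rough source-level picture of the two
cases the new clause excludes.  This is only a sketch in plain C, not what the
pass emits: the global wte and all four function names are made up for
illustration, and whether the pass loads the old value in the preheader is
glossed over.  The *_sm variants approximate the code after store motion: the
conditional store needs a flag plus a test on exit, and the two-exit loop
needs the store materialized once per exit.

```c
#include <stdbool.h>

int wte;  /* hypothetical stand-in for the location being stored to */

/* Conditionally executed store, as written.  */
void cond_store_orig(const int *a, int n)
{
  for (int i = 0; i < n; i++)
    if (a[i] < 0)
      wte = 0;
}

/* Approximation after store motion: a temporary, a flag, and a guarded
   store after the loop.  The flag logic is code the original lacks.  */
void cond_store_sm(const int *a, int n)
{
  int tmp = 0;
  bool stored = false;
  for (int i = 0; i < n; i++)
    if (a[i] < 0)
      {
        tmp = 0;
        stored = true;
      }
  if (stored)
    wte = tmp;
}

/* Unconditional store in a loop with two exits, as written.  */
int two_exit_orig(const int *a, int n)
{
  for (int i = 0; i < n; i++)
    {
      wte = i;
      if (a[i] == 0)
        return i;       /* early exit */
    }
  return -1;            /* normal exit */
}

/* Approximation after store motion: one copy of the store per exit.  */
int two_exit_sm(const int *a, int n)
{
  if (n <= 0)
    return -1;          /* loop never entered, so no store happens */
  int tmp;
  int i = 0;
  do
    {
      tmp = i;
      if (a[i] == 0)
        {
          wte = tmp;    /* store duplicated on the early exit */
          return i;
        }
      i++;
    }
  while (i < n);
  wte = tmp;            /* store duplicated on the normal exit */
  return -1;
}
```

In both pairs the loop body gets smaller while the function as a whole gains
instructions, which is the trade-off under discussion.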


+  /* Store motion decreases the size of the loop, but often increases
+     the size of the function.  If optimizing the function for size,
+     be careful about which REFs to move.  */
+  if (optimize_function_for_size_p (cfun))
+    {

Can you use optimize_loop_nest_for_size_p (loop) here?

Please also cache ref_always_accessed_p, it's quite an expensive
walk over all accesses.
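
Roughly the kind of caching meant, as a standalone sketch rather than actual
tree-ssa-loop-im.cc code.  The struct, the tri-state enum, walk_count and both
function names are invented for illustration, and the walk is only a stand-in
for the real predicate.  A real cache would presumably also need to be keyed
on the loop (and on whether a store is required), which this ignores.

```c
#include <stdbool.h>
#include <stddef.h>

/* Tri-state so "not computed yet" is distinguishable from "false".  */
enum cached_bool { CB_UNKNOWN = 0, CB_FALSE, CB_TRUE };

/* Hypothetical per-reference record.  */
struct mem_ref_sketch
{
  const bool *access_always_executed;  /* one entry per access */
  size_t n_accesses;
  enum cached_bool always_accessed;    /* cached answer */
};

int walk_count;  /* counts expensive walks, for demonstration only */

/* Stand-in for the expensive walk: true if some access is always
   executed.  */
static bool walk_all_accesses(const struct mem_ref_sketch *ref)
{
  walk_count++;
  for (size_t i = 0; i < ref->n_accesses; i++)
    if (ref->access_always_executed[i])
      return true;
  return false;
}

/* Cached wrapper: walks at most once per reference.  */
bool always_accessed_cached(struct mem_ref_sketch *ref)
{
  if (ref->always_accessed == CB_UNKNOWN)
    ref->always_accessed = walk_all_accesses(ref) ? CB_TRUE : CB_FALSE;
  return ref->always_accessed == CB_TRUE;
}
```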

I think the patch is OK with those two adjustments.

I suppose one should weigh the number of exits against the
number of accesses in the loop - those get replaced by
reg-reg copies.  For conditional accesses in the loop there's
the opportunity to if-convert some blocks - unsure if that
would save code-size though.
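
To make the weighing concrete, a toy estimate might look like the following.
This is purely illustrative: sm_size_delta, its parameters and the
one-unit-per-instruction costs are assumptions of mine, not anything in the
patch or in GCC, and it ignores follow-on effects such as the loop becoming
removable afterwards.

```c
#include <stdbool.h>

/* Hypothetical size estimate for moving one reference out of a loop.
   Returns an instruction-count delta: negative means store motion is
   expected to shrink the function, positive means it grows.  Every cost
   is an assumed unit cost of 1.  */
int sm_size_delta(int n_loads, int n_stores, int n_exits, bool conditional)
{
  /* Accesses in the loop body become reg-reg copies; assume each one
     saves a unit.  */
  int saving = n_loads + n_stores;

  /* One materialized store per exit edge.  */
  int cost = n_exits;

  /* If the loop reads the location, assume a preheader load is needed.  */
  if (n_loads > 0)
    cost += 1;

  /* Conditional stores: flag init, a flag set next to each store, and a
     flag test on each exit.  */
  if (conditional)
    cost += 1 + n_stores + n_exits;

  return cost - saving;
}
```

Under these assumptions three unconditional stores with one exit give -2, one
store with three exits gives +2, and a single conditional store with one exit
gives +3.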

One of the usual complaints with store motion is the
effect on register pressure and spilling that's eventually
caused.  A first step to address this would be to rank
store motion candidates based on (weighted?) number of
loads/stores eliminated, so one can still move the first N
important candidates.
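
A minimal sketch of such a ranking, assuming made-up weights and a made-up
candidate record (none of these names exist in GCC): score each candidate by
weighted loads/stores eliminated, sort descending, and keep the first N.

```c
#include <stdlib.h>

/* Hypothetical store motion candidate.  */
struct sm_candidate
{
  int id;
  int loads_removed;
  int stores_removed;
};

/* Assumed weights; real values would need tuning.  */
enum { LOAD_WEIGHT = 1, STORE_WEIGHT = 2 };

static int sm_score(const struct sm_candidate *c)
{
  return c->loads_removed * LOAD_WEIGHT + c->stores_removed * STORE_WEIGHT;
}

/* qsort comparator: higher score first.  */
static int by_score_desc(const void *pa, const void *pb)
{
  int sa = sm_score((const struct sm_candidate *) pa);
  int sb = sm_score((const struct sm_candidate *) pb);
  return (sa < sb) - (sa > sb);
}

/* Sort candidates in place by descending score and return how many of
   the leading entries should be moved (at most LIMIT).  */
size_t select_sm_candidates(struct sm_candidate *cands, size_t n,
                            size_t limit)
{
  qsort(cands, n, sizeof *cands, by_score_desc);
  return n < limit ? n : limit;
}
```

The limit would presumably come from some register pressure estimate; here it
is just a parameter.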

Richard.

>
> 2026-05-11  Roger Sayle  <[email protected]>
>
> gcc/ChangeLog
>         PR tree-optimization/112508
>         * tree-ssa-loop-im.cc (can_sm_ref_p): When optimizing for size, only
>         move unconditional stores from loops with a single exit.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/pr112508-1.c: New test case.
>         * gcc.target/i386/pr112508-2.c: Likewise.
>
>
> Thanks in advance,
> Roger
> --
>
