[PATCH] PR tree-optimization/112508: Intelligently throttle loop store motion with -Os

Roger Sayle Mon, 11 May 2026 01:45:16 -0700

This patch is my (initial) solution to PR tree-optimization/112508, the
observation that tree-ssa's loop store motion frequently increases the
size of a function and therefore plays poorly with -Os and -Oz.  Two
challenges complicate things, and prevent simply disabling this pass
(equivalent to -fno-move-loop-stores) from being an ideal solution.  The
first is that store motion is not universally bad: sometimes it helps, but
often it doesn't.  The second is that this pass also performs analyses and
other invariant motion that help later passes.


To demonstrate this delicate balance, consider the following two (tiny)
loops, inspired by CSiBE's Linux kernel benchmark.  Note that the two loops
aren't equivalent.

void loop1 (short i) {
  for (;i;i--)
    _wdtc.bit.WTE = 0;
}

void loop2 (short i) {
  do {
    _wdtc.bit.WTE = 0;
  } while (--i);
}

Currently on x86_64 with -Os, loop1 is 25 bytes and loop2 is 8 bytes.
Adding the -fno-move-loop-stores flag decreases loop1 to 17 bytes, but
increases loop2 to 13 bytes.  Without store motion we fail to eliminate the
loop in loop2.

The correct solution is to intelligently determine whether a particular
loop store motion is space saving or not.  This patch adds an extra clause
to the predicate can_sm_ref_p when optimizing the function for size, to
restrict store motion to unconditional stores in single exit loops.  Moving
a store by duplicating it on multiple loop exit edges can obviously
increase size.  Likewise, the additional logic (and flag variable) needed
when a store is conditionally executed (i.e. only executed sometimes)
requires extra instructions not present in the original code.
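
The two size hazards can be approximated at source level.  The sketch below
is hand-written and is not GCC output; wte is a hypothetical plain-byte
stand-in for _wdtc.bit.WTE, and all function names are made up for
illustration.  It shows a conditionally executed store needing a flag, a
two-exit loop needing two copies of the store, and the single-exit
unconditional case where one store after the loop is enough.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for _wdtc.bit.WTE; a plain global byte keeps
   the sketch free of bit-field details.  */
unsigned char wte;

/* Conditionally executed store: the body may run zero times.  */
void cond_orig (short i)
{
  for (; i; i--)
    wte = 0;
}

/* Hand-moved version: the store is sunk below the loop, guarded by a
   flag recording whether the body ran.  Setting and testing the flag
   is code the original did not contain.  */
void cond_moved (short i)
{
  bool stored = false;
  for (; i; i--)
    stored = true;
  if (stored)
    wte = 0;
}

/* Unconditional store in a loop with two exits (assumes n >= 1).  */
int exits_orig (const int *a, int n, int key)
{
  int i = 0;
  do
    {
      wte = 0;
      if (a[i] == key)
        return i;
    }
  while (++i < n);
  return -1;
}

/* Hand-moved version: one copy of the store per exit edge.  */
int exits_moved (const int *a, int n, int key)
{
  int i = 0;
  do
    {
      if (a[i] == key)
        {
          wte = 0;
          return i;
        }
    }
  while (++i < n);
  wte = 0;
  return -1;
}

/* Unconditional store, single exit (the shape of loop2).  */
void single_orig (short i)
{
  do
    wte = 0;
  while (--i);
}

/* Hand-moved version: a single store after a now-empty loop.  */
void single_moved (short i)
{
  do
    ;
  while (--i);
  wte = 0;
}

/* Self-check: each hand-moved version behaves like its original.  */
void check_sketch (void)
{
  static const int a[3] = { 4, 5, 6 };

  wte = 1; cond_orig (0);   assert (wte == 1);
  wte = 1; cond_moved (0);  assert (wte == 1);
  wte = 1; cond_orig (3);   assert (wte == 0);
  wte = 1; cond_moved (3);  assert (wte == 0);
  assert (exits_orig (a, 3, 5) == exits_moved (a, 3, 5));
  assert (exits_orig (a, 3, 9) == exits_moved (a, 3, 9));
  wte = 1; single_orig (2);  assert (wte == 0);
  wte = 1; single_moved (2); assert (wte == 0);
}
```

Only single_moved is no larger than its original at source level; the
other two moved versions trade a smaller loop body for extra code outside
it.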

With this patch, using just -Os, loop1 above is 17 bytes and loop2 is 8
bytes (i.e. the best of both worlds).
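
The decision behind those two numbers can be modelled in isolation.  This
is a toy sketch detached from GCC's data structures: struct sm_candidate
and sm_allowed_by_size_check are hypothetical names, and classifying
loop1's store as "not always stored" is an assumption based on its body
being able to run zero times.  It models only the added size clause, not
the other checks in can_sm_ref_p.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one candidate store; not a GCC type.  */
struct sm_candidate
{
  int num_exits;       /* exit edges of the enclosing loop */
  bool always_stored;  /* store runs whenever the loop is entered */
};

/* Toy model of the added clause: when optimizing for size, only allow
   unconditional stores in single exit loops.  */
bool sm_allowed_by_size_check (bool optimize_for_size,
                               struct sm_candidate c)
{
  if (!optimize_for_size)
    return true;
  return c.num_exits == 1 && c.always_stored;
}

/* Self-check using the assumed classification of loop1 and loop2.  */
void check_model (void)
{
  struct sm_candidate loop1 = { 1, false };  /* assumed: body may not run */
  struct sm_candidate loop2 = { 1, true };   /* do-while: body always runs */

  assert (!sm_allowed_by_size_check (true, loop1));
  assert (sm_allowed_by_size_check (true, loop2));
  assert (sm_allowed_by_size_check (false, loop1));
}
```

Under these assumptions loop1 keeps its store inside the loop, matching
the -fno-move-loop-stores size, while loop2 still has its store moved.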

Importantly, this change gives the store motion pass some logic that can be
tweaked and refined in future, if examples requiring more complex decision
making are discovered.  For example, when optimize_loop_for_size returns -Oz
(and the enclosing function or loop is less aggressively optimized for
size), this could potentially be interpreted as a hint to always perform
store motion, minimizing the size of the loop body, even at the expense of
larger total code size.  Some comments in the PR mention hot vs. cold basic
blocks, but this only affects performance, and isn't relevant for -Os,
i.e. (total) code size.


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32} with
no new failures.  Ok for mainline?

2026-05-11  Roger Sayle  <[email protected]>

gcc/ChangeLog
        PR tree-optimization/112508
        * tree-ssa-loop-im.cc (can_sm_ref_p): When optimizing for size, only
        move unconditional stores from loops with a single exit.

gcc/testsuite/ChangeLog
        * gcc.target/i386/pr112508-1.c: New test case.
        * gcc.target/i386/pr112508-2.c: Likewise.


Thanks in advance,
Roger
--

diff --git a/gcc/tree-ssa-loop-im.cc b/gcc/tree-ssa-loop-im.cc
index 72e19981698..26517c63591 100644
--- a/gcc/tree-ssa-loop-im.cc
+++ b/gcc/tree-ssa-loop-im.cc
@@ -3400,6 +3400,16 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
   if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
     return false;
 
+  /* Store motion decreases the size of the loop, but often increases
+     the size of the function.  If optimizing the function for size,
+     be careful about which REFs to move.  */
+  if (optimize_function_for_size_p (cfun))
+    {
+      if (!single_exit (loop)
+         || !ref_always_accessed_p (loop, ref, true))
+       return false;
+    }
+
   return true;
 }
 
diff --git a/gcc/testsuite/gcc.target/i386/pr112508-1.c b/gcc/testsuite/gcc.target/i386/pr112508-1.c
new file mode 100644
index 00000000000..3459ad9137e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr112508-1.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+typedef unsigned char IO_BYTE;
+
+typedef union{   /* Watch Dog */
+    IO_BYTE     byte;
+    struct{
+    IO_BYTE WT0 :1;
+    IO_BYTE WT1 :1;
+    IO_BYTE WTE :1;
+    IO_BYTE SRST :1;
+    IO_BYTE ERST :1;
+    IO_BYTE WRST :1;
+    IO_BYTE STBR :1;
+    IO_BYTE PONR :1;
+  }bit;
+  struct{
+    IO_BYTE WT :2;
+  }bitc;
+ }WDTCSTR;
+
+WDTCSTR _wdtc;
+#define WDTC_WTE _wdtc.bit.WTE
+
+
+void kick_WD (void)
+{
+  WDTC_WTE=0;
+}
+
+void wait (short i)
+{
+  for(;i;i--) kick_WD();
+}
+
+/* { dg-final { scan-assembler-not "movb\[ \\t\]+\\\$1, %al" } } */
+/* { dg-final { scan-assembler-not "testb\[ \\t\]+%al, %al" } } */
+
diff --git a/gcc/testsuite/gcc.target/i386/pr112508-2.c b/gcc/testsuite/gcc.target/i386/pr112508-2.c
new file mode 100644
index 00000000000..126f4129dfe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr112508-2.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+typedef unsigned char IO_BYTE;
+
+typedef union{   /* Watch Dog */
+    IO_BYTE     byte;
+    struct{
+    IO_BYTE WT0 :1;
+    IO_BYTE WT1 :1;
+    IO_BYTE WTE :1;
+    IO_BYTE SRST :1;
+    IO_BYTE ERST :1;
+    IO_BYTE WRST :1;
+    IO_BYTE STBR :1;
+    IO_BYTE PONR :1;
+  }bit;
+  struct{
+    IO_BYTE WT :2;
+  }bitc;
+ }WDTCSTR;
+
+WDTCSTR _wdtc;
+#define WDTC_WTE _wdtc.bit.WTE
+
+void kick_WD (void)
+{
+  WDTC_WTE=0;
+}
+
+void wait (short i)
+{
+  do { kick_WD(); } while (--i);
+}
+
+/* { dg-final { scan-assembler-not "movzbl" } } */
+/* { dg-final { scan-assembler-not "andl" } } */
+
