On 4/23/2025 6:01 PM, Artemiy Volkov wrote: > Hi all, > > sending a v2 of > https://gcc.gnu.org/pipermail/gcc-patches/2025-April/680893.html after > fixing several issues with the original patch. Namely, the changes > since v1 are: > > - Remove the call to df_finish_pass () at the end of split_exit () and > simply restore the previous value of the DF flags instead to avoid UAF > errors. > - Remove the call to df_analyze () at the end of split_exit () to save > compilation time. > - Under -O1, always call df_remove_problem (df_live) whenever there was > a corresponding call to df_live_add_problem (). > - Restore alphabetical order in common.opt{,.urls}. > > Could anyone please review/commit on my behalf if OK? > > Thanks, > Artemiy
Ping. Thanks, Artemiy > > -- >8 -- > > Consider the (lightly modified) core_list_reverse () function from Coremark: > > struct list_node * > core_list_reverse (struct list_node *list) > { > struct list_node *next = 0, *tmp; > #pragma GCC unroll 4 > while (list) > { > tmp = list->next; > list->next = next; > next = list; > list = tmp; > } > > return next; > } > > On AArch64, this compiles to the following: > > core_list_reverse: > cbz x0, .L2 > ldr x1, [x0] > mov x6, 0 > str x6, [x0] > mov x3, x0 > cbz x1, .L2 > .L4: > ldr x2, [x1] > str x3, [x1] > mov x0, x1 > cbz x2, .L2 > ldr x4, [x2] > str x1, [x2] > mov x0, x2 > cbz x4, .L2 > ... > mov x0, x5 > mov x3, x0 > ldr x1, [x0] > str x6, [x0] > cbnz x1, .L4 > .L2: > ret > > The 'next' variable lives in the x0 register, which is maintained by the > "mov x0, xR" instruction at every unrolled iteration. However, this can > be improved by removing the instruction and splitting the loop exit into > multiple exits, each corresponding to the hard register in which the > 'next' variable is at a particular iteration, so that the code looks > like: > > core_list_reverse: > cbz x0, .L2 > mov x1, 0 > .L3: > ldr x2, [x0] > str x1, [x0] > cbz x2, .L2 > ldr x3, [x2] > str x0, [x2] > cbz x3, .L13 > ... > ldr x0, [x1] > str x3, [x1] > cbnz x0, .L3 > mov x0, x1 > .L2: > ret > .L13: > mov x0, x2 > ret > .L14: > mov x0, x3 > ret > > This patch implements this transformation by splitting variables defined > in the loop and live at the (single) exit BB of an uncountable loop, > replacing each of those with a unique temporary pseudo inside the loop > and assigning these pseudos back to the original variables in the newly > split exit BBs (one per unrolled iteration). (This is the behavior that > the GIMPLE unroller would exhibit were it capable of handling > uncountable loops.) Afterwards, the cprop pass is able to propagate the > (split) loop variables and carry the move instructions out of the loop. > > This change is primarily intended for small in-order cores on which the > latency of the move isn't hidden by the latency of the load. The > optimization is guarded by the new -fsplit-exit-in-unroller flag that is > on by default. The flag has been documented in doc/invoke.texi, and the > common.opt.urls file has been regenerated. > > The change has been bootstrapped and regtested on i386, x86_64, and > aarch64, and additionally regtested on riscv32. Two new testcases have > been added to demonstrate operation and interaction with > -fvariable-expansion-in-unroller. On a Cortex-A53, this patch leads to > an ~0.5% improvement for Coremark and SPECINT2006 geomean (the only > regression being 483.xalancbmk), when compiled with -O2 > -funroll-all-loops. The compile-time increase is ~0.4%. > > PR rtl-optimization/119681 > > gcc/ChangeLog: > > * common.opt (-fsplit-exit-in-unroller): New flag. > * common.opt.urls: Regenerate. > * doc/invoke.texi (-fsplit-exit-in-unroller): Document it. > * loop-unroll.cc (split_exit): New function. > (regno_defined_inside_loop_p): New predicate. > (unroll_loop_stupid): Call split_exit (). > (has_use_with_multiple_defs_p): New predicate. > > gcc/testsuite/ChangeLog: > > * gcc.dg/loop-exit-split-1.c: New test. > * gcc.dg/loop-exit-split-2.c: New test. > > Signed-off-by: Artemiy Volkov <arte...@synopsys.com> > --- > gcc/common.opt | 4 + > gcc/common.opt.urls | 3 + > gcc/doc/invoke.texi | 12 +- > gcc/loop-unroll.cc | 196 +++++++++++++++++++++++ > gcc/testsuite/gcc.dg/loop-exit-split-1.c | 31 ++++ > gcc/testsuite/gcc.dg/loop-exit-split-2.c | 33 ++++ > 6 files changed, 277 insertions(+), 2 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/loop-exit-split-1.c > create mode 100644 gcc/testsuite/gcc.dg/loop-exit-split-2.c > > diff --git a/gcc/common.opt b/gcc/common.opt > index 88d987e6ab1..b9488a916a1 100644 > --- a/gcc/common.opt > +++ b/gcc/common.opt > @@ -2944,6 +2944,10 @@ fsingle-precision-constant > Common Var(flag_single_precision_constant) Optimization > Convert floating point constants to single precision constants. > > +fsplit-exit-in-unroller > +Common Var(flag_split_exit_in_unroller) Init(1) Optimization > +Split exit basic blocks of uncountable loops. > + > fsplit-ivs-in-unroller > Common Var(flag_split_ivs_in_unroller) Init(1) Optimization > Split lifetimes of induction variables when loops are unrolled. > diff --git a/gcc/common.opt.urls b/gcc/common.opt.urls > index 00775118e6f..516a81cb28e 100644 > --- a/gcc/common.opt.urls > +++ b/gcc/common.opt.urls > @@ -1328,6 +1328,9 @@ > UrlSuffix(gcc/Optimize-Options.html#index-fno-signed-zeros) > fsingle-precision-constant > UrlSuffix(gcc/Optimize-Options.html#index-fsingle-precision-constant) > > +fsplit-exit-in-unroller > +UrlSuffix(gcc/Optimize-Options.html#index-fsplit-exit-in-unroller) > + > fsplit-ivs-in-unroller > UrlSuffix(gcc/Optimize-Options.html#index-fsplit-ivs-in-unroller) > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > index 020442aa032..0c85ce6ed3e 100644 > --- a/gcc/doc/invoke.texi > +++ b/gcc/doc/invoke.texi > @@ -637,8 +637,8 @@ Objective-C and Objective-C++ Dialects}. > -fsel-sched-pipelining -fsel-sched-pipelining-outer-loops > -fsemantic-interposition -fshrink-wrap -fshrink-wrap-separate > -fsignaling-nans > --fsingle-precision-constant -fsplit-ivs-in-unroller -fsplit-loops > --fsplit-paths > +-fsingle-precision-constant -fsplit-exit-in-unroller -fsplit-ivs-in-unroller > +-fsplit-loops -fsplit-paths > -fsplit-wide-types -fsplit-wide-types-early -fssa-backprop -fssa-phiopt > -fstdarg-opt -fstore-merging -fstrict-aliasing -fipa-strict-aliasing > -fthread-jumps -ftracer -ftree-bit-ccp > @@ -14484,6 +14484,14 @@ Split paths leading to loop backedges. This can > improve dead code > elimination and common subexpression elimination. This is enabled by > default at @option{-O3} and above. > > +@opindex fsplit-exit-in-unroller > +@item -fsplit-exit-in-unroller > +This option allows converting a single-exit uncountable loop into a > +multiple-exit one while splitting variables defined in the loop to improve > +propagation opportunities, saving move instructions inside the loop body. > + > +This optimization is enabled by default. > + > @opindex fsplit-ivs-in-unroller > @item -fsplit-ivs-in-unroller > Enables expression of values of induction variables in later iterations > diff --git a/gcc/loop-unroll.cc b/gcc/loop-unroll.cc > index b4952055318..822b3338af4 100644 > --- a/gcc/loop-unroll.cc > +++ b/gcc/loop-unroll.cc > @@ -35,6 +35,7 @@ along with GCC; see the file COPYING3. If not see > #include "dojump.h" > #include "expr.h" > #include "dumpfile.h" > +#include "df.h" > > /* This pass performs loop unrolling. We only perform this > optimization on innermost loops (with single exception) because > @@ -172,6 +173,9 @@ static void unroll_loop_runtime_iterations (class loop *); > static struct opt_info *analyze_insns_in_loop (class loop *); > static void opt_info_start_duplication (struct opt_info *); > static void apply_opt_in_copies (struct opt_info *, unsigned, bool, bool); > +static void split_exit (class loop *); > +static bool regno_defined_inside_loop_p (unsigned, class loop *); > +static bool has_use_with_multiple_defs_p (df_link *); > static void free_opt_info (struct opt_info *); > static struct var_to_expand *analyze_insn_to_expand_var (class loop*, > rtx_insn *); > static bool referenced_in_one_insn_in_loop_p (class loop *, rtx, int *); > @@ -1261,6 +1265,10 @@ unroll_loop_stupid (class loop *loop) > free_opt_info (opt_info); > } > > + /* Create multiple exits to uncover more propagation opportunities. */ > + if (flag_split_exit_in_unroller) > + split_exit (loop); > + > if (desc->simple_p) > { > /* We indeed may get here provided that there are nontrivial > assumptions > @@ -2121,6 +2129,194 @@ apply_opt_in_copies (struct opt_info *opt_info, > } > } > > +/* Return TRUE if there is an instruction defining REGNO > + inside loop LOOP. */ > + > +static bool > +regno_defined_inside_loop_p (unsigned regno, class loop *loop) > +{ > + for (df_ref def = DF_REG_DEF_CHAIN (regno); def; def = DF_REF_NEXT_REG > (def)) > + { > + if (DF_REF_IS_ARTIFICIAL (def)) > + continue; > + > + if (!flow_bb_inside_loop_p (loop, DF_REF_BB (def))) > + continue; > + > + if (dump_file) > + { > + fprintf (dump_file, ";; Found definition of "); > + print_simple_rtl (dump_file, regno_reg_rtx[regno]); > + fprintf (dump_file, "\n;; in BB %d (loop %d)\n", > + DF_REF_BB (def)->index, loop->num); > + } > + > + return true; > + } > + > + return false; > +} > + > +static bool > +has_use_with_multiple_defs_p (df_link *def) > +{ > + for (df_link *use = DF_REF_CHAIN (def->ref); > + use; > + use = use->next) > + if (DF_REF_CHAIN (use->ref)->next) > + return true; > + > + return false; > +} > + > +/* Determine if LOOP has multiple exits to the same BB (happens after > + unroll_loop_stupid ()), and if so, try to split DU chains for pseudos > + defined inside the loop and live at the entry to the exit BB. */ > +static void > +split_exit (class loop *loop) > +{ > + basic_block exit_bb = NULL; > + basic_block new_bb; > + edge e; > + unsigned j; > + unsigned regno; > + bitmap_iterator it; > + auto_bitmap pseudos_to_replace; > + auto_vec<basic_block> new_exit_bbs; > + auto_vec<edge> exit_edges; > + rtx_insn *insn; > + > + /* Sanity check: after unrolling, the loop should have more than > + one exit, all to the same basic block. */ > + exit_edges = get_loop_exit_edges (loop); > + if (exit_edges.length () <= 1) > + return; > + > + FOR_EACH_VEC_ELT (exit_edges, j, e) > + { > + if (!exit_bb) > + exit_bb = e->dest; > + else if (exit_bb != e->dest) > + return; > + } > + > + if (dump_file) > + fprintf (dump_file, "\n;; Splitting loop exit for loop %d\n", loop->num); > + > + /* Construct a bitmap of pseudos that are live in at the exit BB > + and are defined within the loop. */ > + if (optimize == 1) > + df_live_add_problem (); > + > + df_live_set_all_dirty (); > + df_set_blocks (NULL); > + df_analyze (); > + > + EXECUTE_IF_SET_IN_BITMAP (DF_LIVE_IN (exit_bb), > + FIRST_PSEUDO_REGISTER, regno, it) > + if (regno_defined_inside_loop_p (regno, loop)) > + { > + if (dump_file) > + fprintf (dump_file, > + ";; added %u to list of pseudos to replace\n", regno); > + > + bitmap_set_bit (pseudos_to_replace, regno); > + } > + > + if (optimize == 1) > + df_remove_problem (df_live); > + > + if (bitmap_empty_p (pseudos_to_replace)) > + { > + if (dump_file) > + fprintf (dump_file, ";; no pseudos to replace, exiting\n"); > + return; > + } > + > + /* Split edge + insert copy instructions for each pseudo in > + pseudos_to_replace. */ > + FOR_EACH_VEC_ELT (exit_edges, j, e) > + { > + start_sequence (); > + > + EXECUTE_IF_SET_IN_BITMAP (pseudos_to_replace, > + FIRST_PSEUDO_REGISTER, regno, it) > + emit_insn (gen_move_insn (regno_reg_rtx[regno], > + regno_reg_rtx[regno])); > + > + insn = get_insns (); > + end_sequence (); > + > + new_bb = split_edge_and_insert (e, insn); > + new_exit_bbs.safe_push (new_bb); > + > + if (dump_file) > + fprintf (dump_file, ";; Split exit edge %d->%d\n", > + e->src->index, exit_bb->index); > + } > + > + /* Redo chain construction with the new exit BBs. */ > + df_remove_problem (df_chain); > + > + df_set_flags (DF_NO_HARD_REGS); > + df_chain_add_problem (DF_DU_CHAIN + DF_UD_CHAIN); > + > + df_analyze (); > + > + /* Iterate over the new exit BBs, which by this point contain only > + instructions of the form "mov Rx, Rx". Try to rename the source > + pseudo together with the whole DU-chain of its definition. */ > + FOR_EACH_VEC_ELT (new_exit_bbs, j, new_bb) > + FOR_BB_INSNS (new_bb, insn) > + { > + rtx set, old_reg, new_reg; > + df_ref use; > + df_link *def, *du; > + > + if (!INSN_P (insn)) > + continue; > + > + /* Each insn is a single set of a pseudo to itself. Check > + conditions required for renaming the use operand of the insn. */ > + set = single_set (insn); > + gcc_assert (set); > + > + use = df_single_use (DF_INSN_INFO_GET (insn)); > + gcc_assert (use); > + > + /* There must be exactly one reachable definition. */ > + def = DF_REF_CHAIN (use); > + if (!def || !def->ref || DF_REF_IS_ARTIFICIAL (def->ref) || def->next) > + continue; > + > + /* The definition must be the only one reachable by all its uses. */ > + if (has_use_with_multiple_defs_p (def)) > + continue; > + > + /* If the conditions above are met, replace the def and all its uses > + with a newly allocated pseudo. */ > + old_reg = SET_DEST (set); > + new_reg = gen_reg_rtx (GET_MODE (old_reg)); > + > + if (dump_file) > + { > + fprintf (dump_file, ";; replacing DU-chain for reg %d (def: ", > + REGNO (old_reg)); > + df_refs_chain_dump (def->ref, false, dump_file); > + fprintf (dump_file, ") with reg %d\n", REGNO (new_reg)); > + } > + > + for (du = DF_REF_CHAIN (def->ref); du; du = du->next) > + validate_replace_src_group (old_reg, new_reg, > + DF_REF_INSN (du->ref)); > + > + validate_replace_rtx_subexp (old_reg, new_reg, > + DF_REF_INSN (def->ref), DF_REF_REAL_LOC (def->ref)); > + } > + > + df_clear_flags (DF_NO_HARD_REGS); > +} > + > /* Release OPT_INFO. */ > > static void > diff --git a/gcc/testsuite/gcc.dg/loop-exit-split-1.c > b/gcc/testsuite/gcc.dg/loop-exit-split-1.c > new file mode 100644 > index 00000000000..a732ced44ad > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/loop-exit-split-1.c > @@ -0,0 +1,31 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -funroll-all-loops -fno-variable-expansion-in-unroller > -fsplit-exit-in-unroller -fdump-rtl-loop2_unroll-details" } */ > + > +/* Reduced from Coremark's core_list_reverse () function. */ > + > +struct list_node { > + struct list_node *next; > +}; > + > +struct list_node * > +list_reverse (struct list_node *list) > +{ > + struct list_node *next = 0, *tmp; > + #pragma GCC unroll 4 > + while (list) > + { > + tmp = list->next; > + list->next = next; > + next = list; > + list = tmp; > + } > + > + return next; > +} > + > +/* { dg-final { scan-rtl-dump-times "15:.*: optimized: loop unrolled 3 > times" 1 "loop2_unroll" } } */ > +/* { dg-final { scan-rtl-dump-times ";; Splitting loop exit for loop 1" 1 > "loop2_unroll" } } */ > +/* { dg-final { scan-rtl-dump-times ";; Found definition of \\\(reg \[0-9\]+ > \\\[ list \\\]\\\)" 1 "loop2_unroll" } } */ > +/* { dg-final { scan-rtl-dump-times ";; added \[0-9\]+ to list of pseudos to > replace" 1 "loop2_unroll" } } */ > +/* { dg-final { scan-rtl-dump-times ";; Split exit edge " 4 "loop2_unroll" } > } */ > +/* { dg-final { scan-rtl-dump-times ";; replacing DU-chain for reg \[0-9\]+" > 3 "loop2_unroll" } } */ > diff --git a/gcc/testsuite/gcc.dg/loop-exit-split-2.c > b/gcc/testsuite/gcc.dg/loop-exit-split-2.c > new file mode 100644 > index 00000000000..6a3f686a4aa > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/loop-exit-split-2.c > @@ -0,0 +1,33 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -funroll-all-loops -fvariable-expansion-in-unroller > -fsplit-exit-in-unroller -fdump-rtl-loop2_unroll-details -ffast-math" } */ > +/* { dg-require-effective-target hard_float } */ > + > +/* This testcase combines loop-exit-split1.c and var-expand1.c. */ > + > +struct list_node { > + struct list_node *next; > + float data; > +}; > + > +float > +list_reverse (struct list_node *list) > +{ > + struct list_node *next = 0, *tmp; > + float accum = 0; > + #pragma GCC unroll 4 > + while (list) > + { > + accum += list->data; > + list = list->next; > + } > + > + return accum; > +} > + > +/* { dg-final { scan-rtl-dump-times "Expanding Accumulator" 1 "loop2_unroll" > } } */ > +/* { dg-final { scan-rtl-dump-times "18:.*: optimized: loop unrolled 3 > times" 1 "loop2_unroll" } } */ > +/* { dg-final { scan-rtl-dump-times ";; Splitting loop exit for loop 1" 1 > "loop2_unroll" } } */ > +/* { dg-final { scan-rtl-dump-times ";; Found definition of \\\(reg > \[0-9\]+" 2 "loop2_unroll" } } */ > +/* { dg-final { scan-rtl-dump-times ";; added \[0-9\]+ to list of pseudos to > replace" 2 "loop2_unroll" } } */ > +/* { dg-final { scan-rtl-dump-times ";; Split exit edge " 4 "loop2_unroll" } > } */ > +/* { dg-final { scan-rtl-dump-times ";; replacing DU-chain for reg \[0-9\]+" > 4 "loop2_unroll" } } */