On Mon, Jun 30, 2025 at 1:05 PM Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Christoph Müllner <christoph.muell...@vrull.eu> writes:
> > This patch converts the fold-mem-offsets pass from DF to RTL-SSA.
> > Along with this conversion, the way the pass collects information
> > was completely reworked. Instead of visiting each instruction multiple
> > times, this is now done only once.
> >
> > Most significant changes are:
> > * The pass operates mainly on insn_info objects from RTL-SSA.
> > * Single iteration over all nondebug INSNs for identification
> >   of fold-mem-roots. Then walk of the fold-mem-roots' DEF-chain
> >   to collect foldable constants.
> > * The class fold_mem_info holds vectors for the DEF-chain of
> >   the to-be-folded INSNs (fold_agnostic_insns, which don't need
> >   to be adjusted, and fold_insns, which need their constant to
> >   be set to zero).
> > * Introduction of a single-USE mode, which only collects DEFs
> >   that have a single USE and therefore are safe to transform
> >   (the fold-mem-root will be the final USE). This mode is fast
> >   and will always run (unless disabled via -fno-fold-mem-offsets).
> > * Introduction of a multi-USE mode, which allows DEFs to have
> >   multiple USEs, but all USEs must be part of any fold-mem-root's
> >   DEF-chain. The analysis of all USEs is expensive and therefore
> >   this mode is disabled for highly connected CFGs. Note that
> >   multi-USE mode will miss some opportunities that the single-USE
> >   mode finds (e.g. multi-USE mode fails for fold-mem-offsets-3.c).
> >
> > The following testing was done:
> > * Bootstrapped and regtested on aarch64-linux and x86-64-linux.
> > * Regtested on riscv64-linux.
> > * SPEC CPU 2017 tested on aarch64 and riscv64-linux.
> >
> > The number of applied optimizations of different versions/modes
> > of fold-mem-offsets in SPEC CPU2017 on RISC-V rv64gc_zba_zbb_zbs
> > is as follows:
> >
> > Upstream:
> > 500.perlbench_r: 169
> > 502.gcc_r: 624
> > 520.omnetpp_r: 1301
> > 523.xalancbmk_r: 23
> > 525.x264_r: 705
> > 531.deepsjeng_r: 36
> > 541.leela_r: 19
> > 548.exchange2: 90
> > 557.xz_r: 11
> > SUM: 2151
> >
> > New single-USE:
> > 500.perlbench_r: 70
> > 502.gcc_r: 263
> > 520.omnetpp_r: 1100
> > 523.xalancbmk_r: 10
> > 525.x264_r: 95
> > 531.deepsjeng_r: 19
> > 541.leela_r: 252
> > 548.exchange2: 13
> > 557.xz_r: 11
> > SUM: 1833
> >
> > New multi-USE:
> > 500.perlbench_r: 186
> > 502.gcc_r: 744
> > 520.omnetpp_r: 1187
> > 523.xalancbmk_r: 22
> > 525.x264_r: 985
> > 531.deepsjeng_r: 21
> > 541.leela_r: 87
> > 548.exchange2: 63
> > 557.xz_r: 23
> > SUM: 3318
> >
> > New single-USE then multi-USE:
> > 500.perlbench_r: 192
> > 502.gcc_r: 761
> > 520.omnetpp_r: 1673
> > 523.xalancbmk_r: 22
> > 525.x264_r: 995
> > 531.deepsjeng_r: 21
> > 541.leela_r: 252
> > 548.exchange2: 63
> > 557.xz_r: 23
> > SUM: 4002
> >
> > A compile time analysis with `/bin/time -v ./install/usr/local/bin/gcc -O2 all.i`
> > (all.i from PR117922) shows:
> > * -fno-fold-mem-offsets: 289 s (user time) / 15572436 kBytes (max resident set size)
> > * -ffold-mem-offsets: 339 s (user time) / 23606516 kBytes (max resident set size)
> > Adding -fexpensive-optimizations to enable multi-USE mode does not have
> > an impact on the duration or the memory footprint.
> >
> > SPEC CPU 2017 showed no significant performance impact on aarch64-linux.
> >
> > gcc/ChangeLog:
> >
> > 	PR rtl-optimization/117922
> > 	* fold-mem-offsets.cc (INCLUDE_ALGORITHM): Added definition.
> > 	(INCLUDE_FUNCTIONAL): Likewise.
> > 	(INCLUDE_ARRAY): Likewise.
> > 	(class pass_fold_mem_offsets): Moved to bottom of file.
> > 	(get_fold_mem_offset_root): Converted to RTL-SSA.
> > 	(get_single_def_in_bb): Converted to RTL-SSA.
> > 	(get_uses): New.
> > 	(has_foldable_uses_p): Converted to RTL-SSA.
> > 	(fold_offsets): Converted to RTL-SSA.
> > 	(fold_offsets_1): Converted to RTL-SSA.
> > 	(get_fold_mem_root): Removed.
> > 	(do_check_validity): New.
> > 	(do_analysis): Removed.
> > 	(insn_uses_not_in_bitmap): New.
> > 	(do_fold_info_calculation): Removed.
> > 	(drop_unsafe_candidates): New.
> > 	(do_commit_offset): Converted to RTL-SSA.
> > 	(compute_validity_closure): Removed.
> > 	(do_commit_insn): Changed to change INSN in place.
> > 	(fold_mem_offsets_single_use): New.
> > 	(fold_mem_offsets_multi_use): New.
> > 	(pass_fold_mem_offsets::execute): Moved to bottom of file.
> > 	(fold_mem_offsets): New.
> >
> > Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu>
> > ---
> >  gcc/fold-mem-offsets.cc | 1138 ++++++++++++++++++++-------------------
> >  1 file changed, 596 insertions(+), 542 deletions(-)
>
> Thanks for doing this. I have some specific comments below, but my
> main general comment is this:
>
> When using rtl-ssa, changes to instructions should happen through the
> rtl-ssa changes framework, since that keeps the metadata up-to-date.
> It also ensures that (for example) uses in debug instructions are not
> accidentally left behind.
>
> As part of this, I think it would be better to make all the rtl insn
> changes tentatively *before* checking validity, rather than building
> up a list of rtl insns that needs to be changed later. That is,
> each attempt like:
>
>     /* Walk DEF-chain and collect info->fold_insns,
>        info->fold_agnostic_insns
>        and the resulting offset. */
>     info->added_offset = fold_offsets (info->insn, info->reg, info, false);
>     if (info->added_offset == 0 || !do_check_validity (info))
>       {
>         delete info;
>         continue;
>       }
>
> should first create an insn_change_watermark, then use validate_change
> & co (from recog.cc) to change the instructions as soon as the desired
> changes are known, passing true to the in_group parameter. Then, if the
> final validity checks fail, the code can continue without committing the
> changes, so that everything reverts back to the way that it was before.
Thanks for the hints. I tried to find a design that builds on this,
but I don't see how this approach could work here.
For the single-use case, we don't need to apply changes speculatively
(we have all the information to make a final decision).
For the multi-use case, we first need to ensure that all changes affect
only instructions that occur in some fold-mem-offset sequence (otherwise
the changes won't be safe). The function drop_unsafe_candidates() may
drop some of the fold-mem-offset sequences if unsafe changes would be
made. Once this function is done, we know which instructions to change.
If we changed the code speculatively, we would need to roll back
individual changes, and that's not supported by recog.

> This also makes it easier to get the target conditions right. The patch
> checks whether something is a member of GENERAL_REGS and checks whether
> the new memory meets memory_address_addr_space_p. But what ultimately
> matters is whether the new instruction is recognised and whether it
> matches its constraints. In the case of rtl-ssa, this is all done by
> rtl-ssa's version of recog. The register class checks and the
> memory_address_addr_space_p checks should then not be needed,
> and could actively prevent valid optimisations.

In order to avoid low-level instruction manipulation, I changed the code
to use recog (validate_change/verify_changes/cancel_changes/
confirm_change_group); a sketch of this commit step follows below.
However, I don't see how I can use RTL-SSA instead, as the code changes
only:
* replace a constant N by 0 in arithmetic instructions
* replace a constant I by J in memory accesses
I.e., there are no changes to the SSA information.
We may delete self-assignments, which would be the only case where
I can see RTL-SSA could be used.

> Using the rtl-ssa framework also means that changes_are_worthwhile
> can be used to test whether each set of changes is beneficial.
>
> Some other general comments:
>
> fold_mem_offsets <-> fold_mem_offsets_1 seems to have unbounded mutual
> recursion, which could lead to stack overflow on hosts with small stacks.
> It would probably be better to use a worklist instead.

fold_offsets () exits for DEFs that are not in the same BB, so the
recursion depth is bounded by the number of instructions in the BB.

> Alternatively, and perhaps more simply, the pass could record offset
> information as part of the current instruction walk, storing the result
> in a hash_map. The offset information for the current instruction
> (that is, the information that gets entered into the hash_map for
> the current instruction) would then depend on the contents of that
> instruction and the current contents of the hash_map. That's just a
> suggestion though; gathering the data lazily is ok with me too if
> gathering it eagerly is likely to generate a lot of redundant work.

This would result in a hash map with one entry per instruction of a BB,
and we would have to analyse every instruction (instead of only the
subset found by following the DEF-chains).
I would prefer to keep the lazy approach.
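Coming back to the recog-based commit step mentioned above: once a
candidate survives drop_unsafe_candidates(), the changes are applied
roughly as follows. This is a simplified, untested sketch only meant to
illustrate the flow; fold_insns, mem_insn, mem_offset_loc, old_offset
and added_offset are illustrative stand-ins for the actual fields of
fold_mem_info:

  /* Apply all changes of one candidate tentatively, as one group.
     validate_change with in_group == true modifies the insn in place,
     but records the old contents so the group can still be undone.  */
  int start = num_validated_changes ();

  /* Set the folded constants to zero, e.g. turn
     (set (reg X) (plus (reg Y) (const_int 16))) into
     (set (reg X) (plus (reg Y) (const_int 0))).  */
  for (rtx_insn *insn : fold_insns)
    {
      rtx set = single_set (insn);
      validate_change (insn, &XEXP (SET_SRC (set), 1), const0_rtx, true);
    }

  /* Move the accumulated offset into the memory access of the
     fold-mem-root.  */
  rtx new_off = gen_int_mode (old_offset + added_offset, Pmode);
  validate_change (mem_insn, mem_offset_loc, new_off, true);

  /* Re-recognize the changed insns and check their constraints;
     then either keep or revert the whole group.  */
  if (verify_changes (start))
    confirm_change_group ();
  else
    cancel_changes (start);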
> > diff --git a/gcc/fold-mem-offsets.cc b/gcc/fold-mem-offsets.cc
> > index c1c94472a071..0e777c32ee31 100644
> > --- a/gcc/fold-mem-offsets.cc
> > +++ b/gcc/fold-mem-offsets.cc
> > [...]
> > /* This pass tries to optimize memory offset calculations by moving
> >    constants
>
> Question about a pre-existing comment, sorry:
> For example it can transform code like this:
>   add t4, sp, 16
>   add t2, a6, t4
>   shl t3, t2, 1
>   ld a2, 0(t3)
>   add a2, 1
>   sd a2, 8(t2)
> into the following (one instruction less):
>   add t2, a6, sp
>   shl t3, t2, 1
>   ld a2, 32(t3)
>   add a2, 1
>   sd a2, 24(t2)
> Is that code possible in practice? It seems to be multiplying a pointer
> by 2 (that is, the base seems to be sp * 2 rather than sp).

I don't have code that reproduces this sequence, so I can't provide an
answer. I'll go ahead and replace the example with one that is part of
the test suite.

> > [...]
> > +  /* Don't fold when we have unspec / volatile. */
> > +  if (GET_CODE (src) == UNSPEC
> > +      || GET_CODE (src) == UNSPEC_VOLATILE
> > +      || GET_CODE (dest) == UNSPEC
> > +      || GET_CODE (dest) == UNSPEC_VOLATILE)
>
> This is pre-existing, sorry, but: destinations can't be UNSPECs or
> UNSPEC_VOLATILEs, so I think we should remove the last two lines.

Will be done.

> > [...]
> > +/* Get the single reaching definition of REG used in INSN within a BB.
> > +   Return the definition or NULL if there's no definition with the desired
> > +   criteria. If SINGLE_USE is set to true the DEF must have exactly one
> > +   USE resulting in a 1:1 DEF-USE relationship. If set to false, then a
> > +   1:n DEF-USE relationship is accepted and the caller must take care to
> > +   ensure all USEs are safe for folding. */
> > +static set_info *
> > +get_single_def_in_bb (insn_info *insn, rtx reg, bool single_use)
> >  {
> > -  df_ref use;
> > -  struct df_link *ref_chain, *ref_link;
> > -
> > -  FOR_EACH_INSN_USE (use, insn)
> > +  /* Get the use_info of the base register. */
> > +  for (use_info *use : insn->uses ())
> >     {
> > -      if (GET_CODE (DF_REF_REG (use)) == SUBREG)
> > +      /* Other USEs can be ignored and multiple equal USEs are fine. */
> > +      if (use->regno () != REGNO (reg))
> > +	continue;
>
> This can use:
>
>   use_info *use = find_access (insn->uses (), REGNO (reg));
>   if (!use)
>     return NULL;
>
> and so avoid the loop.

Will be done.

> > +
> > +      /* Don't handle subregs for now. */
> > +      if (use->includes_subregs ())
> >  	return NULL;
> > -      if (REGNO (DF_REF_REG (use)) == REGNO (reg))
> > -	break;
> > -    }
> >
> > -  if (!use)
> > -    return NULL;
> > +      /* Get the DEF of the register. */
> > +      set_info *def = use->def ();
> > +      if (!def)
> > +	return NULL;
> >
> > -  ref_chain = DF_REF_CHAIN (use);
> > +      /* Limit the amount of USEs of DEF to 1. */
> > +      auto iter = def->nondebug_insn_uses ();
> > +      if (single_use && ++iter.begin () != iter.end ())
>
> This can use def->single_nondebug_use ()

Will be done.

> > +	return NULL;
> >
> > -  if (!ref_chain)
> > -    return NULL;
> > +      /* Don't handle multiregs for now. */
> > +      if (def->includes_multiregs ())
> > +	return NULL;
> >
> > -  for (ref_link = ref_chain; ref_link; ref_link = ref_link->next)
> > -    {
> > -      /* Problem getting some definition for this instruction. */
> > -      if (ref_link->ref == NULL)
> > +      /* Only consider uses whose definition comes from a real instruction
> > +	 and has no notes attached. */
> > +      insn_info *def_insn = def->insn ();
> > +      if (def_insn->is_artificial () || def_insn->first_note ())
> >  	return NULL;
>
> What's the reason for the first_note check?

That's a mistake. I mixed up insn_notes and REG_NOTES.
The upstream code does this in get_uses():

  /* We do not handle REG_EQUIV/REG_EQ notes for now. */
  if (DF_REF_FLAGS (ref_link->ref) & DF_REF_IN_NOTE)
    return NULL;

I'll filter out such cases in get_single_def_in_bb() with:

  find_reg_note (def_rtl, REG_EQUIV, NULL_RTX)
  find_reg_note (def_rtl, REG_EQUAL, NULL_RTX)
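Putting the agreed changes together (find_access, single_nondebug_use,
and the REG_EQUIV/REG_EQUAL filter), the lookup part of
get_single_def_in_bb() would then start roughly like this. This is an
untested sketch of the combined result, not the final code:

  /* Get the use of REG in INSN without iterating over all uses.  */
  use_info *use = find_access (insn->uses (), REGNO (reg));
  if (!use || use->includes_subregs ())
    return NULL;

  /* Get the definition that reaches this use.  */
  set_info *def = use->def ();
  if (!def || def->includes_multiregs ())
    return NULL;

  /* In single-USE mode the DEF must have exactly one nondebug USE
     (the fold-mem-root will be that final USE).  */
  if (single_use && !def->single_nondebug_use ())
    return NULL;

  /* Only consider definitions that come from a real instruction.  */
  insn_info *def_insn = def->insn ();
  if (def_insn->is_artificial ())
    return NULL;

  /* We do not handle REG_EQUIV/REG_EQUAL notes for now.  */
  rtx_insn *def_rtl = def_insn->rtl ();
  if (find_reg_note (def_rtl, REG_EQUIV, NULL_RTX)
      || find_reg_note (def_rtl, REG_EQUAL, NULL_RTX))
    return NULL;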
> > -      if (DF_REF_INSN_INFO (ref_link->ref) == NULL)
> > +
> > +      /* No parallel expressions or clobbers. */
> > +      if (def_insn->num_defs () != 1)
> >  	return NULL;
> > -      if (global_regs[REGNO (reg)]
> > -	  && !set_of (reg, DF_REF_INSN (ref_link->ref)))
> > +
> > +      rtx_insn *def_rtl = def_insn->rtl ();
> > +      if (!NONJUMP_INSN_P (def_rtl) || RTX_FRAME_RELATED_P (def_rtl))
> >  	return NULL;
> > -    }
> >
> > -  if (ref_chain->next)
> > -    return NULL;
> > +      /* Check if the DEF is a SET of the expected form. */
> > +      rtx def_set = simple_regno_set (PATTERN (def_rtl), def->regno ());
> > +      if (!def_set)
> > +	return NULL;
> > -  rtx_insn *def = DF_REF_INSN (ref_chain->ref);
> > +      /* Ensure DEF and USE are in the same BB. */
> > +      if (def->bb () != insn->bb ())
> > +	return NULL;
> > -  if (BLOCK_FOR_INSN (def) != BLOCK_FOR_INSN (insn))
> > -    return NULL;
> > +      /* Ensure the definition comes before the insn. */
> > +      if (*use->insn () < *def_insn)
> > +	return NULL;
>
> This is redundant: the definition always comes earlier (in RPO) than the use.

This was added upstream in response to a bootstrap issue (PR111601).
I kept it for that reason. I'll drop it.

> > -  if (DF_INSN_LUID (def) > DF_INSN_LUID (insn))
> > -    return NULL;
> > +      return def;
> > +    }
> > -  return def;
> > +  return NULL;
> >  }
> >
> > -/* Get all uses of REG which is set in INSN. Return the use list or NULL if a
> > -   use is missing / irregular. If SUCCESS is not NULL then set it to false if
> > -   there are missing / irregular uses and true otherwise. */
> > -static df_link *
> > -get_uses (rtx_insn *insn, rtx reg, bool *success)
> > +/* Test if all USEs of DEF (which defines REG) meet certain criteria to be
> > +   foldable. Returns true iff all USEs are fine or false otherwise. */
> > +static bool
> > +has_foldable_uses_p (set_info *def, rtx reg)
> >  {
> > -  df_ref def;
> > -
> > -  if (success)
> > -    *success = false;
> > -
> > -  FOR_EACH_INSN_DEF (def, insn)
> > -    if (REGNO (DF_REF_REG (def)) == REGNO (reg))
> > -      break;
> > -
> > -  if (!def)
> > -    return NULL;
> > +  /* We only fold through instructions that are transitively used as
> > +     memory addresses and do not have other uses. Use the same logic
> > +     from offset calculation to visit instructions that can propagate
> > +     offsets and keep track of them in CAN_FOLD_INSNS. */
> > +  for (use_info *use : def->nondebug_insn_uses ())
> > +    {
>
> Would use->includes_address_uses () be useful here, as a short-cut?

After thinking about has_foldable_uses_p () again, I conclude that this
function is unnecessary in the new implementation, because now all USEs
must be part of instructions that are successfully identified by
fold_offsets_1 (and thus are foldable).
The original code used these tests (in the loop in fold_offsets ()) to
filter out instructions that are not safe to modify, as it did not
require all USEs to be part of INSNs that have been identified.

> > +      insn_info *use_insn = use->insn ();
> > +      if (use_insn->is_artificial ())
> > +	return false;
> > -      df_link *ref_chain = DF_REF_CHAIN (def);
> > -      int insn_luid = DF_INSN_LUID (insn);
> > -      basic_block insn_bb = BLOCK_FOR_INSN (insn);
> > +      /* Punt if the use is anything more complicated than a set
> > +	 (clobber, use, etc). */
> > +      rtx_insn *use_rtl = use_insn->rtl ();
> > +      if (!NONJUMP_INSN_P (use_rtl) || GET_CODE (PATTERN (use_rtl)) != SET)
> > +	return false;
> > -  for (df_link *ref_link = ref_chain; ref_link; ref_link = ref_link->next)
> > -    {
> > -      /* Problem getting a use for this instruction. */
> > -      if (ref_link->ref == NULL)
> > -	return NULL;
> > -      if (DF_REF_CLASS (ref_link->ref) != DF_REF_REGULAR)
> > -	return NULL;
> > +      /* Special case: A foldable memory store is not foldable if it
> > +	 mentions DEST outside of the address calculation. */
> > +      rtx use_set = PATTERN (use_rtl);
> > +      if (use_set && MEM_P (SET_DEST (use_set))
> > +	  && reg_mentioned_p (reg, SET_SRC (use_set)))
> > +	return false;
> > -      rtx_insn *use = DF_REF_INSN (ref_link->ref);
> > -      if (DEBUG_INSN_P (use))
> > -	continue;
> > +      if (use->bb () != def->bb ())
> > +	return false;
> > -      /* We do not handle REG_EQUIV/REG_EQ notes for now. */
> > -      if (DF_REF_FLAGS (ref_link->ref) & DF_REF_IN_NOTE)
> > -	return NULL;
> > -      if (BLOCK_FOR_INSN (use) != insn_bb)
> > -	return NULL;
> >        /* Punt if use appears before def in the basic block. See PR111601. */
> > -      if (DF_INSN_LUID (use) < insn_luid)
> > -	return NULL;
> > +      if (*use_insn < *def->insn ())
> > +	return false;
>
> As above, this check is redundant.

Will be done.

> > [...]
> > @@ -856,69 +897,82 @@ pass_fold_mem_offsets::execute (function *fn)
> >  		 "fold-mem-offsets: %d basic blocks and %d edges/basic block",
> >  		 n_basic_blocks_for_fn (cfun),
> >  		 n_edges_for_fn (cfun) / n_basic_blocks_for_fn (cfun));
> > -      return 0;
> > +      multi_use_mode = false;
> >      }
> >
> > -  df_set_flags (DF_EQ_NOTES + DF_RD_PRUNE_DEAD_DEFS +
> >  		DF_DEFER_INSN_RESCAN);
> > -  df_chain_add_problem (DF_UD_CHAIN + DF_DU_CHAIN);
> > +  /* There is a conflict between this pass and RISCV's shorten-memrefs
> > +     pass. For now disable folding if optimizing for size because
> > +     otherwise this cancels the effects of shorten-memrefs. */
> > +  cgraph_node *n = cgraph_node::get (fn->decl);
> > +  if (n && n->optimize_for_size_p ())
> > +    return 0;
> > +
> > +  /* Initialise RTL SSA. */
> > +  calculate_dominance_info (CDI_DOMINATORS);
> >    df_analyze ();
> > +  crtl->ssa = new rtl_ssa::function_info (cfun);
> >
> > -  bitmap_initialize (&can_fold_insns, NULL);
> > -  bitmap_initialize (&candidate_fold_insns, NULL);
> > -  bitmap_initialize (&cannot_fold_insns, NULL);
> > +  /* The number of instructions that were simplified or eliminated. */
> > +  int stats_fold_count = 0;
> >
> > -  stats_fold_count = 0;
> > +  /* Fold mem offsets with DEFs that have a single USE. */
> > +  stats_fold_count += fold_mem_offsets_single_use ();
> > -  basic_block bb;
> > -  rtx_insn *insn;
> > -  FOR_ALL_BB_FN (bb, fn)
> > +  /* Fold mem offsets with DEFs that have multiple USEs. */
> > +  if (multi_use_mode || flag_expensive_optimizations)
> >      {
> > -      /* There is a conflict between this pass and RISCV's shorten-memrefs
> > -	 pass. For now disable folding if optimizing for size because
> > -	 otherwise this cancels the effects of shorten-memrefs. */
> > -      if (optimize_bb_for_size_p (bb))
> > -	continue;
> > -
> > -      fold_info_map fold_info;
> > +      if (dump_file)
> > +	fprintf (dump_file, "Starting multi-use analysis\n");
> > +      stats_fold_count += fold_mem_offsets_multi_use ();
> > +    }
> > -      bitmap_clear (&can_fold_insns);
> > -      bitmap_clear (&candidate_fold_insns);
> > -      bitmap_clear (&cannot_fold_insns);
> > +  statistics_counter_event (cfun, "Number of folded instructions",
> > +			    stats_fold_count);
> > -      FOR_BB_INSNS (bb, insn)
> > -	do_analysis (insn);
> > +  free_dominance_info (CDI_DOMINATORS);
> > +  cleanup_cfg (0);
>
> With the changes suggested in the general comments, this would become:
>
>   if (crtl->ssa->perform_pending_updates ())
>     cleanup_cfg (0);

Will be done.

> > -      FOR_BB_INSNS (bb, insn)
> > -	do_fold_info_calculation (insn, &fold_info);
> > +  delete crtl->ssa;
> > +  crtl->ssa = nullptr;
> > -      FOR_BB_INSNS (bb, insn)
> > -	if (fold_mem_info **info = fold_info.get (insn))
> > -	  do_check_validity (insn, *info);
> > +  delete_trivially_dead_insns (get_insns (), max_reg_num ());
>
> Which cases is this for? Going back to the example:

There are none (this was copied from fwprop, but it is indeed useless).
I'll drop this.

> For example it can transform code like this:
>   add t4, sp, 16
>   add t2, a6, t4
>   shl t3, t2, 1
>   ld a2, 0(t3)
>   add a2, 1
>   sd a2, 8(t2)
> into the following (one instruction less):
>   add t2, a6, sp
>   shl t3, t2, 1
>   ld a2, 32(t3)
>   add a2, 1
>   sd a2, 24(t2)
> I would have expected the pass to be able to delete the t4 instruction
> on the fly, rather than need a separate instruction walk.
>
> Thanks,
> Richard