Re: [PATCH v4] Implement new RTL optimizations pass: fold-mem-offsets.
On 8/10/23 03:28, Manolis Tsamis wrote: Hi Jeff, Thanks a lot for providing all this information and testcase! I have been able to reproduce the issue with it. I have investigated the cause of the issue and it's not what you mention, the uses of all intermediate calculations are properly taken into account. In this case it would be fine to alter the runtime value of insn 58 because its uses are also memory accesses that can have a folded offset. The real issue is that the offset for these two (insn 60/61) is not updated. [ ... ] This instruction doesn't match any of these since the if with the CONST_INT only accepts arg1 being a single REG (that could be extended, but that's not the point now) and as a result we do `return 0;` But return 0 at this point loses the offset 1 calculated from arg1 previously and which is stored in `offset`. And that's our bug :) Changing that return 0 to return offset (i.e. return what we have up to now) fixes this testcase with the insn being folded and all offsets updating properly. But it got me thinking that this is a more general issue that I need to address differently. There are more `return 0;` cases in the code which say "We know how to handle this rtx code, but don't know how to propagate", but returning 0 is not enough. It's also not correct to propagate an offset from just one argument and punt on the other because the argument we punt might contain references to the other argument. Funny, I'd looked at that as well (return 0 signaling two different things), but from the standpoint of the analysis phase it didn't matter as we don't use the returned value. So I set it aside. So the general solution that solves all the issues is: If we don't fully understand how to handle an instruction and its arguments in fold_offsets then we need to mark it in one of the bitsets (either set in cannot_fold or don't set in can_fold), whereas currently a insn that has transitively all uses as foldable is foldable. I'm still struggling a bit with using the transitive set as a global like we do. I haven't come up with a case where it fails, but every time I look at it I wonder if it's going to go awry at some point. Basically we'd be looking for a case where we have two MEMs which share some bit of address calculation, where one of the MEMs is foldable, but the other is not for some reason. If we adjust the address calculations and the foldable MEM, then do we run the risk of needing to change the non-foldable MEM? Jeff
Re: [PATCH v4] Implement new RTL optimizations pass: fold-mem-offsets.
Hi Jeff, Thanks a lot for providing all this information and testcase! I have been able to reproduce the issue with it. I have investigated the cause of the issue and it's not what you mention, the uses of all intermediate calculations are properly taken into account. In this case it would be fine to alter the runtime value of insn 58 because its uses are also memory accesses that can have a folded offset. The real issue is that the offset for these two (insn 60/61) is not updated. I think this can be seen by using -fdump-rtl-fold_mem_offsets-all which also shows root memory instructions and when instructions are marked for propagation. Here's the relevant dump (a bit long but should help with this): Starting analysis from root: (insn 2 95 3 2 (set (reg/v/f:SI 3 %d3 [orig:44 pD.1867 ] [44]) (mem/f/c:SI (plus:SI (reg/f:SI 15 %sp) (const_int 60 [0x3c])) [1 pD.1867+0 S4 A32])) "/home/mtsamis/temp/fmo-bug-2/a.c":23:1 55 {*movsi_m68k2} (expr_list:REG_EQUIV (mem/f/c:SI (plus:SI (reg/f:SI 15 %sp) (const_int 60 [0x3c])) [1 pD.1867+0 S4 A32]) (nil))) Starting analysis from root: (insn 3 2 4 2 (set (reg/v/f:SI 2 %d2 [orig:45 rD.1868 ] [45]) (mem/f/c:SI (plus:SI (reg/f:SI 15 %sp) (const_int 64 [0x40])) [1 rD.1868+0 S4 A32])) "/home/mtsamis/temp/fmo-bug-2/a.c":23:1 55 {*movsi_m68k2} (expr_list:REG_EQUIV (mem/f/c:SI (plus:SI (reg/f:SI 15 %sp) (const_int 64 [0x40])) [1 rD.1868+0 S4 A32]) (nil))) Starting analysis from root: (insn 14 12 16 2 (set (mem/f:SI (reg/f:SI 15 %sp) [1 S4 A16]) (reg/v/f:SI 2 %d2 [orig:45 rD.1868 ] [45])) "/home/mtsamis/temp/fmo-bug-2/a.c":25:21 discrim 1 54 {*movsi_m68k} (expr_list:REG_ARGS_SIZE (const_int 4 [0x4]) (nil))) Starting analysis from root: (insn 42 41 43 3 (set (mem:SI (reg:SI 8 %a0 [61]) [0 MEM [(voidD.37 *)_15]+0 S4 A8]) (const_int 1633837924 [0x61626364])) "/home/mtsamis/temp/fmo-bug-2/a.c":29:4 discrim 1 55 {*movsi_m68k2} (nil)) Starting analysis from root: (insn 43 42 45 3 (set (mem:QI (plus:SI (reg:SI 8 %a0 [61]) (const_int 4 [0x4])) [0 MEM [(voidD.37 *)_15]+4 S1 A8]) (const_int 101 [0x65])) "/home/mtsamis/temp/fmo-bug-2/a.c":29:4 discrim 1 62 {*m68k.md:1130} (expr_list:REG_DEAD (reg:SI 8 %a0 [61]) (nil))) Instruction marked for propagation: (insn 41 39 42 3 (set (reg:SI 8 %a0 [61]) (plus:SI (reg/f:SI 12 %a4 [52]) (reg:SI 13 %a5 [orig:36 _14 ] [36]))) "/home/mtsamis/temp/fmo-bug-2/a.c":29:4 discrim 1 150 {*addsi3_internal} (nil)) Starting analysis from root: (insn 60 59 61 3 (set (mem:HI (reg:SI 8 %a0 [73]) [0 MEM [(voidD.37 *)_19]+0 S2 A8]) (const_int 26215 [0x6667])) "/home/mtsamis/temp/fmo-bug-2/a.c":31:4 discrim 1 58 {*m68k.md:1084} (nil)) Starting analysis from root: (insn 61 60 63 3 (set (mem:QI (plus:SI (reg:SI 8 %a0 [73]) (const_int 2 [0x2])) [0 MEM [(voidD.37 *)_19]+2 S1 A8]) (const_int 0 [0])) "/home/mtsamis/temp/fmo-bug-2/a.c":31:4 discrim 1 62 {*m68k.md:1130} (expr_list:REG_DEAD (reg:SI 8 %a0 [73]) (nil))) Instruction marked for propagation: (insn 59 58 60 3 (set (reg:SI 8 %a0 [73]) (plus:SI (reg/f:SI 12 %a4 [52]) (reg:SI 8 %a0 [72]))) "/home/mtsamis/temp/fmo-bug-2/a.c":31:4 discrim 1 150 {*addsi3_internal} (nil)) Instruction marked for propagation: (insn 58 57 59 3 (set (reg:SI 8 %a0 [72]) (plus:SI (plus:SI (reg:SI 13 %a5 [orig:36 _14 ] [36]) (reg:SI 11 %a3 [49])) (const_int 5 [0x5]))) "/home/mtsamis/temp/fmo-bug-2/a.c":31:4 407 {*lea} (expr_list:REG_DEAD (reg:SI 13 %a5 [orig:36 _14 ] [36]) (expr_list:REG_DEAD (reg:SI 11 %a3 [49]) (nil Instruction marked for propagation: (insn 39 38 41 3 (set (reg:SI 13 %a5 [orig:36 _14 ] [36]) (plus:SI (reg:SI 10 %a2 [47]) (const_int 1 [0x1]))) "/home/mtsamis/temp/fmo-bug-2/a.c":29:4 150 {*addsi3_internal} (nil)) Memory offset changed from 0 to 1 for instruction: (insn 42 41 43 3 (set (mem:SI (reg:SI 8 %a0 [61]) [0 MEM [(voidD.37 *)_15]+0 S4 A8]) (const_int 1633837924 [0x61626364])) "/home/mtsamis/temp/fmo-bug-2/a.c":29:4 discrim 1 55 {*movsi_m68k2} (nil)) deferring rescan insn with uid = 42. Memory offset changed from 4 to 5 for instruction: (insn 43 42 45 3 (set (mem:QI (plus:SI (reg:SI 8 %a0 [61]) (const_int 4 [0x4])) [0 MEM [(voidD.37 *)_15]+4 S1 A8]) (const_int 101 [0x65])) "/home/mtsamis/temp/fmo-bug-2/a.c":29:4 discrim 1 62 {*m68k.md:1130} (expr_list:REG_DEAD (reg:SI 8 %a0 [61]) (nil))) deferring rescan insn with uid = 43. Instruction folded:(insn 39 38 41 3 (set (reg:SI 13 %a5 [orig:36 _14 ] [36]) (plus:SI (reg:SI 10 %a2 [47]) (const_int 1 [0x1]))) "/home/mtsamis/temp/fmo-bug-2/a.c":29:4 150 {*addsi3_internal} (nil)) Here you can see that insn 39 is the last to be marked for
Re: [PATCH v4] Implement new RTL optimizations pass: fold-mem-offsets.
On 8/7/23 08:33, Manolis Tsamis wrote: This is a new RTL pass that tries to optimize memory offset calculations by moving them from add immediate instructions to the memory loads/stores. For example it can transform this: addi t4,sp,16 add t2,a6,t4 shl t3,t2,1 ld a2,0(t3) addi a2,1 sd a2,8(t2) into the following (one instruction less): add t2,a6,sp shl t3,t2,1 ld a2,32(t3) addi a2,1 sd a2,24(t2) Although there are places where this is done already, this pass is more powerful and can handle the more difficult cases that are currently not optimized. Also, it runs late enough and can optimize away unnecessary stack pointer calculations. gcc/ChangeLog: * Makefile.in: Add fold-mem-offsets.o. * passes.def: Schedule a new pass. * tree-pass.h (make_pass_fold_mem_offsets): Declare. * common.opt: New options. * doc/invoke.texi: Document new option. * fold-mem-offsets.cc: New file. gcc/testsuite/ChangeLog: * gcc.target/riscv/fold-mem-offsets-1.c: New test. * gcc.target/riscv/fold-mem-offsets-2.c: New test. * gcc.target/riscv/fold-mem-offsets-3.c: New test. We still have the m68k issue to deal with. We've got these key insns going into fold-mem-offsets: (insn 39 38 41 3 (set (reg:SI 13 %a5 [orig:36 _14 ] [36]) (plus:SI (reg:SI 10 %a2 [47]) (const_int 1 [0x1]))) "j.c":19:3 157 {*addsi3_internal} (nil)) (insn 41 39 42 3 (set (reg:SI 8 %a0 [61]) (plus:SI (reg/f:SI 12 %a4 [52]) (reg:SI 13 %a5 [orig:36 _14 ] [36]))) "j.c":19:3 discrim 1 157 {*addsi3_internal} (nil)) (insn 42 41 43 3 (set (mem:SI (reg:SI 8 %a0 [61]) [0 MEM [(void *)_15]+0 S4 A8]) (const_int 1633837924 [0x61626364])) "j.c":19:3 discrim 1 55 {*movsi_m68k2} (nil)) (insn 43 42 45 3 (set (mem:QI (plus:SI (reg:SI 8 %a0 [61]) (const_int 4 [0x4])) [0 MEM [(void *)_15]+4 S1 A8]) (const_int 101 [0x65])) "j.c":19:3 discrim 1 62 {*m68k.md:1130} (expr_list:REG_DEAD (reg:SI 8 %a0 [61]) (nil))) [ ... ] (insn 58 57 59 3 (set (reg:SI 8 %a0 [72]) (plus:SI (plus:SI (reg:SI 13 %a5 [orig:36 _14 ] [36]) (reg:SI 11 %a3 [49])) (const_int 5 [0x5]))) "j.c":24:3 421 {*lea} (expr_list:REG_DEAD (reg:SI 13 %a5 [orig:36 _14 ] [36]) (expr_list:REG_DEAD (reg:SI 11 %a3 [49]) (nil f-m-o will propagate the (const_int 1) from insn 39 into insns 42 & 43 and modifies insn 39 into a simple copy. Note carefully that turning insn 39 into a simple copy changes the value in a5. As a result insn 58 computes the wrong value and the test (strlenopt-2) fails on the m68k. In fold_offsets we have this code: /* We only fold through instructions that are transitively used as memory addresses and do not have other uses. Use the same logic from offset calculation to visit instructions that can propagate offsets and keep track of them in CAN_FOLD_INSNS. */ bool success; struct df_link *uses = get_uses (def, dest, ), *ref_link; if (!success) return 0; for (ref_link = uses; ref_link; ref_link = ref_link->next) { rtx_insn* use = DF_REF_INSN (ref_link->ref); if (DEBUG_INSN_P (use)) continue; /* Punt if the use is anything more complicated than a set (clobber, use, etc). */ if (!NONJUMP_INSN_P (use) || GET_CODE (PATTERN (use)) != SET) return 0; /* This use affects instructions outside of CAN_FOLD_INSNS. */ if (!bitmap_bit_p (_fold_insns, INSN_UID (use))) return 0; rtx use_set = PATTERN (use); /* Special case: A foldable memory store is not foldable if it mentions DEST outside of the address calculation. */ if (use_set && MEM_P (SET_DEST (use_set)) && reg_mentioned_p (dest, SET_SRC (use_set))) return 0; } AFAICT that code is supposed to detect if there are uses outside of the set of insns that can be optimized. In our case DEF is a0 (the base of the memory references) and its def is insn 41. The assignment of a0 in insn 41 only reaches insns 42 and 43 which are marked in can_fold_insns. That's find and good, but insufficient for correctness. ISTM we must look at the uses of any insn where we change the result of the computation, not just the def of the base register in the memory reference. In this particular case we're going to modify insn 39, so we need to look at the uses of a5 (the value defined by insn 39). If we did that we'd see the use of a5 at insn 38 and insn 38 and reject the optimization. Testcase below. You should be able to see the incorrect transformation with a m68k-linux-gnu cross compiler with -O2. typedef unsigned int size_t; extern void abort (void); void *calloc (size_t, size_t); void
[PATCH v4] Implement new RTL optimizations pass: fold-mem-offsets.
This is a new RTL pass that tries to optimize memory offset calculations by moving them from add immediate instructions to the memory loads/stores. For example it can transform this: addi t4,sp,16 add t2,a6,t4 shl t3,t2,1 ld a2,0(t3) addi a2,1 sd a2,8(t2) into the following (one instruction less): add t2,a6,sp shl t3,t2,1 ld a2,32(t3) addi a2,1 sd a2,24(t2) Although there are places where this is done already, this pass is more powerful and can handle the more difficult cases that are currently not optimized. Also, it runs late enough and can optimize away unnecessary stack pointer calculations. gcc/ChangeLog: * Makefile.in: Add fold-mem-offsets.o. * passes.def: Schedule a new pass. * tree-pass.h (make_pass_fold_mem_offsets): Declare. * common.opt: New options. * doc/invoke.texi: Document new option. * fold-mem-offsets.cc: New file. gcc/testsuite/ChangeLog: * gcc.target/riscv/fold-mem-offsets-1.c: New test. * gcc.target/riscv/fold-mem-offsets-2.c: New test. * gcc.target/riscv/fold-mem-offsets-3.c: New test. Signed-off-by: Manolis Tsamis --- Changes in v4: - Add DF_EQ_NOTES flag to avoid incorrect state in notes. - Remove fold_mem_offsets_driver and enum fold_mem_phase. - Call recog when patching offsets in do_commit_offset. - Restore INSN_CODE after modifying insn in do_check_validity. Changes in v3: - Added propagation for more codes: sub, neg, mul. - Added folding / elimination for sub and const int moves. - For the validity check of the generated addresses also test memory_address_addr_space_p. - Replaced GEN_INT with gen_int_mode. - Replaced some bitmap_head with auto_bitmap. - Refactor each phase into own function for readability. - Add dump details. - Replace rtx iteration with reg_mentioned_p. - Return early for codes that we can't propagate through. Changes in v2: - Made the pass target-independant instead of RISCV specific. - Fixed a number of bugs. - Add code to handle more ADD patterns as found in other targets (x86, aarch64). - Improved naming and comments. - Fixed bitmap memory leak. gcc/Makefile.in | 1 + gcc/common.opt| 4 + gcc/doc/invoke.texi | 8 + gcc/fold-mem-offsets.cc | 725 ++ gcc/passes.def| 1 + .../gcc.target/riscv/fold-mem-offsets-1.c | 16 + .../gcc.target/riscv/fold-mem-offsets-2.c | 24 + .../gcc.target/riscv/fold-mem-offsets-3.c | 17 + gcc/tree-pass.h | 1 + 9 files changed, 797 insertions(+) create mode 100644 gcc/fold-mem-offsets.cc create mode 100644 gcc/testsuite/gcc.target/riscv/fold-mem-offsets-1.c create mode 100644 gcc/testsuite/gcc.target/riscv/fold-mem-offsets-2.c create mode 100644 gcc/testsuite/gcc.target/riscv/fold-mem-offsets-3.c diff --git a/gcc/Makefile.in b/gcc/Makefile.in index e99628cec07..d717e90bc44 100644 --- a/gcc/Makefile.in +++ b/gcc/Makefile.in @@ -1431,6 +1431,7 @@ OBJS = \ fixed-value.o \ fold-const.o \ fold-const-call.o \ + fold-mem-offsets.o \ function.o \ function-abi.o \ function-tests.o \ diff --git a/gcc/common.opt b/gcc/common.opt index 0888c15b88f..5f3d3e9706f 100644 --- a/gcc/common.opt +++ b/gcc/common.opt @@ -1248,6 +1248,10 @@ fcprop-registers Common Var(flag_cprop_registers) Optimization Perform a register copy-propagation optimization pass. +ffold-mem-offsets +Target Bool Var(flag_fold_mem_offsets) Init(1) +Fold instructions calculating memory offsets to the memory access instruction if possible. + fcrossjumping Common Var(flag_crossjumping) Optimization Perform cross-jumping optimization. diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 674f956f4b8..30974309e9b 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -540,6 +540,7 @@ Objective-C and Objective-C++ Dialects}. -fauto-inc-dec -fbranch-probabilities -fcaller-saves -fcombine-stack-adjustments -fconserve-stack +-ffold-mem-offsets -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks -fcx-fortran-rules -fcx-limited-range @@ -14314,6 +14315,13 @@ the comparison operation before register allocation is complete. Enabled at levels @option{-O1}, @option{-O2}, @option{-O3}, @option{-Os}. +@opindex ffold-mem-offsets +@item -ffold-mem-offsets +@itemx -fno-fold-mem-offsets +Try to eliminate add instructions by folding them in memory loads/stores. + +Enabled at levels @option{-O2}, @option{-O3}. + @opindex fcprop-registers @item -fcprop-registers After register allocation and post-register allocation instruction splitting, diff --git