https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121857
Bug ID: 121857
Summary: pair-fusion should replace the core ldp/stp peepholes on aarch64
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: acoplan at gcc dot gnu.org
Target Milestone: ---

This PR is about replacing the LDP/STP peepholes on AArch64 with the pair-fusion pass.

While there are already many cases where pair-fusion is more powerful than the peepholes, perhaps surprisingly, there are still cases that the peepholes can handle which pair-fusion cannot. We should improve pair-fusion so that it can eventually replace the peepholes and the sched-fusion machinery in the AArch64 backend. This was one of the original design goals of the pass, but it turns out that the first round of implementation didn't quite get there.

The peepholes can be divided into three main groups:
1. Core peepholes.
2. Sign/zero-extending peepholes.
3. Out-of-range peepholes (combining 4 accesses into 2 pairs + anchor).

This PR is mostly about achieving parity with the first group (where we are currently limited by pair discovery in the pass); handling the other two cases in pair-fusion is orthogonal and future work.

For a concrete example of an opportunity that we should handle but currently don't, take the following:

```c
void g(int *,int *);
int f(void) {
    int x,y;
    x = y = 0;
    g(&x,&y);
}
```

with -O2 we get:

```
f:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        add     x0, sp, 24
        add     x1, sp, 28
        stp     wzr, wzr, [sp, 24]
        bl      g
        ldp     x29, x30, [sp], 32
        ret
```

but -O2 -fno-peephole2 shows pair-fusion misses the stp:

```
f:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        add     x0, sp, 24
        add     x1, sp, 28
        str     wzr, [sp, 24]
        str     wzr, [sp, 28]
        bl      g
        ldp     x29, x30, [sp], 32
        ret
```

Looking at the `ldp_fusion1` dump, we have:

```
[bb 2] tracking insn 7 via mem expr <var_decl 0x7f2903606ed8 y> [L=0 FP=0, SImode, off=0]
[bb 2] tracking insn 8 via mem expr <var_decl 0x7f2903606e40 x> [L=0 FP=0, SImode, off=0]
```

The problem is that the current strategy in pair-fusion is to prefer a `MEM_EXPR` base if it exists for a given access. That is, for accesses with both RTL register and `MEM_EXPR` bases, we only track them via their `MEM_EXPR` base. This is to avoid the bookkeeping nightmare that would ensue if tracking accesses relative to both kinds of base simultaneously.

This strategy works well for e.g. accesses to the same array or struct, but falls short in this example. In this example we have separate `MEM_EXPR` bases (x and y var_decls) but a common RTL register base (the soft frame pointer).

I propose the following approach to handle this and related cases. Roughly, when processing a new access:

(1) If the access has both an RTL and tree base, record the relationship between those bases in a canonicalization table.
(2) If either base already has an entry in the canonicalization table, track the access relative to the canonical base C.
(3) If the conditions for both (1) and (2) are true, add any missing entries to the canonicalization table implied by the relationship between the bases, mapping those bases onto C.

We have a hierarchy of bases which (as things stand) is:

```
RTL-SSA defs < MEM_EXPR tree decls
```

and we canonicalize up the hierarchy.
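The three steps above can be sketched as a small toy model in C. This is only an illustration under stated assumptions, not pair-fusion's actual data structures: `struct base`, `struct canon_table`, `struct access`, `resolve`, `canonicalize` and `track_access` are all hypothetical names, bases are opaque objects compared by pointer, and the table is a fixed-size array searched linearly. Each entry means "address of `from` equals address of `to` plus `delta`".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an RTL-SSA def or a MEM_EXPR decl.  */
struct base { const char *name; };

/* One table entry: address(from) == address(to) + delta.  */
struct canon_entry {
  const struct base *from;
  const struct base *to;
  long delta;
};

#define MAX_ENTRIES 16

struct canon_table {
  struct canon_entry entries[MAX_ENTRIES];
  size_t count;
};

/* An access with up to two bases; a NULL base means "not available".  */
struct access {
  const struct base *rtl_base;
  long rtl_off;
  const struct base *tree_base;
  long tree_off;
};

/* Where an access ends up being tracked: base + off.  */
struct tracked {
  const struct base *base;
  long off;
};

static const struct canon_entry *
lookup (const struct canon_table *tab, const struct base *b)
{
  for (size_t i = 0; i < tab->count; i++)
    if (tab->entries[i].from == b)
      return &tab->entries[i];
  return NULL;
}

/* Follow table entries from (b, off) until reaching a base with no entry.  */
static struct tracked
resolve (const struct canon_table *tab, const struct base *b, long off)
{
  const struct canon_entry *e;
  while ((e = lookup (tab, b)) != NULL)
    {
      off += e->delta;
      b = e->to;
    }
  return (struct tracked) { b, off };
}

/* Record that FROM and C name the same address, mapping FROM's base onto
   C's base.  Both bases are already fully resolved, so no cycle can form.  */
static void
canonicalize (struct canon_table *tab, struct tracked from, struct tracked c)
{
  if (from.base == c.base)
    return;
  assert (tab->count < MAX_ENTRIES);
  tab->entries[tab->count++]
    = (struct canon_entry) { from.base, c.base, c.off - from.off };
}

/* Return the (canonical base, offset) pair under which ACC is tracked.  */
struct tracked
track_access (struct canon_table *tab, const struct access *acc)
{
  assert (acc->rtl_base || acc->tree_base);
  if (!acc->tree_base)
    return resolve (tab, acc->rtl_base, acc->rtl_off);
  if (!acc->rtl_base)
    return resolve (tab, acc->tree_base, acc->tree_off);

  struct tracked r = resolve (tab, acc->rtl_base, acc->rtl_off);
  struct tracked t = resolve (tab, acc->tree_base, acc->tree_off);

  /* Step (2): reuse an existing canonical base if only the RTL side has
     one; otherwise go up the hierarchy and use the tree side.  */
  int rtl_known = r.base != acc->rtl_base;
  int tree_known = t.base != acc->tree_base;
  struct tracked c = (rtl_known && !tree_known) ? r : t;

  /* Steps (1) and (3): add whichever mappings onto C are still missing.  */
  canonicalize (tab, r, c);
  canonicalize (tab, t, c);
  return c;
}
```

In this model, tracking an access at `sfp - 4` with decl base `y` and then an access at `sfp - 8` with decl base `x` yields `(y, 0)` and `(y, -4)`: both end up relative to `y`, four bytes apart. A real implementation would also have to invalidate entries when an RTL base is redefined, which this sketch ignores.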
To make this concrete again, for the above example, suppose we have the following RTL:

```
(insn 5 2 6 2 (set (mem/c:SI (plus:DI (reg/f:DI 64 sfp)
                (const_int -4 [0xfffffffffffffffc])) [1 y+0 S4 A32])
        (const_int 0 [0])) "t.c":4:9 103 {*movsi_aarch64}
     (nil))
(insn 6 5 7 2 (set (mem/c:SI (plus:DI (reg/f:DI 64 sfp)
                (const_int -8 [0xfffffffffffffff8])) [1 x+0 S4 A64])
        (const_int 0 [0])) "t.c":4:5 103 {*movsi_aarch64}
     (nil))
```

For i5 we would track it relative to the `MEM_EXPR` base y (as we do currently) but also note the relationship `sfp - 4 = y`. Then, when we process i6, we would do the following:

1. Recall the canonicalization `sfp - 4 -> y`, and track the access relative to y instead of x.
2. Add the canonicalization `x -> y - 4` to the table.

Thus both accesses are tracked relative to the same `MEM_EXPR` base y, and we can see the adjacency and form the pair.

As an aside, I believe this approach can be further generalized in the presence of section anchors to handle cases such as the following:

```c
int x,y;
void f(void) {
    x = y = 0;
}
```

for which we currently generate:

```
f:
        adrp    x0, .LANCHOR0
        add     x1, x0, :lo12:.LANCHOR0
        str     wzr, [x0, #:lo12:.LANCHOR0]
        str     wzr, [x1, 4]
        ret
```

but this could be an stp with x1 as the base. We could simply extend the above approach to add section anchors as a third kind of base, at the top of the hierarchy, giving:

```
RTL-SSA defs < MEM_EXPR decls < section anchors
```

which should enable us to handle such opportunities too.
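One way to picture "canonicalize up the hierarchy" with three kinds of base is to give each kind a rank and always map onto the higher-ranked base. The following is a minimal sketch of that idea only; the enum, `struct base` and `pick_canonical` are hypothetical names chosen for illustration and do not correspond to anything in the pass today.

```c
#include <assert.h>
#include <stddef.h>

/* Kinds of base, ordered from the bottom to the top of the hierarchy
   (hypothetical names).  */
enum base_kind {
  BASE_RTL_DEF,
  BASE_MEM_EXPR,
  BASE_SECTION_ANCHOR
};

struct base {
  enum base_kind kind;
  const char *name;
};

/* Return whichever of A and B sits higher in the hierarchy.  On a tie,
   keep A, so a base that is already canonical is not displaced.  */
const struct base *
pick_canonical (const struct base *a, const struct base *b)
{
  assert (a != NULL && b != NULL);
  return b->kind > a->kind ? b : a;
}
```

With this ordering, adding a new kind of base only means adding an enumerator in the right position; the choice of canonical base stays a simple comparison.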