[Bug rtl-optimization/121857] New: pair-fusion should replace the core ldp/stp peepholes on aarch64

acoplan at gcc dot gnu.org via Gcc-bugs Mon, 08 Sep 2025 08:03:06 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121857


            Bug ID: 121857
           Summary: pair-fusion should replace the core ldp/stp peepholes
                    on aarch64
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

This PR is about replacing the LDP/STP peepholes on AArch64 with the
pair-fusion pass.  While there are already many cases where pair-fusion is more
powerful than the peepholes, perhaps surprisingly, there are still cases that
the peepholes can handle which pair-fusion cannot.  We should improve
pair-fusion so that it can eventually replace the peepholes and the
sched-fusion machinery in the AArch64 backend.  This was one of the original
design goals of the pass, but it turns out that the first round of
implementation didn't quite get there.

The peepholes can be divided into three main groups:
1. Core peepholes.
2. Sign/zero-extending peepholes.
3. Out-of-range peepholes (combining 4 accesses into 2 pairs + anchor).

This PR is mostly about achieving parity with the first group (where we are
currently limited by pair discovery in the pass); handling the other two cases
in pair-fusion is orthogonal and future work.  For a concrete example of an
opportunity that we should handle but currently don't, take the following:

```c
void g(int *,int *);
void f(void) {
  int x,y;
  x = y = 0;
  g(&x,&y);
}
```

with -O2 we get:

```
f:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        add     x0, sp, 24
        add     x1, sp, 28
        stp     wzr, wzr, [sp, 24]
        bl      g
        ldp     x29, x30, [sp], 32
        ret
```

but -O2 -fno-peephole2 shows pair-fusion misses the stp:

```
f:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        add     x0, sp, 24
        add     x1, sp, 28
        str     wzr, [sp, 24]
        str     wzr, [sp, 28]
        bl      g
        ldp     x29, x30, [sp], 32
        ret
```

looking at the `ldp_fusion1` dump, we have:

```
[bb 2] tracking insn 7 via mem expr <var_decl 0x7f2903606ed8 y> [L=0 FP=0, SImode, off=0]
[bb 2] tracking insn 8 via mem expr <var_decl 0x7f2903606e40 x> [L=0 FP=0, SImode, off=0]
```

the problem is that the current strategy in pair-fusion is to prefer a
`MEM_EXPR` base if one exists for a given access.  That is, for accesses with
both RTL register and `MEM_EXPR` bases, we only track them via their `MEM_EXPR`
base.  This avoids the bookkeeping nightmare that would ensue if we tracked
accesses relative to both kinds of base simultaneously.  This strategy works
well for e.g. accesses to the same array or struct, but falls short in this
example.
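
As a rough illustration of the rule just described, the toy model below files
each access under a single base, preferring the `MEM_EXPR` decl when there is
one.  The struct and function names (`access`, `key`, `tracking_key`,
`same_base`) are invented for this sketch and are not taken from the
pair-fusion sources; register number 64 stands in for the soft frame pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model of one memory access.  All names here are hypothetical and
   are not taken from GCC's pair-fusion sources.  */
struct access {
  const char *decl; /* MEM_EXPR decl name, or NULL if there is none */
  long decl_off;    /* byte offset from the decl */
  int regno;        /* RTL base register number */
  long reg_off;     /* byte offset from the register */
};

/* The base an access is filed under, plus its offset from that base.  */
struct key {
  bool is_decl;
  const char *decl;
  int regno;
  long off;
};

/* Prefer the MEM_EXPR base when present; otherwise use the register.  */
struct key tracking_key(struct access a)
{
  assert(a.regno >= 0); /* every access in this model has a register base */
  if (a.decl != NULL)
    return (struct key){ true, a.decl, -1, a.decl_off };
  return (struct key){ false, NULL, a.regno, a.reg_off };
}

/* Only accesses filed under the same base are candidates for pairing.  */
bool same_base(struct key a, struct key b)
{
  if (a.is_decl != b.is_decl)
    return false;
  return a.is_decl ? strcmp(a.decl, b.decl) == 0 : a.regno == b.regno;
}
```

In this model, the stores to `y` (at `sfp - 4`) and `x` (at `sfp - 8`) get the
keys `y + 0` and `x + 0`, so `same_base` rejects them even though their
register-relative addresses are adjacent.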

In this example we have separate `MEM_EXPR` bases (x and y var_decls) but a
common RTL register base (the soft frame pointer).  I propose the following
approach to handle this and related cases:

Roughly, when processing a new access:
 (1) If the access has both an RTL base and a tree base, record the
     relationship between those bases in a canonicalization table.
 (2) If either base already has an entry in the canonicalization table, track
     the access relative to the canonical base C.
 (3) If the conditions for both (1) and (2) are true, add any missing entries
     to the canonicalization table implied by the relationship between the
     bases, mapping those bases onto C.

We have a hierarchy of bases which (as things stand) is:
```
RTL-SSA defs < MEM_EXPR tree decls
```
and we canonicalize up the hierarchy.
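
Steps (1)-(3) could be sketched roughly as follows.  This is only a minimal
model under stated assumptions: the types (`base`, `addr`, `canon_table`) and
functions (`lookup`, `add_entry`, `canonicalize`, `track_access`) are
hypothetical names, bases are identified by small integers, the table is a
fixed-size array, and the case where both bases already map to different
canonical bases is not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical base kinds, ordered so that a larger value ranks higher.  */
enum base_kind { BASE_RTL_DEF = 0, BASE_MEM_EXPR = 1 };

struct base { enum base_kind kind; int id; };

/* An address expressed as base + byte offset.  */
struct addr { struct base base; long off; };

/* Table entry meaning: address of `from` == canon.base + canon.off.  */
struct canon_entry { struct base from; struct addr canon; };

#define MAX_ENTRIES 64
struct canon_table { struct canon_entry e[MAX_ENTRIES]; size_t n; };

bool base_eq(struct base a, struct base b)
{
  return a.kind == b.kind && a.id == b.id;
}

/* Return the canonical address recorded for B, or NULL if there is none.  */
const struct addr *lookup(const struct canon_table *t, struct base b)
{
  for (size_t i = 0; i < t->n; i++)
    if (base_eq(t->e[i].from, b))
      return &t->e[i].canon;
  return NULL;
}

/* Record FROM -> CANON unless it is an identity or already known.  */
void add_entry(struct canon_table *t, struct base from, struct addr canon)
{
  if (base_eq(from, canon.base) || lookup(t, from) != NULL)
    return;
  assert(t->n < MAX_ENTRIES);
  t->e[t->n].from = from;
  t->e[t->n].canon = canon;
  t->n++;
}

/* Rewrite A relative to its canonical base if the table knows one.  */
struct addr canonicalize(const struct canon_table *t, struct addr a)
{
  const struct addr *c = lookup(t, a.base);
  if (c != NULL) {
    a.off += c->off;
    a.base = c->base;
  }
  return a;
}

/* Process one access.  RTL is its register-relative address; TREE is its
   MEM_EXPR-relative address, or NULL if it has none.  Returns the address
   the access should be tracked under.  */
struct addr track_access(struct canon_table *t, struct addr rtl,
                         const struct addr *tree)
{
  if (tree == NULL)
    return canonicalize(t, rtl);

  /* Step (2): reuse an existing canonical base if there is one.  With no
     entry for the RTL base we fall back to the (possibly canonicalized)
     tree base, which ranks higher in the hierarchy.  */
  struct addr chosen = lookup(t, rtl.base) != NULL
                         ? canonicalize(t, rtl)
                         : canonicalize(t, *tree);

  /* Steps (1) and (3): both addresses name the same location, so each
     base equals chosen.base plus the difference in offsets.  */
  add_entry(t, rtl.base, (struct addr){ chosen.base, chosen.off - rtl.off });
  add_entry(t, tree->base, (struct addr){ chosen.base, chosen.off - tree->off });
  return chosen;
}
```

A linear-scan array keeps the sketch short; a real implementation would
presumably use a hash map keyed on the base.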

To make this concrete again, for the above example, suppose we have the
following RTL:

```
(insn 5 2 6 2 (set (mem/c:SI (plus:DI (reg/f:DI 64 sfp)
                (const_int -4 [0xfffffffffffffffc])) [1 y+0 S4 A32])
        (const_int 0 [0])) "t.c":4:9 103 {*movsi_aarch64}
     (nil))
(insn 6 5 7 2 (set (mem/c:SI (plus:DI (reg/f:DI 64 sfp)
                (const_int -8 [0xfffffffffffffff8])) [1 x+0 S4 A64])
        (const_int 0 [0])) "t.c":4:5 103 {*movsi_aarch64}
     (nil))
```

for i5 we would track it relative to the `MEM_EXPR` base y (as we do currently)
but also note the relationship `sfp - 4 = y`.  Then, when we process i6, we
would do the following:
1. Recall the canonicalization `sfp - 4 -> y`, and track the access relative to
y instead of x.
2. Add the canonicalization `x -> y - 4` to the table.

Thus both accesses are tracked relative to the same `MEM_EXPR` base y, and we
can see the adjacency and form the pair.
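
Once both accesses share a base, the adjacency check is plain offset
arithmetic: i5 sits at `y + 0` and i6 at `y - 4`, each 4 bytes wide.  The
helper below is a hypothetical sketch of that check (the `tracked` struct and
`adjacent` function are not GCC names), assuming equal-sized accesses only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record of a tracked access: byte offset from the shared
   canonical base, plus the access size in bytes.  */
struct tracked { long off; long size; };

/* True if two same-sized accesses cover back-to-back bytes, in either
   order, relative to the same base.  */
bool adjacent(struct tracked a, struct tracked b)
{
  assert(a.size > 0 && b.size > 0);
  if (a.size != b.size)
    return false;
  long lo = a.off < b.off ? a.off : b.off;
  long hi = a.off < b.off ? b.off : a.off;
  return hi - lo == a.size;
}
```

For the two stores above, `adjacent((struct tracked){ 0, 4 }, (struct tracked){ -4, 4 })`
holds, with the lower access at `y - 4` supplying the address for the pair.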

As an aside, I believe this approach can be further generalized in the presence
of section anchors to handle cases such as the following:

```c
int x,y;
void f(void) { x = y = 0; }
```

for which we currently generate:

```
f:
        adrp    x0, .LANCHOR0
        add     x1, x0, :lo12:.LANCHOR0
        str     wzr, [x0, #:lo12:.LANCHOR0]
        str     wzr, [x1, 4]
        ret
```

but this could be an stp with x1 as the base.  We could simply extend the above
approach to add section anchors as a third kind of base, at the top of the
hierarchy, giving:
```
RTL-SSA defs < MEM_EXPR decls < section anchors
```
which should enable us to handle such opportunities too.
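
One way to picture the extended hierarchy is as an ordered enum, with the
canonical base for a location being whichever known base ranks highest.  The
names below (`base_kind`, `base`, `pick_canonical`) are hypothetical and only
illustrate the ordering, not any existing code in the pass.

```c
#include <assert.h>

/* Hypothetical base kinds, ordered so that a larger value ranks higher:
   RTL-SSA defs < MEM_EXPR decls < section anchors.  */
enum base_kind { BASE_RTL_DEF, BASE_MEM_EXPR, BASE_SECTION_ANCHOR };

struct base { enum base_kind kind; int id; };

/* Given two bases for the same location, pick the one to canonicalize
   onto: the higher-ranked kind wins, and ties keep the first argument.  */
struct base pick_canonical(struct base a, struct base b)
{
  /* Guard against an out-of-range kind.  */
  assert(a.kind <= BASE_SECTION_ANCHOR && b.kind <= BASE_SECTION_ANCHOR);
  return b.kind > a.kind ? b : a;
}
```

Under this ordering, the decls `x` and `y` in the example would both
canonicalize onto the shared section anchor, which is what would expose their
adjacency.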
