Richard,
> One reason for the single-use check is that we want to avoid the
> transform for a loop exit check where the result prevents coalescing
> of the in-loop IV before/after update and thus requires a non-empty
> latch block. IIRC there's code that tries to fixup during out-of-SSA,
> but please double-check this actually works
I've investigated this. insert_backedge_copies() in tree-outof-ssa.cc
(lines 1289-1316) handles exactly this pattern: when a PHI result is
used in an EQ/NE condition and the PHI arg is defined as
result +/- INTEGER_CST, it adjusts the condition to use the
post-increment value, restoring coalescability.
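As a source-level analogy only (the real fixup works on GIMPLE PHI nodes, and the function names below are made up for illustration), the adjustment amounts to moving an exit test from the pre-increment value onto the post-increment value. The sketch assumes an unsigned IV so that wraparound is well defined:

```c
#include <assert.h>

/* Exit test on the pre-increment value: the old and the new IV are
   both live at the test, so they cannot share one register.  */
unsigned count_pre(unsigned start, unsigned bound)
{
  unsigned i = start, trips = 0;
  for (;;) {
    unsigned old = i;   /* plays the role of the PHI result */
    i = old + 1;        /* PHI arg: result + INTEGER_CST */
    trips++;
    if (old == bound)   /* EQ test on the PHI result */
      break;
  }
  return trips;
}

/* Same loop with the test moved onto the post-increment value:
   old == bound  <=>  old + 1 == bound + 1 (modulo 2^N).  */
unsigned count_post(unsigned start, unsigned bound)
{
  unsigned i = start, trips = 0;
  for (;;) {
    i = i + 1;
    trips++;
    if (i == bound + 1) /* only the updated IV is live here */
      break;
  }
  return trips;
}
```

Both functions run the same number of iterations for any start/bound pair; only the second leaves a single live IV value at the exit test.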
Empirically, I built the compiler with and without the patch and
compared the generated assembly on three targets (x86-64, AArch64 and
RISC-V) for several IV loop patterns designed to trigger the conflict
(tight loops with unknown start/bound, with and without additional IV
uses). No extra copy appeared in any inner loop.
Results for the hot inner loops:

tight_unknown (tight loop, unknown IV start):
  x86-64:  patched slightly better (fewer callee-saves, testl vs cmpl $1)
  AArch64: patched better (4 vs 5 insns -- cbnz fuses cmp+branch;
           1 vs 2 callee-saves)
  RISC-V:  patched slightly worse (4 vs 3 insns -- see below)

tight_use_after (IV + array access in loop body):
  x86-64:  patched slightly worse (8 vs 7 insns -- different IVOPTS
           addressing)
  AArch64: neutral (8 vs 8 insns)
  RISC-V:  neutral (6 vs 6 insns)
None of these differences comes from a coalescing failure; they stem
from downstream pass decisions. The one RISC-V regression in
tight_unknown is a static profile estimation artefact: the
compare-against-zero heuristic assigns 74% probability to `i == 0`
(vs 20% for the baseline's `i+1 == 1`), causing the cold sink() call
path to become the fall-through. This does not occur with PGO.
Note that this relaxation only fires when the folded constant is
zero, i.e., (X + C) == C -> X == 0. In a loop IV context this means
the check is true only when the IV is zero (typically the first
iteration), which compilers tend to peel or constant-fold anyway.
The primary beneficiary is non-loop code like the motivating case
(++*a == 1 -> *a == 0).
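To make the motivating case concrete, here is a small sketch of the two forms side by side. The function names and the choice of `unsigned` are assumptions for illustration; the second function is a hand-written equivalent of what the fold produces, not compiler output:

```c
#include <assert.h>

/* Motivating shape: increment through a pointer, then compare the
   result against the increment constant.  */
unsigned inc_eq_before(unsigned *a)
{
  return ++*a == 1;
}

/* Hand-folded form, (X + 1) == 1 -> X == 0: test the old value
   against zero, then store the incremented value.  */
unsigned inc_eq_after(unsigned *a)
{
  unsigned old = *a;
  *a = old + 1;
  return old == 0;
}
```

The two agree for every input, including the wraparound case where the old value is the maximum unsigned value and the incremented value is 0.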
Is the patch OK as-is, or would you prefer an additional guard
(e.g., checking that @3 is not defined inside a loop)?
Philipp