https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125809
--- Comment #6 from ptomsich at gcc dot gnu.org ---
Many more measurements later (and after tracing through more dump files than
is comfortable), here's more information that should provide a better
understanding of what is happening:
1. To everyone's surprise: it's a backend stall, not a frontend stall. What
happens is that *after cloning*, inlining and unrolling happen and push the
GENERAL_REGS pressure up to 28 (with 27 available); we spill a loop-carried
counter (i.e., an induction variable) because its cost is much smaller than
for any MEM.
We end up with the following assembly (fast is before the commit I flagged):
; fast: counter in register     ; slow: counter spilled to stack
add w21, w21, #1                ldr w14, [sp, #0x78]   ; reload
                                add w5, w14, #1
                                str w5, [sp, #0x78]    ; store back
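The slow sequence above can be approximated in a micro-benchmark without
reproducing the full register-allocation scenario. The following is a minimal
sketch, assuming that a volatile local is an acceptable stand-in for the
spilled counter; the function names are hypothetical and are not taken from
the test case.

```cpp
#include <cassert>
#include <cstdint>

// Fast form: the compiler is free to keep the counter in a register
// (or even fold the whole loop away).
std::uint32_t count_in_register(std::uint32_t n) {
  std::uint32_t counter = 0;
  for (std::uint32_t i = 0; i < n; ++i)
    counter = counter + 1;
  return counter;
}

// Slow form: volatile forces a load and a store of the counter on every
// iteration, so each iteration depends on the previous iteration's store.
// This only mimics a spill; it is not what the register allocator emits.
std::uint32_t count_via_memory(std::uint32_t n) {
  volatile std::uint32_t counter = 0;
  for (std::uint32_t i = 0; i < n; ++i)
    counter = counter + 1;  // load, add, store
  return counter;
}

// Both variants compute the same value; only where the counter lives differs.
void check_equivalent(std::uint32_t n) {
  assert(count_in_register(n) == count_via_memory(n));
}
```

Timing the two functions for a large n gives a rough idea of how much the
load/add/store recurrence costs on a given core.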
The summary for the two clones:
| `process_bb` clone | allocnos | peak GPR pressure | spilled allocnos | total alloc cost | **spill (mem) cost** | **cost / spill** |
|---|---|---|---|---|---|---|
| `.constprop.0` (iterate=0) | 1713 | **GENERAL_REGS=28** | 28 | 911 | 534 | 19.1 |
| **`.constprop.1` (iterate=1, HOT)** | 1272 | **GENERAL_REGS=28** | 25 | 1432 | **736** | **29.4** |
This translates to a ~1.6-1.8% regression on a Neoverse-N1 and a larger one on
an AmpereOne (it's not a bigger penalty on the u-arch, but rather an
optimisation in the u-arch that gets defeated...).
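The "cost / spill" column in the table above appears to be the spill (mem)
cost divided by the number of spilled allocnos. That reading is inferred from
the numbers rather than stated in the dumps, and the helper name below is
hypothetical. A small sketch of the arithmetic:

```cpp
#include <cassert>

// Average memory cost per spilled allocno (assumed meaning of "cost / spill").
double cost_per_spill(int spill_mem_cost, int spilled_allocnos) {
  assert(spilled_allocnos > 0);
  return static_cast<double>(spill_mem_cost) / spilled_allocnos;
}
```

With the table's values, 534 / 28 is about 19.1 for .constprop.0 and 736 / 25
is about 29.4 for .constprop.1, i.e. the hot clone spills fewer allocnos but
each spill is more expensive.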
2. The exact chain of events (as far as we could trace it today):
2.i. ad3fb (Honza's IPA-CP cost-model improvement): frequency-weights the
devirtualization bonus. Under a guessed profile this over-values specializing,
so IPA-CP creates the specialized clone in which the indirect VN-dispatch call
becomes a direct, inlinable edge. Without ad3fb that edge is either indirect or
not valued highly enough to exist. This is the necessary precondition for the
problem to trigger.
2.ii. IPA-inline: the now-direct edge sits in a doubly-nested loop, so its
guessed frequency is large, which makes "numerator = inlining_speedup × ..."
large and "badness = −numerator/denominator" strongly negative. The same
guessed frequency makes "e->maybe_hot_p" true, so in
"want_inline_small_function_p" the call takes the hot path that relaxes the
size limits.
Result: the body is pulled into the loop. big_speedup_p is also true here,
which is exactly why the !big_speedup_p escape in my first cut was wrong and
why I dropped it.
2.iii. post-IPA: -funroll-loops (4×) on the now-fat loop body pushes the
number of simultaneously live GP values to ~30, exceeding what the 31-entry
GPR file can supply after coalescing/clobbers, and RA spills the loop-carried
VN counter. The resulting store-to-load forwarding recurrence causes the
runtime regression.
3. The original proposal (from the report) is a big hammer that merely keeps
the overall process_bb just on the good side of the register-pressure limit.
So here's the updated proposal:
Teach ipa-inline to decline pulling a register-hungry body into an
already-pressured loop nest: at analyze_function_body time, we estimate each
body's peak register pressure. In want_inline_small_function_p, when the call
sits at loop depth >= D and caller + callee combined GP pressure exceeds a
limit L, refuse the edge.
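To make the proposed check concrete, here is a minimal standalone sketch. It
is not GCC code: the struct and function names are hypothetical, D and L are
modelled as plain tunables, and "combined" pressure is assumed to be the sum
of the two per-body peak estimates (a deliberately pessimistic choice; a real
implementation might use the caller's pressure at the call site instead).

```cpp
#include <cassert>

// Hypothetical per-function summary, as it might be recorded at
// analyze_function_body time. Not an existing GCC structure.
struct pressure_summary {
  int peak_gp_pressure;  // estimated peak number of live GP values
};

// Hypothetical tunables corresponding to D and L in the proposal.
struct pressure_limits {
  int min_loop_depth;    // D: only consider calls at this loop depth or deeper
  int max_gp_pressure;   // L: combined pressure above this refuses the edge
};

// Returns true when the inline edge should be refused.
bool refuse_edge_for_pressure(const pressure_summary &caller,
                              const pressure_summary &callee,
                              int call_loop_depth,
                              const pressure_limits &limits) {
  assert(call_loop_depth >= 0);
  // Shallow call sites are left to the existing heuristics.
  if (call_loop_depth < limits.min_loop_depth)
    return false;
  // Assumption: combine the estimates by simple addition.
  int combined = caller.peak_gp_pressure + callee.peak_gp_pressure;
  return combined > limits.max_gp_pressure;
}
```

For example, with D = 2 and L = 27, a call at loop depth 2 with estimated
pressures 20 (caller) and 12 (callee) would be refused, while the same call at
loop depth 1 would not be affected by this check.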
Does this sound like a more reasonable design?