[Bug ipa/125809] [16/17 Regression] ipa-cp over-clones under a guessed profile: devirtualization_time_bonus frequency-weighting (r16-3990-gad3fb999a1b568) inflates the bonus, ~8% slower cc1 build (SPEC 721.gcc_r)

ptomsich at gcc dot gnu.org via Gcc-bugs Tue, 16 Jun 2026 14:29:24 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125809


--- Comment #6 from ptomsich at gcc dot gnu.org ---
Many more measurements later (and after tracing through more dump files than
is comfortable), here's more information that should provide a better
understanding of what is happening:

1. To everyone's surprise: it's a backend stall, not a frontend stall.  What
happens is that *after cloning*, inlining and unrolling happen and push
GENERAL_REGS pressure up to 28 (with 27 available); we spill a loop-carried
counter (i.e., an induction variable) because its spill cost is much smaller
than for any MEM.

We end up with the following assembly (fast is before the commit I flagged):

; fast — counter in register                 ; slow — counter spilled to stack
add  w21, w21, #1                            ldr  w14, [sp, #0x78]   ; reload
                                             add  w5,  w14, #1
                                             str  w5,  [sp, #0x78]   ; store back


The summary for the two clones:

| `process_bb` clone | allocnos | peak GPR pressure | spilled allocnos | total alloc cost | **spill (mem) cost** | **cost / spill** |
|---|---|---|---|---|---|---|
| `.constprop.0` (iterate=0) | 1713 | **GENERAL_REGS=28** | 28 | 911 | 534 | 19.1 |
| **`.constprop.1` (iterate=1, HOT)** | 1272 | **GENERAL_REGS=28** | 25 | 1432 | **736** | **29.4** |
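
As a cross-check on the last column, here is a minimal Python sketch. It
assumes "cost / spill" means the spill (mem) cost divided by the number of
spilled allocnos; the figures are copied from the table, and the dictionary
and function names are purely illustrative.

```python
# Figures copied from the table above. Assumption: "cost / spill" is the
# spill (mem) cost divided by the number of spilled allocnos.
clones = {
    ".constprop.0": {"spilled": 28, "spill_cost": 534},
    ".constprop.1": {"spilled": 25, "spill_cost": 736},
}


def cost_per_spill(entry):
    """Average memory cost attributed to one spilled allocno."""
    return entry["spill_cost"] / entry["spilled"]


# Round to one decimal place, matching the precision used in the table.
ratios = {name: round(cost_per_spill(e), 1) for name, e in clones.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```

Under that reading the hot clone spills fewer allocnos, but each spill is
roughly 1.5x as expensive as in the cold clone.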

This translates to a ~1.6-1.8% regression on a Neoverse-N1 and a larger one on
an AmpereOne (it's not a bigger penalty on the u-arch, but rather an
optimisation in the u-arch that gets defeated...).

2. The exact chain of events (as far as we could trace it today):

2.i. ad3fb (Honza's IPA-CP cost-model improvement): frequency-weights the
devirtualization bonus. Under a guessed profile this over-values specializing,
so IPA-CP creates the specialized clone in which the indirect VN-dispatch call
becomes a direct, inlinable edge. Without ad3fb that edge is either indirect or
not valued highly enough to exist. This is the necessary precondition for the
problem to trigger.

2.ii. IPA-inline: the now-direct edge sits in a doubly-nested loop, so its
guessed frequency is large, which makes "numerator = inlining_speedup × ..."
large and "badness = −numerator/denominator" strongly negative. The same
guessed frequency makes "e->maybe_hot_p" true, so in
"want_inline_small_function_p" the call takes the hot path that relaxes the
size limits.
Result: the body is pulled into the loop. big_speedup_p is also true here —
which is exactly why my first cut's !big_speedup_p escape was wrong and I
dropped it.

2.iii. post-IPA: -funroll-loops (4×) on the now-fat loop body pushes
simultaneous live GP values to ~30, which exceeds what is left of the 31-entry
GPR file after coalescing/clobbers, and RA spills the loop-carried VN counter.
The store/load-forwarding recurrence causes the runtime regression.

3. The original proposal (from the report) is a big hammer: it merely keeps
process_bb as a whole just on the good side of the register-pressure limit.


So here's the updated proposal:
Teach ipa-inline to decline pulling a register-hungry body into an
already-pressured loop nest: at analyze_function_body time, we estimate each
body's peak register pressure. In want_inline_small_function_p, when the call
sits at loop depth >= D and caller + callee combined GP pressure exceeds a
limit L, refuse the edge.
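
To make the shape of the check concrete, here is a minimal Python sketch of
the predicate (not GCC code). BodySummary, refuse_for_pressure and the
parameter names are hypothetical, the values used for D and L are
placeholders, and summing caller and callee pressure is only a crude upper
bound on the combined pressure.

```python
from dataclasses import dataclass


@dataclass
class BodySummary:
    """Hypothetical per-function summary; not an existing GCC structure."""
    peak_gp_pressure: int  # assumed to be estimated at analyze_function_body time


def refuse_for_pressure(caller, callee, loop_depth, min_depth, pressure_limit):
    """Return True when the inline edge should be refused.

    min_depth plays the role of D and pressure_limit the role of L.
    """
    if loop_depth < min_depth:
        # Shallow call sites are left to the existing heuristics.
        return False
    # Crude estimate: assume the two peaks can coincide after inlining.
    combined = caller.peak_gp_pressure + callee.peak_gp_pressure
    return combined > pressure_limit


# Placeholder values: D = 2 (doubly-nested loop), L = 27 (registers available).
caller = BodySummary(peak_gp_pressure=20)
callee = BodySummary(peak_gp_pressure=10)
in_nest = refuse_for_pressure(caller, callee, loop_depth=2,
                              min_depth=2, pressure_limit=27)
outside = refuse_for_pressure(caller, callee, loop_depth=1,
                              min_depth=2, pressure_limit=27)
print(in_nest, outside)
```

The check only fires inside a deep enough loop nest, so call sites outside
such nests keep their current inlining behaviour.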

Does this sound like a more reasonable design?
