https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125809
Bug ID: 125809
Summary: [16 Regression] ipa-cp over-clones under a guessed
profile: devirtualization_time_bonus
frequency-weighting (ad3fb999a1b) inflates the bonus,
~8% slower cc1 build (SPEC 721.gcc_r)
Product: gcc
Version: 16.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: ipa
Assignee: unassigned at gcc dot gnu.org
Reporter: ptomsich at gcc dot gnu.org
Target Milestone: ---
r16 commit ad3fb999a1b56893f0f6296a52fe2af550763fee "Improve ipa-cp
devirtualization costing" changed devirtualization_time_bonus to weight each
devirtualizable indirect call's saving by ie->combined_sreal_frequency().
That is correct with a real profile, but under a guessed/static profile (e.g.
-O2/LTO without PGO) the frequency of a hot, loop-nested function is
over-estimated. This pushes it past the per-value cloning threshold, creating
extra context clones.
Concretely, it makes SPEC CPU 2026 721.gcc_r (-Ofast -flto -mcpu=ampere1, no
PGO) ~8% slower (319.2 s -> 347.7 s).
tree-ssa-sccvn.cc:process_bb (called from do_rpo_vn) is split into two context
clones (iterate=0/iterate=1) instead of one.
Self-contained reproducer (gcc -O2 -fdump-ipa-cp-details t.c):
int sink;
extern int (*gp) (int);
static int cb (int x) {
  int r = x;
  r = r*3+1; r ^= r>>2; r += r<<3; r -= r>>1;
  r = r*5+7; r ^= r>>4; r += r<<2; r -= r>>3;
  r = r*9+2; r ^= r>>5; return r;
}
static int __attribute__((noinline))
worker (int (*fn)(int), int *a, int n, int m) {
  int s = 0;
  for (int j = 0; j < m; j++)
    for (int i = 0; i < n; i++)
      s += fn (a[i]);
  return s;
}
void caller0 (int *a, int n, int m) { sink += worker (cb, a, n, m); }
void caller1 (int *a, int n, int m) { sink += worker (gp, a, n, m); }
Before ad3fb999a1b (and with our proposed fix) worker is not specialized: the
one hot indirect call is not worth a clone under a guessed profile. After
ad3fb999a1b it is cloned (Creating a specialized node of worker):
good_cloning_opportunity_p evaluation jumps from 153 to 4153 (threshold 500)
purely from the frequency factor.
Proposed fix (validated: process_bb back to one clone ... which recovers the
~8%): frequency-weight the bonus only when ie->count.reliable_p(); otherwise
use the unweighted saving (the pre-ad3fb999a1b behaviour). This keeps the improvement
for PGO/AFDO and avoids the over-cloning under guessed profiles.
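The gating described above can be illustrated with a small standalone model. This is only a sketch and not the prototype patch: struct toy_edge, toy_edge_bonus and toy_devirt_bonus are hypothetical names invented for illustration, and the count_reliable flag merely stands in for ie->count.reliable_p(). The real change would live inside devirtualization_time_bonus and operate on GCC's own edge and profile types.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one devirtualizable indirect call edge.
   The fields are illustrative only; this is not GCC's cgraph_edge.  */
struct toy_edge
{
  int saving;           /* Unweighted time saving from devirtualizing.  */
  double frequency;     /* Estimated executions per caller invocation.  */
  bool count_reliable;  /* Models ie->count.reliable_p ().  */
};

/* Weight the saving by the edge frequency only when the profile is
   reliable; under a guessed profile return the unweighted saving.  */
double
toy_edge_bonus (const struct toy_edge *e)
{
  assert (e->saving >= 0 && e->frequency >= 0.0);
  if (e->count_reliable)
    return e->saving * e->frequency;
  return e->saving;
}

/* Sum the per-edge bonus over N edges.  */
double
toy_devirt_bonus (const struct toy_edge *edges, int n)
{
  double total = 0.0;
  for (int i = 0; i < n; i++)
    total += toy_edge_bonus (&edges[i]);
  return total;
}
```

In this model, an edge with a guessed frequency of 100 contributes only its unweighted saving, while the same edge with a reliable count contributes 100 times as much. That mirrors the intent of keeping the frequency weighting for PGO/AFDO while not letting a guessed loop-nest frequency dominate the cloning decision.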
We have a prototype patch that I can attach if this sounds like a good
direction.