https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631
--- Comment #5 from Peter Cordes <pcordes at gmail dot com> ---
(In reply to Richard Biener from comment #2)
> (In reply to Hongtao Liu from comment #1)
> > It's done by r12-1958, it's better for dcache, but worse for icache, small
> > benchmark in the commit show broadcast from integer is slightly better than
> > constant pool, maybe we should make it as a u-arch specific tuning.
>
> I see it was benchmarked on Intel CPU which have a shared register file, I
> was specifically wondering of the AMD case where any integer <-> FP/vector
> boundary crossing incurs a latency penalty.

Intel and AMD CPUs both have 3 separate register files: integer, SIMD/FP, and
x87/k-mask registers.  (I forget if AMD shares a PRF for x87 and AVX-512 mask
registers, but they both definitely have separate PRFs for GPR vs. X/Y/ZMM.
The Chips and Cheese Zen 5 article shows it using a separate EFLAGS register
file, too; Intel has enough extra bits in every integer PRF entry to hold a
GPR + condition-codes result.)

https://www.realworldtech.com/haswell-cpu/6/
https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram
https://chipsandcheese.com/p/lion-cove-intels-p-core-roars
https://chipsandcheese.com/p/amds-ryzen-9950x-zen-5-on-desktop

The difference is in the latency of domain-crossing uops that read integer
and write SIMD/FP or vice versa.  It's lower on Intel P-cores, higher on AMD
and on Intel E-cores.  That's probably a matter of how physically close the
wiring is.  And/or perhaps of the difficulty of building a scheduler that can
track dependencies across both domains?  If the latter is a factor: Intel
(before Lion Cove) uses a mostly-unified scheduler whose entries can hold
either vector or integer ALU uops, plus another group of entries for memory.
Lion Cove has separate scheduler entries that can only hold vector or only
hold integer uops.  AMD has a couple of separate scheduling queues for groups
of SIMD/FP execution ports; Intel E-cores are like that, too.
(https://stackoverflow.com/questions/72577590/why-does-movd-movq-between-gp-and-simd-registers-have-quite-high-latency)

> In reality what is faster always depends on the surrounding code, but
> IMO code size (and uop cache space) wins easily.  Quite likely
> doing full vector loads from the constant pool for XMM initialization
> is better than broadcast from scalar, possibly even YMM, for the same
> reason.

It depends on the code.  Simple code that processes a *lot* of data can have
a relatively small I-cache footprint while churning through D-cache so much
that constants are likely to be evicted.  (Maybe they survive in L3 thanks to
an adaptive replacement policy that tries to avoid letting loops over huge
working sets blow everything away.)  In code with really long-running loops,
the cost of setting up constants is amortized over many iterations so it
barely matters, but it gets more interesting with cache-blocking, where
you're alternating between a few hot functions.

I'd be very skeptical of full-vector constants unless that enables using them
as memory source operands for other ops (without AVX-512 for embedded
broadcasts).  Even then it's a real tradeoff: VPBROADCASTD loads are a single
uop and have the same throughput and latency as VMOVUPS, and D-cache pressure
is a real thing, too.
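To be concrete about what we're comparing, a minimal sketch (hypothetical
function and made-up constant; any of the commented sequences is a valid way
to compile it with -mavx2):

  #include <immintrin.h>

  __m256i add_bias(__m256i v) {
      // Full 32-byte constant, foldable as a memory operand:
      //   vpaddd  ymm0, ymm0, YMMWORD PTR .LC0[rip]
      // 4-byte constant: single load uop, but an extra instruction:
      //   vpbroadcastd ymm1, DWORD PTR .LC0[rip]
      //   vpaddd       ymm0, ymm0, ymm1
      // No .rodata at all, but crosses the integer->SIMD domain
      // (the "broadcast from integer" strategy from r12-1958):
      //   mov          eax, 0x12345
      //   vmovd        xmm1, eax
      //   vpbroadcastd ymm1, xmm1
      //   vpaddd       ymm0, ymm0, ymm1
      return _mm256_add_epi32(v, _mm256_set1_epi32(0x12345));
  }

The full-width load wins on instruction count when the constant is used once
and can't be hoisted; the broadcast forms win on .rodata footprint (4 or 0
bytes instead of 32).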
I was discussing this with Maxim Egorushkin recently; he was saying
(https://stackoverflow.com/questions/30674291/how-to-check-inf-for-avx-intrinsic-m256/79732626#comment140653747_31203100)
that things like asm("" : "+v"(x), "+r"(y)); were often necessary to block
constant-propagation into a load from .rodata when a purely-ALU strategy for
materializing a constant was better, especially cases like VPCMPEQD / VPSRLD
to make a mask like 0x7FFFFFFF.  (Linear algebra HPC on Zen 3, where
apparently L1d cache pressure is a big deal for his use-case, so loading
unnecessary constants was a big problem.)

So yeah, speaking of that, set1_epi32(0x7FFFFFFF) should definitely use
VPCMPEQD / VPSRLD, especially if you also need an all-ones constant for
something else, or can make other constants from the same all-ones.
(https://stackoverflow.com/questions/35085059/what-are-the-best-instruction-sequences-to-generate-vector-constants-on-the-fly)

(But on Intel CPUs, beware that a register value written by SIMD-integer
instructions adds extra latency for *both* operands of FP instructions
reading it, indefinitely, long after it's written back.  So materializing an
FP constant like 1.0 from an all-ones from VPCMPEQD could infect the critical
path using it with extra latency until the next XSAVE/XRSTOR.
https://stackoverflow.com/questions/64116679/haswell-avx-fma-latencies-tested-1-cycle-slower-than-intels-guide-says
- it's a thing on Skylake and Haswell at least; IDK if the effect is present
on later uarches.)

> I'm not sure how to best do a micro-benchmark measuring the actual
> latency of the variants in question.

Latency would usually only be relevant after a branch mispredict, I-cache
miss, or other stall.  Normally out-of-order exec can get constants
materialized ahead of runtime-variable data being ready.  (Unless we get a
D-cache miss loading from .rodata, in which case it can be a big stall.)

Throughput is a much more relevant metric for use-cases that don't get
hoisted out of loops.  In that case a 4-byte broadcast-load is the obvious
winner, especially without AVX-512: it only costs a tiny bit of extra code
size vs. a full-vector memory operand, no extra uops (not even in the
back-end), and it saves a lot of D-cache footprint, which is nice when you
have multiple constants (especially ones that get loaded together).

Your microbenchmark measures throughput for the hot-cache case, which of
course makes loads look good.  It also makes I-cache and D-cache pressure
irrelevant.

4 bytes is probably the narrowest we should go for a memory source:
VPBROADCASTB/W cost an extra ALU shuffle uop in the back-end, which is worse
for latency and competes for the shuffle port, often a bottleneck in SIMD
algorithms that do any shuffling (so competing with shuffles from the
previous loop, for example).  VPBROADCASTB/W do stay micro-fused, though, so
they still only take up 1 slot in the uop cache and even in the ROB.
VPBROADCASTD/Q (and VBROADCASTF128/I128) are single-uop, handled entirely by
the load execution unit on Intel and AMD.

A wide non-broadcast load can be worth considering for XMM when we don't have
AVX available for VBROADCASTSS or AVX2 for VPBROADCASTD: the extra front-end
cost of MOVD + PSHUFD instead of just a single load is kinda bad.

AVX-512 makes MOV-immediate + ALU a lot more attractive, since VPBROADCASTD/Q
with a GPR source saves an instruction vs. MOVD + shuffle.  That seems like a
good choice when available, if we can't get the constant from an all-ones
vector in one instruction, like a shift, or PABSB/W/D (for set1(1)), or
something.
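A minimal sketch of the VPCMPEQD / VPSRLD trick with that barrier (function
name hypothetical; using the "x" constraint since this only assumes AVX2,
where Maxim's "v" also allows xmm16-31 with AVX-512):

  #include <immintrin.h>

  // set1_epi32(0x7FFFFFFF) from ALU ops instead of a .rodata load:
  //   vpcmpeqd ymm0, ymm0, ymm0   # all-ones, no load, no input dependency
  //   vpsrld   ymm0, ymm0, 1      # 0x7FFFFFFF in every element
  static inline __m256i abs_mask(void) {
      __m256i ones = _mm256_set1_epi32(-1);  // compiles to vpcmpeqd same,same,same
      asm("" : "+x"(ones));   // hide the all-ones value from the optimizer so
                              // it can't fold the shift back into a .rodata load
      return _mm256_srli_epi32(ones, 1);
  }

Without the empty asm, the compiler constant-folds the whole thing back into
a 32-byte constant-pool load, which is exactly what his use-case was trying
to avoid.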
I don't know what strategy is best as a default, especially for just AVX2,
but for most programs I'd be surprised if full-width non-broadcast vectors
are better.  (Again, unless code can't hoist its constants, because there
aren't any loops or because of register pressure, and full-width enables
using them as memory-source operands.)
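For reference, what the AVX-512 options mentioned above look like on the same
hypothetical function (GPR-source VPBROADCASTD and embedded broadcast both
need AVX-512VL for YMM):

  #include <immintrin.h>

  __m256i add_bias_512(__m256i v) {
      // mov-immediate + GPR-source broadcast: no .rodata, no separate VMOVD:
      //   mov          eax, 0x12345
      //   vpbroadcastd ymm1, eax
      //   vpaddd       ymm0, ymm0, ymm1
      // or embedded broadcast folds a 4-byte constant into the consumer:
      //   vpaddd ymm0, ymm0, DWORD PTR .LC0[rip]{1to8}
      return _mm256_add_epi32(v, _mm256_set1_epi32(0x12345));
  }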
