[Bug target/123631] Odd choice for vector constant materialization

pcordes at gmail dot com via Gcc-bugs Mon, 19 Jan 2026 07:51:40 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631


--- Comment #8 from Peter Cordes <pcordes at gmail dot com> ---
(In reply to Richard Biener from comment #7)
> As for register files I'd have expected Intel to have a unified GPR and
> k-mask register file given their lack of split integer / FP domains.  OTOH
> the weird setup of having separate k* instructions is a sign they don't.

Intel does have separate SIMD-integer and SIMD-FP domains with bypass latency,
e.g. you'll see 1c extra latency for VPADDD on a VMULPS result and vice versa. 
(And like I mentioned that extra latency "infects" the critical path through
the other operand even long after the VPADDD is retired.  At least on
Haswell/Skylake and presumably later.)

*Some* execution units (notably shuffles) are connected to both forwarding
networks so can be used without penalty either way, because those execution
units are expensive to build.  Bitwise booleans too, I think, because they're
often useful.  Blend units are different: they probably replicate them, with
an integer blend unit connected to the SIMD-integer forwarding network for
(V)PBLENDVB / (V)PBLENDW / VPBLENDD, and a separate unit for (V)BLEND(V)P[SD]
on the FP forwarding network.  It costs an extra cycle of latency on input and
output to use VBLENDPS instead of VPBLENDD between VPADDD instructions.  (Which
can still be better for throughput if AVX2 VPBLENDD isn't available.)  Both of
those forwarding networks do write-back to the same vector physical register
file since the instructions work on the same X/Y/ZMM registers.
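
To make the blend case concrete, here is a minimal C sketch of the two dependency chains, using the standard Intel intrinsics.  The function names are made up for illustration, and it assumes a GCC- or clang-style compiler on x86 (for the target attribute).  Both functions compute the same result; only the domain of the blend instruction differs.  A compiler is free to pick a different instruction than the intrinsic suggests, so check the generated asm before measuring anything.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = blend(2*a[i], 2*b[i]) + 1, odd lanes from b.
   The blend stays on the SIMD-integer side, between VPADDD instructions. */
__attribute__((target("avx2")))
void blend_chain_int(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m256i x = _mm256_loadu_si256((const __m256i *)a);
    __m256i y = _mm256_loadu_si256((const __m256i *)b);
    x = _mm256_add_epi32(x, x);                     /* VPADDD */
    y = _mm256_add_epi32(y, y);                     /* VPADDD */
    __m256i m = _mm256_blend_epi32(x, y, 0xAA);     /* VPBLENDD (AVX2) */
    m = _mm256_add_epi32(m, _mm256_set1_epi32(1));  /* VPADDD */
    _mm256_storeu_si256((__m256i *)out, m);
}

/* Same computation, but the blend is the FP-domain instruction, so its
   inputs and output cross between forwarding networks. */
__attribute__((target("avx2")))
void blend_chain_fp(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m256i x = _mm256_loadu_si256((const __m256i *)a);
    __m256i y = _mm256_loadu_si256((const __m256i *)b);
    x = _mm256_add_epi32(x, x);                     /* VPADDD */
    y = _mm256_add_epi32(y, y);                     /* VPADDD */
    __m256 mf = _mm256_blend_ps(_mm256_castsi256_ps(x),
                                _mm256_castsi256_ps(y), 0xAA);  /* VBLENDPS */
    __m256i m = _mm256_add_epi32(_mm256_castps_si256(mf),
                                 _mm256_set1_epi32(1));         /* VPADDD */
    _mm256_storeu_si256((__m256i *)out, m);
}
```

The casts are free (they only reinterpret the type), so any latency difference between the two versions would come from the blend itself, assuming the compiler keeps the instruction choice.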

A couple major CPU-design considerations are involved here:

 * Not needing too many long wires, both for propagation delay and drive
strength, and the difficulty of routing them past each other.  So putting a
separate register file near the SIMD/FP execution units is a lot better for
that than having every SIMD execution unit needing to be able to read the GPR
register file.  (Which has to be near all the integer execution units and
load/store stuff.)  Modern CPUs already have many layers of tiny wires running
above the silicon since it's unavoidable to need some (like for the one
execution unit that handles movd/movq/ptest/ucomisd and other SIMD->integer
operations.)

 * Register-file read/write ports: An instruction like vptestmd (%rdi, %rcx),
%ymm1, %k0{%k1} has two mask operands and two GPR operands, all of which are
read and k0 is written (zero-masking).  Intel CPUs currently don't micro-fuse
it, decoding as a separate load (even with a non-indexed addressing mode), but
architecting AVX-512 to use GPRs as mask regs like vmovaps (%rdi), %zmm0{%ax}
would make it impossible to use a separate register file.  And would mean one
instruction could read 4 entries from the GPR register file (and write one).  

Intel CPUs can only track 3 input dependencies to a single uop (since Haswell,
which added FMA).  It probably makes scheduling easier to have k registers be
separate instead of using GPRs as masks.
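
For anyone not used to mask destinations, here is a scalar C model of what the 256-bit vptestmd example computes.  It is only an illustration of the operand roles, not of how the hardware does it.  The function name is made up, and the mem array stands in for the 32 bytes the addressing mode points at.

```c
#include <stdint.h>

/* Scalar model of: vptestmd mem, %ymm1, %k0{%k1}   (8 dword lanes).
   Inputs: the memory operand, ymm1, and the k1 mask.  Result: k0. */
uint8_t vptestmd_ymm_model(const uint32_t mem[8], const uint32_t ymm1[8],
                           uint8_t k1)
{
    uint8_t k0 = 0;  /* lanes masked off, or testing as zero, stay 0 */
    for (int i = 0; i < 8; i++) {
        int lane_enabled = (k1 >> i) & 1;
        int nonzero = (mem[i] & ymm1[i]) != 0;
        if (lane_enabled && nonzero)
            k0 |= (uint8_t)(1u << i);
    }
    return k0;
}
```

With GPRs as mask registers, the k1 input and the k0 result here would both come out of the same register file as the two address registers.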

 You're right that even with separate k architectural registers, they could in
theory share a PRF with GPRs as an implementation detail, like how k regs
actually share a PRF with x87/MMX.  Then having k regs be separate would be a
matter of reducing register pressure for code that needs a lot of pointers /
counters and masks.  And they'd still need to provide instructions like kmov
and kadd.  (But could in theory provide a lot more, like kmul / kpopcnt / kpdep
/ kpext, although they'd probably have a separate forwarding network from the
GPRs.)

 Another factor is that using up 8 integer-PRF entries on the architectural
state for the k regs would effectively reduce its capacity by 8 all the time,
even when no instructions are reading or writing k mask registers.  PRF size is
one of the limiting factors for out-of-order window size, especially in pure
integer code since none of the instructions are writing any other kind of
register.  (Only Intel P6-family had a separate retirement-register-file to
hold cold values: it didn't have PRFs, and kept in-flight results in the ROB
(reorder buffer) itself.)
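
As back-of-the-envelope arithmetic for the capacity point, a tiny C sketch.  The function name is made up, and the model is deliberately crude: it only subtracts the entries pinned by architectural state and ignores everything else that limits the out-of-order window.  The 180-entry figure in the usage note below is an assumption for illustration (roughly the size usually quoted for Skylake's integer PRF).

```c
/* Entries left for renaming in-flight results after the architectural
   state has claimed its share of the physical register file. */
unsigned rename_headroom(unsigned prf_entries, unsigned arch_regs_held)
{
    return prf_entries > arch_regs_held ? prf_entries - arch_regs_held : 0;
}
```

Under that assumption, rename_headroom(180, 16) gives 164 for the 16 GPRs alone, and rename_headroom(180, 16 + 8) gives 156 if the 8 k registers lived in the same PRF, whether or not any code touches them.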

Hope that helps with your intuition for what makes sense in CPU design.
