https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631
--- Comment #9 from Peter Cordes <pcordes at gmail dot com> ---
(In reply to Peter Cordes from comment #8)
> * Register-file read/write ports: An instruction like vptestmd (%rdi,
> %rcx), %ymm1, %k0{%k1} has two mask operands and two GPR operands, all of
> which are read and k1 is written (zero-masking). Intel CPUs currently don't
> micro-fuse it
On second thought, that argument doesn't make sense. Even if micro-fused, it
would still execute in the unfused domain (scheduler / execution units) as two
separate uops, a load and a test-into-mask. The scheduler uses separate
entries for the two uops in Sandybridge and later, and of course register-file
reads are separate.
There may still be something to the argument about register-file read and
write ports, although the integer GPR PRF already needs a *lot* of read and
write ports to handle as many 3-input single-uop integer instructions (like
CMOV or ADC) as there are integer ALU ports, plus loads (2 reads + a write) and
store-data (1 read) uops. Also, when AVX-512 was new, it was designed for
Larrabee / KNL, a totally separate uarch from current P-cores.
Having integer and SIMD ALUs on the same execution ports was an Intel P-core
thing (dating to P6) which most other uarches don't do, and Lion Cove also
moved away from it. With separate ports, some SIMD instructions could execute
in the same cycle as the usual max number of integer instructions, so using
GPRs as mask registers would increase the peak number of GPR register
reads+writes in a single cycle. (That peak can be hit e.g. right after a load
result becomes available that a lot of uops were waiting for, even if the
front-end isn't that wide.)
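
To put rough numbers on the port-counting argument above, here is a minimal back-of-the-envelope sketch. The port counts are made up for illustration and don't describe any real CPU, and the `PortMix` / `peak_gpr_reads` / `peak_gpr_writes` names are hypothetical. It assumes 3 reads per integer ALU uop, 2 reads + 1 write per load, 1 read per store-data uop, and (if masks lived in GPRs) 1 extra read and 1 extra write per SIMD uop on a port that isn't shared with an integer ALU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortMix:
    """Hypothetical per-cycle execution-port counts (illustrative only)."""
    int_alu: int     # ports that can each run a 3-input integer uop (CMOV, ADC)
    load: int        # load ports: 2 GPR reads (base + index), 1 result write
    store_data: int  # store-data ports: 1 GPR read each
    simd: int        # SIMD ports NOT shared with an integer ALU port


def peak_gpr_reads(p: PortMix, masks_in_gprs: bool = False) -> int:
    """Worst-case GPR register-file reads in a single cycle."""
    reads = 3 * p.int_alu + 2 * p.load + 1 * p.store_data
    if masks_in_gprs:
        reads += p.simd  # assumption: one governing-mask read per SIMD uop
    return reads


def peak_gpr_writes(p: PortMix, masks_in_gprs: bool = False) -> int:
    """Worst-case GPR register-file writes in a single cycle."""
    writes = p.int_alu + p.load  # one GPR result per ALU uop and per load
    if masks_in_gprs:
        writes += p.simd  # assumption: e.g. a compare/test-into-mask result
    return writes


# Made-up machine: 4 integer ALU ports, 2 load, 1 store-data, 2 separate SIMD.
separate = PortMix(int_alu=4, load=2, store_data=1, simd=2)
# Same machine, but with SIMD ALUs sharing ports with the integer ALUs.
shared = PortMix(int_alu=4, load=2, store_data=1, simd=0)

for name, mix in (("separate SIMD ports", separate), ("shared ports", shared)):
    print(name,
          "reads:", peak_gpr_reads(mix), "->", peak_gpr_reads(mix, True),
          "writes:", peak_gpr_writes(mix), "->", peak_gpr_writes(mix, True))
```

Under these assumptions the shared-port case doesn't change at all, since a SIMD uop displaces an integer uop that would have needed at least as many reads, while the separate-port case grows by one read and one write per SIMD port.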