https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
On a related (off-topic) note, we see %kN register pressure issues, mainly in
cases where packing/unpacking is required due to differing data sizes. IMO the
ISA missed the chance to allow something like
vpcmpd $4, %zmm2, %zmm0, %k0
vmovdqu64 %zmm1, (%rsi,%rax,2){%k0[0]}
vmovdqu64 %zmm1, 64(%rsi,%rax,2){%k0[1]}
to use an (aligned) portion of a mask register with a larger number of elements.
Instead we have to do something like
kshiftrw $8, %k0, %k1
The reverse, packing multiple %k registers into a single %k with a larger
element count, probably cannot be done without a kunpckbw or the like; a vpcmp
writing to a sub-part of %k0 would likely be awkward.
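For reference, here is the current workaround expressed with AVX-512 intrinsics
(a minimal sketch; the function and variable names are illustrative, not taken
from GCC output):

#include <immintrin.h>
#include <stdint.h>

/* A 32-bit compare yields a 16-lane mask, but each masked 64-bit store
   only consumes 8 of those lanes, so the upper half has to be extracted
   with a kshift first.  */
void
store_if_neq (int64_t *dst, const int32_t *keys, __m512i limit, __m512i val)
{
  __m512i k32 = _mm512_loadu_si512 (keys);
  /* vpcmpd $4 (neq): one mask bit per 32-bit lane, 16 bits total.  */
  __mmask16 m = _mm512_cmpneq_epi32_mask (k32, limit);
  /* The low 8 lanes use the mask directly ...  */
  _mm512_mask_storeu_epi64 (dst, (__mmask8) m, val);
  /* ... the high 8 lanes need the kshiftrw $8 mentioned above, tying up
     a second %k register.  */
  _mm512_mask_storeu_epi64 (dst + 8, (__mmask8) _kshiftri_mask16 (m, 8), val);
}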
We're also trying to mimic SVE/RVV by fully masking loops, which requires
computing the loop mask from the number of remaining scalar iterations.
We're doing
leal -16(%rdx), %ecx
vpbroadcastd %ecx, %zmm1
vpcmpud $6, %zmm2, %zmm1, %k2
but keep a separate scalar loop control because the above is quite high
latency if you follow it with a ktest + branch. %zmm2 is just
{ 0, 1, 2, 3, 4, 5, ... } and %rdx/%ecx are the remaining scalar iterations.
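A minimal sketch of that fully-masked loop shape in AVX-512 intrinsics
(the kernel, loop structure and names are illustrative, assuming a simple
copy loop rather than actual vectorizer output):

#include <immintrin.h>
#include <stdint.h>

void
copy_masked (int32_t *dst, const int32_t *src, int64_t n)
{
  /* %zmm2 in the snippet above: { 0, 1, 2, ..., 15 }.  */
  const __m512i iota = _mm512_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7,
                                          8, 9, 10, 11, 12, 13, 14, 15);
  for (int64_t i = 0; i < n; i += 16)
    {
      /* vpbroadcastd + vpcmpud $6 (nle/gt): lane l is active
         iff remaining = n - i > l.  */
      __m512i rem = _mm512_set1_epi32 ((int32_t) (n - i));
      __mmask16 m = _mm512_cmpgt_epu32_mask (rem, iota);
      __m512i x = _mm512_maskz_loadu_epi32 (m, src + i);
      _mm512_mask_storeu_epi32 (dst + i, m, x);
    }
}

Computing %k2 this way every iteration sits on the critical path, which is
why the generated code keeps the scalar counter for the branch instead.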