https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
On a related (off-topic) note, we see %kN register pressure issues, mainly in
cases where packing/unpacking is required due to different data sizes.  IMO the
ISA missed the chance to allow something like

        vpcmpd  $4, %zmm2, %zmm0, %k0
        vmovdqu64       %zmm1, (%rsi,%rax,2){%k0[0]}
        vmovdqu64       %zmm1, 64(%rsi,%rax,2){%k0[1]}

to use an (aligned) portion of a mask with a larger number of elements.
Instead we have to do something like

        kshiftrw        $8, %k0, %k1

The reverse, packing multiple %k registers into a single %k for larger
elements, probably cannot be done without a kunpckbw or the like; a vpcmp
writing to a sub-part of %k0 would likely be awkward.
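
For reference, at the intrinsics level the split looks roughly like the
following (a minimal sketch with made-up names; whether the compiler emits
kshiftrw or goes through GPRs depends on the surrounding code):

#include <immintrin.h>

/* A 32-bit element compare yields a 16-bit mask, but each 64-bit
   element store only consumes 8 of those bits, so the upper half has
   to be shifted down into a second mask register.  */
void
store_if_neq (long long *dst, __m512i idx, __m512i bound, __m512i val)
{
  /* 16-bit mask from the 32-bit compare (vpcmpd $4 above).  */
  __mmask16 k0 = _mm512_cmp_epi32_mask (idx, bound, _MM_CMPINT_NE);
  /* The low 8 bits mask the first 64-bit element store ...  */
  _mm512_mask_storeu_epi64 (dst, (__mmask8) k0, val);
  /* ... and the high 8 bits have to be moved down first
     (kshiftrw $8, %k0, %k1), consuming an extra mask register.  */
  __mmask8 k1 = (__mmask8) _kshiftri_mask16 (k0, 8);
  _mm512_mask_storeu_epi64 (dst + 8, k1, val);
}

The second mask register is needed only because the 16-bit compare result
cannot be consumed 8 bits at a time by the stores.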

We're also trying to mimic SVE/RVV by fully masking loops, which requires
computing the loop mask from the remaining scalar iterations.  We're doing

        leal    -16(%rdx), %ecx
        vpbroadcastd    %ecx, %zmm1
        vpcmpud $6, %zmm2, %zmm1, %k2

but keeping a separate scalar loop control because the above has quite high
latency if you follow it with a ktest + branch.  %zmm2 is just
{ 0, 1, 2, 3, 4, 5, ... } and %rdx/%ecx hold the remaining scalar iterations.
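
At the intrinsics level the loop shape is roughly the following (a minimal
sketch, not the exact GCC codegen; the function name and the exact compare
predicate/offset are illustrative):

#include <immintrin.h>

/* Fully masked loop sketch: the per-iteration mask is computed by
   comparing an iota vector against the remaining scalar iterations
   (assuming the trip count fits a 32-bit lane), while the loop itself
   is still controlled by a scalar counter.  */
void
add1_masked (int *a, long n)
{
  const __m512i iota = _mm512_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7,
                                          8, 9, 10, 11, 12, 13, 14, 15);
  for (long i = 0; i < n; i += 16)
    {
      /* Broadcast the remaining iterations and compare against the
         iota vector (the vpbroadcastd + vpcmpud above); lane j is
         active iff j < n - i.  */
      __m512i rem = _mm512_set1_epi32 ((int) (n - i));
      __mmask16 m = _mm512_cmp_epu32_mask (iota, rem, _MM_CMPINT_LT);
      __m512i v = _mm512_maskz_loadu_epi32 (m, a + i);
      v = _mm512_add_epi32 (v, _mm512_set1_epi32 (1));
      _mm512_mask_storeu_epi32 (a + i, m, v);
    }
}

The scalar "i < n" loop control keeps the broadcast + compare + ktest +
branch chain off the critical path, which is the point made above.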
