https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631
--- Comment #11 from Peter Cordes <pcordes at gmail dot com> ---
(In reply to Richard Biener from comment #10)
> On a related (off-topic) note we see %kN register pressure issues, mainly in
> cases where packing/unpacking is required due to different data-sizes.
> [...]
> the reverse, packing of multiple %k to a single larger element %k ...
Agreed, partial-register reads are basically fine and would have been nice to
have: in hardware it's still just a read plus a shift, unlike partial-register
writes, which turn into RMW merges. (It's very unlikely they'd go for the
craziness of partial-register renaming like in P6, and still somewhat in
SnB-family.) And with only a couple of possibilities for the shift count, it's
not as expensive as a full barrel shifter.
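For comparison, with the current ISA the unpack costs an instruction and an
extra %k register before the chunk can even be used. A rough sketch of what I
mean (my own illustration; the register choices and the masked store are
arbitrary), for a 64-bit byte-compare mask consumed 16 elements at a time by
dword-element ops:

    kshiftrq   $16, %k1, %k2          # second 16-element chunk of the 64-bit mask
    vmovdqu32  %zmm0, 64(%rdi){%k2}   # consume it as a 16-element dword mask

A partial-read encoding would fold that kshift into the consumer and free up
%k2.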
You'd need a few bits in the machine code to signal an index, aka chunk-shift
count. 3 bits would be enough to select any aligned 8-element chunk of a
64-bit kmask. APX did find more spare bits in EVEX prefixes, but only in
64-bit mode, and only years later, after there had been time to see that
repurposing them wasn't causing problems for toolchains or decoders.
AVX-512 was originally designed for Larrabee, which became Xeon Phi: first a
literal GPU, then a compute accelerator. Those chips only supported 32- and
64-bit elements and only full-width (ZMM) vectors. I guess it's not too rare
to mix float and double, or i32 and i64, or i32 and double. But combinations
like u8 with float or double, with a factor of 4 or 8 between element sizes,
couldn't happen there. Still, even 1 bit to select either half of a mask would
probably be all you need most of the time. So it's unfortunate we don't have
it, but I can see some historical reasons that biased the design away from
that.
> We're also trying to mimic SVE/RVV by fully masking loops which requires
> to compute the loop mask from remaining scalar iterations. We're doing
>
> leal -16(%rdx), %ecx
> vpbroadcastd %ecx, %zmm1
> vpcmpud $6, %zmm2, %zmm1, %k2
>
> but having a separate scalar loop control because the above is quite high
> latency if you'd follow that with a ktest + branch. %zmm2 is just
> { 0, 1, 2, 3, 4, 5, ... } and %rdx/%ecx the remaining scalar iterations.
Yeah, AVX-512 is awkward for this compared to more recent architectures that
have first-class support for generating loop masks like this.
If we can spare 2 vector registers across the loop, a single loop-carried
VPADDD or VPSUBD doing v -= 16 can replace the LEA / VPBROADCASTD to set up
for the compare-into-mask: a fully separate dep chain computing the same thing
as the scalar loop condition. But under register pressure, falling back to the
LEA / VPBROADCASTD version is probably better than reloading a loop-invariant
every iteration, and definitely better than spilling/reloading.
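Roughly like this (my own untested sketch; the register choices and the .Lc16
constant are assumptions, and I haven't worked out exactly how the initial
value lines up with the scalar count):

    # hoisted out of the loop:
    vpbroadcastd  %ecx, %zmm1             # remaining - 16, same value the lea computed
    vpbroadcastd  .Lc16(%rip), %zmm3      # vector of 16s, stays live across the loop
    # loop body, replacing the lea + vpbroadcastd pair:
    vpcmpud       $6, %zmm2, %zmm1, %k2   # same compare-into-mask as before
    ...
    vpsubd        %zmm3, %zmm1, %zmm1     # v -= 16, independent of the scalar counter

The %k consumer then only depends on that short vector chain instead of on the
scalar counter every iteration.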
For long-running loops, an extra few cycles of latency once at the end might
not be critical if we save any cycles by trimming the per-iteration throughput
cost of the loop overhead. Otherwise yeah, the last iteration of a long-running
loop does usually mispredict, so letting OoO exec reach the exit condition
quickly is good. Front-end cycles are wasted on the wrong path until the
front-end gets re-steered, so fewer front-end uops in the loop aren't helping
the surrounding code during that window, at least not code after the loop. So
a separate scalar loop condition does make sense.
I wonder if there's something we can do with bzhi(-1, remaining) / kmov...
Probably not without more instructions: BZHI does saturate the bit-index, but
it only looks at the low 8 bits of the index operand, not all 32 or 64, so a
remaining count of 256 or more would wrap to a small index instead of
producing an all-ones mask.
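For what it's worth, the idea I had in mind (my own untested sketch; register
choices are arbitrary, and it assumes the remaining scalar count is in %edx,
i.e. the low half of the %rdx from the example above):

    movl    $-1, %eax
    bzhil   %edx, %eax, %eax      # low min(remaining, 32) bits of %eax set, if remaining < 256
    kmovw   %eax, %k2             # all-ones mask until the final partial vector

With 16 dword lanes that gives an all-ones %k2 whenever at least 16 elements
remain, and the right partial mask on the last iteration. The failure case is
a remaining count of 256+n: BZHI would see n and wrongly produce a partial
mask mid-loop, so you'd need to clamp the count first, i.e. the extra
instructions.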