https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Peter Cordes from comment #11)
> (In reply to Richard Biener from comment #10)
[...]
> > We're also trying to mimic SVE/RVV by fully masking loops which requires
> > to compute the loop mask from remaining scalar iterations.  We're doing
> > 
> >         leal    -16(%rdx), %ecx
> >         vpbroadcastd    %ecx, %zmm1
> >         vpcmpud $6, %zmm2, %zmm1, %k2
> > 
> > but having a separate scalar loop control because the above is quite high
> > latency if you'd follow that with a ktest + branch.  %zmm2 is just
> > { 0, 1, 2, 3, 4, 5, ... } and %rdx/%ecx the remaining scalar iterations.
> 
> Yeah, AVX-512 is awkward for this compared to the more recent architectures
> with more support for stuff like this.
> 
> If we can spare 2 vectors across the loop, a single loop-carried VPADDD or
> VPSUBD doing v -= 16 can replace LEA / VPBROADCASTD to set up for the
> compare-into-mask.  So a fully separate dep chain computing the same thing
> as the scalar loop condition.  But under register pressure, falling back to
> this method is probably better than reloading a loop-invariant, and definitely
> better than spilling/reloading.
> 
> For long-running loops, an extra few cycles once at the end might not be
> critical if we save any cycles from avoiding the throughput cost of the loop
> overhead.  Otherwise yeah, the last iteration of a long-running loop does
> usually mispredict so letting OoO exec get there quickly is good.  Front-end
> cycles are wasted on the wrong path until it can get re-steered, so fewer
> front-end uops isn't helping surrounding code, at least not after the loop. 
> So a separate scalar loop condition does make sense.
> 
> I wonder if there's something we can do with bzhi(-1, remaining) / kmov... 
> Probably not without more instructions; BZHI does saturate the bit-index,
> but only looks at the low 8 bits of the source, not all 32 or 64.
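
For reference, a minimal sketch of that bzhi/kmov idea (hypothetical
helper, not something GCC emits) - and it indeed needs an extra clamp
precisely because BZHI reads only the low 8 bits of its index operand:

  #include <immintrin.h>

  /* Build a 16-lane tail mask from the remaining scalar iteration count.
     BZHI saturates the bit index at the operand size, but it consults only
     the low byte of the index, so counts >= 256 would wrap without the
     clamp.  Compile with -mavx512f -mbmi2.  */
  static inline __mmask16
  tail_mask16 (unsigned long remaining)
  {
    unsigned int idx = remaining > 255 ? 255 : (unsigned int) remaining;
    return (__mmask16) _bzhi_u32 (0xffffffffu, idx);
  }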

One reason for the scalar variable and the broadcast is that we are
computing the mask for the vector type with the most elements, which
means for V64QI we only have unsigned char for the counter, so the actual
thing we do is

  ivtmp_83 = ivtmp_82 - 64;
  _84 = MIN_EXPR <ivtmp_83, 64>;
  _85 = (unsigned char) _84;
  _86 = {_85, _85, ... };
  _91 = { 0, 1, 2, 3, ... } < _86;
  if (ivtmp_82 > 64)
    continue;

where ivtmp_83 is the remaining number of scalar iterations.  We do resort
to computing multiple masks and packing them in case the unroll factor does
not allow representing the capped remaining iteration count in the counter
type, since we then have to increase the compare vector element size.
SVE's while.ult is so much nicer for this - I suppose AVX 10.x could add
GPR compare-to-mask instructions (with variants for b, w and d; not sure
about two- and four-bit masks).  But yes, if we know the iteration count
fits the natural type we use, we could use two vector registers and elide
the broadcast and the GPR operation.
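
The following sketch shows that two-register variant with intrinsics
(hypothetical example loop, not current GCC output): the loop-carried
VPSUBD replaces the per-iteration LEA + VPBROADCASTD, assuming the trip
count fits the 32-bit compare element type.

  #include <immintrin.h>

  /* Fully masked copy loop for V16SI; compile with -mavx512f.  */
  void
  masked_copy (int *dst, const int *src, unsigned int n)
  {
    const __m512i iota = _mm512_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7,
                                            8, 9, 10, 11, 12, 13, 14, 15);
    const __m512i vf = _mm512_set1_epi32 (16);
    /* Loop-carried vector of the remaining scalar iteration count.  */
    __m512i remain = _mm512_set1_epi32 ((int) n);

    /* Separate scalar loop control, as discussed above.  */
    for (unsigned long i = 0; i < n; i += 16)
      {
        /* Lane j is active iff j < remaining scalar iterations.  */
        __mmask16 k = _mm512_cmplt_epu32_mask (iota, remain);
        __m512i v = _mm512_maskz_loadu_epi32 (k, src + i);
        _mm512_mask_storeu_epi32 (dst + i, k, v);
        /* v -= 16 on the vector side instead of re-broadcasting the IV.  */
        remain = _mm512_sub_epi32 (remain, vf);
      }
  }

The cost is two extra vectors (iota and remain) live across the loop, which
is exactly the register-pressure trade-off mentioned in comment #11.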
