Re: [PATCH][RFC] Allow the target to request a masked vector epilogue

Richard Biener Tue, 20 May 2025 01:06:32 -0700

On Mon, 19 May 2025, Richard Sandiford wrote:

Richard Biener <rguent...@suse.de> writes:

On Fri, 16 May 2025, Richard Sandiford wrote:

The simple prototype below uses a separate flag from the epilogue
mode, but I wonder how we want to handle, more generally,
whether to use masking or not when iterating over modes.  Currently
we mostly rely on --param vect-partial-vector-usage.  aarch64
and riscv have both variable-length and fixed-size modes,
where for the latter, like on x86, the target can't request
a mode specifically with or without masking.  It seems both
aarch64 and riscv fully rely on cost comparison and fully
exploiting the mode iteration space (but not masked vs. non-masked?!)
here?


I was thinking of adding a vectorization_mode class that would
encapsulate the mode and whether to allow masking or alternatively
to make the vector_modes array (and the m_suggested_epilogue_mode)
a std::pair of mode and mask flag?
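
To make the two options concrete, here is a minimal standalone sketch,
not GCC code.  Everything in it is hypothetical: toy_machine_mode stands
in for machine_mode, and the member names and the expand_candidates
helper are made up for illustration.  It only shows what pairing a mode
with a masking flag could look like, and how the iteration space would
double if both variants of every mode were tried.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for GCC's machine_mode: just the mode name.
using toy_machine_mode = std::string;

// Option 1 (hypothetical): a small class bundling mode and masking flag.
class vectorization_mode {
public:
  vectorization_mode(toy_machine_mode mode, bool allow_masking)
    : m_mode(std::move(mode)), m_allow_masking(allow_masking) {}

  const toy_machine_mode &mode() const { return m_mode; }
  bool allow_masking_p() const { return m_allow_masking; }

private:
  toy_machine_mode m_mode;
  bool m_allow_masking;
};

// Option 2 (hypothetical): the same information as a plain pair.
using mode_and_mask = std::pair<toy_machine_mode, bool>;

// Illustration only: list each mode once without and once with masking,
// in that order, so masked vs. non-masked becomes part of the iteration.
std::vector<vectorization_mode>
expand_candidates(const std::vector<toy_machine_mode> &modes) {
  std::vector<vectorization_mode> out;
  for (const auto &m : modes) {
    out.emplace_back(m, false);
    out.emplace_back(m, true);
  }
  return out;
}
```

The class costs a little more code than the pair but gives the flag a
name at every use, which may matter once more properties get attached.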


Predicated vs. non-predicated SVE is interesting for the main loop.
The class sounds like it would be useful for that.

I suppose predicated vs. non-predicated SVE is also potentially
interesting for an unrolled epilogue, although there, it would in
theory be better to predicate only the last vector iteration
(i.e. part predicated, part unpredicated).


Yes, the latter is what we want for AVX512, keep the main loop
not predicated but have the epilog predicated (using the same VF).


Reading it back, what I said was very ambiguous (as usual, unfortunately).
What I actually meant was that if we had, say, a 4x unrolled main loop
and a 2x unrolled first epilogue loop, we'd in theory want the 2x
unrolled epilogue loop to use unpredicated operations for the first
VF/2 elements and predicated operations for the second VF/2 elements.

That way, we get the benefit of the 2x unrolling for residues of >VF
elements, but skip to a second epilogue if there are VF or fewer
remaining elements.

That example assumes that the last quarter of each iteration of the
main loop is predicated in a similar way, with the rest of the iteration
being unpredicated.


Yes, so this would work by requesting a fixed-size VF/2 first epilog
and a VF/2 fixed-size but masked second epilog.  As you have distinct
modes for masked/non-masked this should already work by means of the
m_suggested_epilogue_mode field in the costs the target can set.

Alternatively, we could have a fully-unpredicated 2x unrolled main
loop followed by the same kind of semi-predicated 2x unrolled
epilogue loop.

So if U == unpredicated and P == predicated:

         main loop: U U U P
 1st epilogue loop: U P
 2nd epilogue loop: P

 1st and 2nd epilogues might both be used

or:

         main loop: U U
 1st epilogue loop: U P
 2nd epilogue loop: P

 1st and 2nd epilogues are mutually exclusive

although the epilogues don't need to loop in either case.
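
The difference between the two layouts can be checked with a small
standalone model.  This is only one possible reading of the diagrams
above, under stated assumptions: v is the number of elements per vector,
the U U U P main loop keeps iterating while more than 3*v elements
remain, the U U main loop iterates while at least 2*v remain, and the
U P epilogue is entered only when more than v elements remain.  The
struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Which loops execute for a given trip count (hypothetical model).
struct epilogue_plan {
  unsigned main_iters;   // iterations of the main loop
  bool first_epilogue;   // the "U P" epilogue ran
  bool second_epilogue;  // the "P" epilogue ran
};

// Layout 1: main loop U U U P, then U P, then P.
// Assumption: the main loop runs while more than 3*v elements remain,
// so its predicated last vector always has at least one active lane.
epilogue_plan plan_partly_predicated_main(unsigned n, unsigned v) {
  assert(v > 0);
  epilogue_plan p{0, false, false};
  while (n > 3 * v) {
    n -= std::min(n, 4 * v);
    ++p.main_iters;
  }
  if (n > v) {            // up to 3*v left: U P covers at most 2*v
    n -= std::min(n, 2 * v);
    p.first_epilogue = true;
  }
  if (n > 0)              // at most v left: P covers it
    p.second_epilogue = true;
  return p;
}

// Layout 2: main loop U U, then U P, then P.
// Assumption: the main loop runs while at least 2*v elements remain.
epilogue_plan plan_unpredicated_main(unsigned n, unsigned v) {
  assert(v > 0);
  epilogue_plan p{0, false, false};
  while (n >= 2 * v) {
    n -= 2 * v;
    ++p.main_iters;
  }
  if (n > v)              // fewer than 2*v left: U P covers all of it
    p.first_epilogue = true;
  else if (n > 0)         // at most v left: P covers it
    p.second_epilogue = true;
  return p;
}
```

With v = 4, a trip count of 11 runs both epilogues in the first layout,
while in the second layout it runs one main iteration and then only the
P epilogue.  No epilogue iterates in either model.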

So I suppose unpredicated SVE epilogue loops might be interesting
until that partial predication is implemented, but I'm not sure how
useful unpredicated SVE epilogue loops would be "once" the partial
predication is supported.

I don't imagine we'll often know a priori for AArch64 which type
of vector epilogue is best.  Since switching between SVE and
Advanced SIMD is assumed to be essentially free, I think we'll
still rely on the current approach of costing both and seeing
which is cheaper.


So the other case we might run into on x86 is if you have a
known loop tripcount but fully vectorizing the epilogue is
still not possible because while we have half-SSE, like V8QImode,
we don't have V4QI or V2QI, so even with multiple epilogues
we'd still end up with an iterating scalar epilog.  Those
cases might be good candidates for a predicated epilog as well.
So in the end we'd prefer branchless epilogues.
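
A rough standalone model of that situation, not vectorizer code: the
main loop iterates, every following epilogue executes at most once, an
unmasked epilogue only runs when a full vector's worth of iterations
remains, and a masked epilogue handles whatever remains up to its lane
count.  The loop_desc struct and the function name are hypothetical;
only the lane counts (V16QI, V8QI, no V4QI/V2QI) come from the example
above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one vector loop.
struct loop_desc {
  unsigned lanes;  // elements processed per vector iteration
  bool masked;     // true if the loop may run with a partial vector
};

// loops[0] is the (unmasked, iterating) main loop; the remaining
// entries are epilogues that execute at most once each.  Returns the
// number of iterations left for an iterating scalar epilogue.
unsigned scalar_iterations_left(unsigned niters,
                                const std::vector<loop_desc> &loops) {
  assert(!loops.empty() && loops.front().lanes > 0);
  unsigned rem = niters % loops.front().lanes;
  for (std::size_t i = 1; i < loops.size(); ++i) {
    const loop_desc &l = loops[i];
    if (l.masked)
      rem -= std::min(rem, l.lanes);  // partial vector covers the tail
    else if (rem >= l.lanes)
      rem -= l.lanes;                 // needs a full vector to run
  }
  return rem;
}
```

For a trip count of 23 with a V16QI main loop and a V8QI epilogue,
the model leaves 7 scalar iterations; appending a masked V8QI epilogue
leaves none.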


Yeah, branchless is also the aim of the schemes above.

Predication on x86 is quite a bit more expensive so I don't see
us using a predicated main vector loop anytime soon, and I'd
expect that to be the case for all archs when using a fixed-size
mode?  Is that the case for -msve-vector-bits=X as well?  Is
there an advantage to not using a predicated main vector loop?


I think it depends on the size of the loop.  I've seen large HPC
loops for which the overhead of predication and loop control is
subsumed by the inherent complexity of the work, and duplicating
the whole thing would probably be counterproductive.

But yeah, for tighter loops, SVE should benefit from unpredicated
main loops too.


I'll go for encapsulating the vectorization mode in a class given
Tamar's feedback.  For now I'll not change the vector_modes array
contents but only the m_suggested_epilogue_mode target field.

Richard.

Thanks,
Richard
