[Bug tree-optimization/123225] [16 Regression] Overly-aggressive vectorization of uncounted loops

rguenth at gcc dot gnu.org via Gcc-bugs Wed, 14 Jan 2026 06:57:59 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123225


--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Victor Do Nascimento from comment #9)
> > I wonder if for now (w/o the ability to elide the epilog, w/o the ability
> > to use first-fault loads) we should restrict this to PGO when we have
> > a more reliable expected iteration count to work with?  Though as we
> > do not have a histogram of actual loop iterations an estimated count
> > of 10 can result from a mix of 1 and 20 loop iterations ...
> > 
> > Plus eventually handling loops marked as force_vectorize (we do not
> > yet have a #pragma users can use, but OMP SIMD marks loops this way).
> 
> Yes, I do think that the poor handling of both prologue and epilogue at
> present severely hurts the usefulness of this approach. As for the prologue,
> AArch64 targets with SVE can considerably counter the performance hit by
> implementing masking for alignment.  This, in particular, is something I am
> working on as a follow-up to this work and will be looking to submit once we
> are back in stage 1.

Masking for alignment should work for all targets that can use a predicated
loop, including x86 and RISC-V.
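
As a rough illustration of the idea, the following is a scalar model in C of
masking for alignment, not GCC's implementation.  The names
find_zero_masked and VF, and the zero-terminated search loop itself, are
assumptions made for the sketch.  Instead of peeling scalar iterations until
the pointer is aligned, the first block starts at the preceding aligned
boundary and simply leaves the leading lanes inactive.  The sketch assumes
the buffer stays readable up to the end of the aligned block that contains
the terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VF 4  /* hypothetical vectorization factor (lanes per block) */

/* Scalar model of a predicated loop that uses masking for alignment.
   Returns the index of the first zero element in p.  */
size_t find_zero_masked(const int *p)
{
    assert(p != NULL);

    /* Number of leading lanes to disable so that the first block
       begins on a VF-element boundary.  */
    size_t skip = ((uintptr_t)p / sizeof(int)) % VF;
    size_t first = skip;  /* first active lane of the current block */

    for (size_t blk = 0;; blk += VF) {
        for (size_t lane = first; lane < VF; lane++) {
            /* Index relative to p; never negative because lane >= skip
               in the first block.  Masked-off lanes are not read.  */
            size_t idx = blk + lane - skip;
            if (p[idx] == 0)
                return idx;
        }
        first = 0;  /* every later block is fully active */
    }
}
```

Only the first block pays for the misalignment, through inactive lanes, and
no separate scalar prologue loop is needed.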

For GCC 16 we can consider adding a new --param so targets could opt to
disable uncounted loop vectorization altogether.  I somehow had the
impression that we'd land the code avoiding the scalar epilog re-doing
the last vector iteration as well, but that didn't materialize.  Without
that, profitability is even worse for high VF.  The alignment prologue
shouldn't be too bad in practice for loops that are not too small; it's
really the epilog, where we end up doing things twice, that hurts for low
iteration counts.
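
To make the epilog cost concrete, here is a minimal scalar model in C of the
loop shape under discussion, not GCC's actual code generation.  The function
names, the VF constant and the zero-terminated search loop are illustrative
assumptions.  The blocked variant also assumes the buffer stays readable up
to the next multiple of VF past the terminator, which is what the alignment
prologue (or first-fault loads) would have to guarantee in practice.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* hypothetical vectorization factor (lanes per block) */

/* Uncounted loop: the trip count is only known once the terminating
   element has been seen.  */
size_t find_zero_scalar(const int *p)
{
    assert(p != NULL);
    size_t i = 0;
    while (p[i] != 0)
        i++;
    return i;
}

/* Model of the vectorized shape: each "vector iteration" tests VF lanes
   at once.  When any lane hits the exit condition, control leaves the
   vector loop and a scalar epilog re-does that same block to find the
   exact exit point.  *epilog_checks receives the number of elements the
   epilog examined.  */
size_t find_zero_blocked(const int *p, size_t *epilog_checks)
{
    assert(p != NULL && epilog_checks != NULL);
    size_t i = 0;

    for (;;) {
        int any_exit = 0;
        for (size_t lane = 0; lane < VF; lane++)
            any_exit |= (p[i + lane] == 0);  /* all VF lanes are read */
        if (any_exit)
            break;      /* the block containing the exit is abandoned */
        i += VF;
    }

    /* Scalar epilog: starts again from the first lane of the last block. */
    size_t checks = 0;
    for (;;) {
        checks++;
        if (p[i] == 0)
            break;
        i++;
    }
    *epilog_checks = checks;
    return i;
}
```

In this model a loop that exits on its very first element still tests VF
lanes in the vector block and then one more element in the epilog, against
a single test in the scalar loop, and the overhead of the abandoned block
grows with VF.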
