https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123225

--- Comment #12 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 14 Jan 2026, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123225
> 
> Tamar Christina <tnfchris at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |tnfchris at gcc dot gnu.org
> 
> --- Comment #11 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #10)
> > (In reply to Victor Do Nascimento from comment #9)
> > > > I wonder if for now (w/o the ability to elide the epilog, w/o the ability
> > > > to use first-fault loads) we should restrict this to PGO when we have
> > > > a more reliable expected iteration count to work with?  Though as we
> > > > do not have a histogram of actual loop iterations an estimated count
> > > > of 10 can result from a mix of 1 and 20 loop iterations ...
> > > > 
> > > > Plus eventually handling loops marked as force_vectorize (we do not
> > > > yet have a #pragma users can use, but OMP SIMD marks loops this way).
> > > 
> > > Yes, I do think that the poor handling of both prologue and epilogue at
> > > present severely hurts the usefulness of this approach. As for the prologue,
> > > AArch64 targets with SVE can considerably counter the performance hit by
> > > implementing masking for alignment.  This, in particular, is something I am
> > > working on as a follow-up to this work and will be looking to submit once we
> > > are back in stage 1.
> > 
> > Masking for alignment should work for all targets that can use a predicated
> > loop, including x86 and risc-v.
> > 
> > For GCC 16 we can consider adding a new --param so targets could opt to
> > disable uncounted loop vectorization altogether.  I somehow had the
> > impression that we'd land the code avoiding the scalar epilog re-doing
> > the last vector iteration as well, but that didn't materialize.  Without
> > that profitability is even worse for high VF.  The alignment prologue
> > shouldn't be too bad in practice for not too small loops, it's really
> > the epilog where we end up doing things twice that hurts for low iteration
> > counts.
> 
> Simple cases as the above can avoid the epilogue quite easily. During analysis
> of the loop we just have to determine if there are any non-early break forced
> IVs.
> 
> If not, the epilogue isn't needed and the code that forces the epilogue can just
> be turned off. After which the loop won't be peeled and the exits are fine.
> 
> What delayed this is when you DO have a live value, for which you then need to
> do mask-based reductions, which triggers a bunch of other issues to deal with.
> 
> So rather than restricting to PGO we could just handle the cases above and
> restrict uncounted loops to cases that don't require a forced epilogue.
> 
> That way when I finish the reductions next stage1 it just works.
> 
> The patches for the above are on my work machine, but I won't be back till the
> 23rd.
> 
> If you agree, I can extract them from the series and send them.

Would be nice to have those on record indeed.
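
To make the two cases from comment #11 concrete, here is a minimal sketch of
what the loop shapes might look like.  The function names are hypothetical and
not taken from the PR's testcase, and nothing here asserts what GCC actually
does with either loop; it only illustrates "no live value" versus "live value"
in an uncounted loop.

```c
#include <stddef.h>

/* Uncounted loop with nothing live after it: the only exit is the
   data-dependent test, and no scalar computed in the loop is read
   once the loop has finished.  */
void scale_until_zero(int *dst, const int *src)
{
    for (size_t i = 0; src[i] != 0; i++)
        dst[i] = src[i] * 2;
}

/* Uncounted loop with a live value: `sum` is read after the loop, so a
   vectorized version must make sure lanes past the exit do not
   contribute to the reduction.  */
long sum_until_zero(const int *src)
{
    long sum = 0;
    for (size_t i = 0; src[i] != 0; i++)
        sum += src[i];
    return sum;
}
```

The first shape is the kind of "simple case" the quoted text describes; the
second is the kind that would need the reduction work mentioned there.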
