https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123225
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Victor Do Nascimento from comment #9)
> > I wonder if for now (w/o the ability to elide the epilog, w/o the ability
> > to use first-fault loads) we should restrict this to PGO when we have
> > a more reliable expected iteration count to work with?  Though as we
> > do not have a histogram of actual loop iterations an estimated count
> > of 10 can result from a mix of 1 and 20 loop iterations ...
> >
> > Plus eventually handling loops marked as force_vectorize (we do not
> > yet have a #pragma users can use, but OMP SIMD marks loops this way).
>
> Yes, I do think that the poor handling of both prologue and epilogue at
> present severely hurt the usefulness of this approach.  As for the prologue,
> AArch64 targets with SVE can considerably counter the performance hit by
> implementing masking for alignment.  This, in particular, is something I am
> working on as a follow up to this work and will be looking to submit once we
> are back in stage 1.

Masking for alignment should work for all targets that can use a predicated
loop, including x86 and RISC-V.

For GCC 16 we can consider adding a new --param so targets could opt to
disable uncounted loop vectorization altogether.  I somehow had the impression
that we'd land the code avoiding the scalar epilog re-doing the last vector
iteration as well, but that didn't materialize.  Without that, profitability
is even worse for high VF.

The alignment prologue shouldn't be too bad in practice for not too small
loops; it's really the epilog, where we end up doing things twice, that hurts
for low iteration counts.
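
To make the "doing things twice" cost concrete, here is a rough scalar model in C of an uncounted search loop and of a blocked form whose scalar epilog restarts at the beginning of the block in which the exit was detected. This is only a sketch and not what the vectorizer emits: VF, find_zero_scalar and find_zero_blocked are made-up names for illustration, the inner lane loop stands in for one vector compare, and the buffer is assumed to be padded to a multiple of VF so that reading a whole block is safe (the problem that alignment peeling, masking or first-fault loads would otherwise have to solve).

```c
#include <assert.h>
#include <stddef.h>

#define VF 4 /* hypothetical vectorization factor, chosen for illustration */

/* Scalar reference: an uncounted loop whose trip count is unknown on entry. */
size_t find_zero_scalar(const int *a)
{
    assert(a != NULL);
    size_t i = 0;
    while (a[i] != 0)
        i++;
    return i;
}

/* Model of the blocked form.  Assumes the buffer is padded so that
   reading a whole block of VF elements never runs off the end.
   *compares receives the number of element compares performed. */
size_t find_zero_blocked(const int *a, size_t *compares)
{
    assert(a != NULL && compares != NULL);
    size_t i = 0, work = 0;

    /* "Vector" loop: one block of VF lanes per iteration. */
    for (;;) {
        int hit = 0;
        for (size_t lane = 0; lane < VF; lane++)
            hit |= (a[i + lane] == 0);
        work += VF;
        if (hit)
            break;          /* exit seen somewhere in this block */
        i += VF;
    }

    /* Scalar epilog: restarts at the first lane of the block that hit,
       so those elements are compared a second time. */
    for (;;) {
        work++;
        if (a[i] == 0)
            break;
        i++;
    }

    *compares = work;
    return i;
}
```

In this model, a terminator at index 1 costs 6 lane compares against 2 for the plain scalar loop, and a terminator at index 5 costs 10 against 6. The counts are lane compares, not cycles, but they show why the repeated block dominates at low iteration counts and gets worse as VF grows.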
