https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122028

--- Comment #2 from Robin Dapp <rdapp at gcc dot gnu.org> ---
One interesting thing I noticed is that by now, at least for the VF=4 512-bit
vector case, it's pretty simple to achieve the result I want:

As we don't directly support the load permutation we fall back to
VMAT_ELEMENTWISE anyway.  For VMAT_ELEMENTWISE we can check if the load
permutation is consecutive, just as we do for VMAT_STRIDED_SLP, and "fall back"
to a grouped gather.  That's exactly what's needed already.
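
To illustrate the idea, here is a rough scalar model in C.  It is only a
sketch and not the vectorizer's actual code: perm_consecutive_p and
grouped_gather are hypothetical names, and the int32_t element type is an
assumption.  The first function checks that a load permutation selects
consecutive elements of the group; the second models what a grouped gather
would then do, namely fetch that consecutive chunk once per lane at a fixed
stride.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a GCC function): return true if
   perm[0..n-1] is FIRST, FIRST + 1, ..., FIRST + n - 1.  */
bool
perm_consecutive_p (const unsigned *perm, size_t n)
{
  if (n == 0)
    return false;
  for (size_t i = 1; i < n; i++)
    if (perm[i] != perm[0] + i)
      return false;
  return true;
}

/* Scalar model of a grouped gather (assumed semantics): for each lane,
   copy GROUP_SIZE consecutive elements starting at element offset
   LANE * STRIDE + FIRST from BASE into DST.  STRIDE is in elements.  */
void
grouped_gather (int32_t *dst, const int32_t *base, size_t stride,
                size_t first, size_t group_size, size_t lanes)
{
  for (size_t lane = 0; lane < lanes; lane++)
    memcpy (dst + lane * group_size,
            base + lane * stride + first,
            group_size * sizeof *dst);
}
```

With a load permutation of { 2, 3 } in a group of four, the check succeeds
and each lane can fetch its two elements as one wider piece instead of two
separate scalar loads.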

For smaller vectors the best vectorization route is IMHO not as clear, but
whenever we have strided loads we should make use of them.  The issue is just
that we already have a vectorization scheme there, and a fallback is not as
obviously better as in the ELEMENTWISE or STRIDED_SLP case.

I noticed that we now vectorize at VF=2 instead of VF=1, but this
results in slightly worse code as we perform more scalar loads that need to be
moved over to the vector domain.  I think we're also lacking a specific
constant permutation in the backend which would help with the "new" scheme and
might push us towards profitability.  I guess that constitutes a regression,
and we could still fix it in the backend.  I'll look into this next week.
