https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122028
--- Comment #2 from Robin Dapp <rdapp at gcc dot gnu.org> ---
One interesting thing I noticed is that by now, at least for the VF=4 512-bit vector case, it's pretty simple to achieve the result I want: As we don't directly support the load permutation, we fall back to VMAT_ELEMENTWISE anyway. For VMAT_ELEMENTWISE we can check if the load permutation is consecutive, just as we do for VMAT_STRIDED_SLP, and "fall back" to a grouped gather. That's exactly what's needed already.

For smaller vectors the best vectorization route is IMHO not as clear, but whenever we have strided loads we should make use of them. The issue is just that we already have a vectorization scheme, and a fallback is not as obviously better as in the ELEMENTWISE or STRIDED_SLP case.

I noticed that we now vectorize at VF=2 instead of VF=1, but this results in slightly worse code as we perform more scalar loads that need to be moved over to the vector domain. I think we're also lacking a specific const permutation in the backend, which would help with the "new" scheme and might push us towards profitability. I guess that constitutes a regression and we could still fix it in the backend. I'll look into this next week.
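
To illustrate what "the load permutation is consecutive" means in the comment above, here is a minimal standalone sketch in C. It is not the vectorizer's actual code: the function name load_perm_consecutive_p and the plain unsigned-array representation of the permutation are assumptions made for illustration only. The idea is just that each selected index must be exactly one larger than the previous one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not GCC's implementation): return true if the
   load permutation PERM of length N picks consecutive group elements,
   i.e. perm[i] == perm[0] + i for every i.  Permutations of length 0
   or 1 are trivially consecutive.  */
bool
load_perm_consecutive_p (const unsigned *perm, size_t n)
{
  assert (perm != NULL || n == 0);

  for (size_t i = 1; i < n; i++)
    if (perm[i] != perm[0] + i)
      return false;

  return true;
}
```

Under this sketch, a permutation like { 2, 3, 4, 5 } would qualify, while { 0, 2, 1, 3 } or { 0, 0, 1, 1 } would not and would keep the existing handling.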
