https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123603
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Testcase; needs -O2 -ftree-vectorize (a better cost model than 'very-cheap',
since there, while we allow peeling for niter, we do not allow peeling for gaps).
void foo (int *block)
{
  for (int i = 0; i < 3; ++i)
    {
      int a = block[i*9];
      int b = block[i*9+1];
      block[i*9] = a + 10;
      block[i*9+1] = b + 10;
    }
}
Apart from the fact that we fail to consider the SSE/2 version for the
already mentioned reason, we fail to realize that peeling for gaps isn't
necessary if we'd emit short loads. But in this case the odd stride of 9,
together with a VF of 2 and doing contiguous loads, means we load
3 elements in the last meaningful V4SI, which is not a power of two.
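
The "3 elements" figure can be checked with a little arithmetic. The sketch
below is a simplified model, not vectorizer code: it assumes the accessed
elements of one vector iteration are covered by contiguous nunits-wide loads
starting at element 0. The helper names elems_in_last_chunk and is_pow2 are
made up for illustration.

```c
#include <assert.h>

/* Hypothetical helper (not a GCC function).  One vector iteration covers
   VF scalar iterations; each touches the first USED elements of a group
   of STRIDE elements.  Return how many elements a short load would have
   to read from the last NUNITS-wide chunk that still holds accessed data.  */
int
elems_in_last_chunk (int stride, int used, int vf, int nunits)
{
  assert (stride > 0 && used > 0 && used <= stride && vf > 0 && nunits > 0);
  /* Index of the last element accessed in this vector iteration.  */
  int last = (vf - 1) * stride + (used - 1);
  /* Offset of that element inside its chunk, plus one.  */
  return last % nunits + 1;
}

/* Nonzero if N is a power of two.  */
int
is_pow2 (int n)
{
  return n > 0 && (n & (n - 1)) == 0;
}
```

For the testcase (stride 9, 2 used elements, VF 2, V4SI so nunits 4) the last
accessed element is block[10], which sits in the chunk block[8..11], so
elems_in_last_chunk (9, 2, 2, 4) is 3. With an even stride of 8 the same
model gives 2, which is a power of two.
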
Realizing this, and realizing that gap > nunits, we might fall back to strided
operation (VMAT_STRIDED_SLP) in such cases. That would allow for
cheaper vector composition; on x86, a movq + movhps.
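
To illustrate the movq + movhps composition, here is a hand-written sketch
with SSE2 intrinsics of what a strided version of the testcase could compute.
It is not actual GCC output; the function name foo_strided and the scalar
fallback for non-SSE2 targets are assumptions made for the example. The two
64-bit halves of the vector come from block[0..1] and block[9..10], and the
third scalar iteration is left as an epilogue.

```c
#include <assert.h>
#if defined (__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical hand-vectorized variant of foo, VF 2 with a V4SI vector.  */
void
foo_strided (int *block)
{
  assert (block);
#if defined (__SSE2__)
  /* Iterations i = 0 and i = 1: build { b[0], b[1], b[9], b[10] }.  */
  __m128i lo = _mm_loadl_epi64 ((const __m128i *) block);      /* movq */
  __m128 v = _mm_loadh_pi (_mm_castsi128_ps (lo),
                           (const __m64 *) (block + 9));       /* movhps */
  __m128i r = _mm_add_epi32 (_mm_castps_si128 (v), _mm_set1_epi32 (10));
  /* Store the two halves back to their strided locations.  */
  _mm_storel_epi64 ((__m128i *) block, r);
  _mm_storeh_pi ((__m64 *) (block + 9), _mm_castsi128_ps (r));
  /* Iteration i = 2: scalar epilogue.  */
  block[18] += 10;
  block[19] += 10;
#else
  /* Scalar fallback so the sketch still builds without SSE2.  */
  for (int i = 0; i < 3; ++i)
    {
      block[i*9] += 10;
      block[i*9+1] += 10;
    }
#endif
}
```

No load in this sketch reads past block[10] in the vector part, so there is
no gap to peel for.
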