https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101097
rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |rsandifo at gcc dot gnu.org

--- Comment #11 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
FWIW, you could try something similar to how aarch64 handles this for
Advanced SIMD, with a combination of:

- TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES
- TARGET_VECTORIZE_RELATED_MODE

We get the optimal code for these tests on aarch64, even when the loop
vectoriser is used.  E.g.:

void bar (short unsigned int * p1, short unsigned int * p2, int * restrict p3)
{
  vector(4) int vect__11.26;
  vector(4) int vect__8.25;
  vector(4) short unsigned int vect__7.24;
  vector(4) int vect__5.21;
  vector(4) short unsigned int vect__4.20;

  <bb 2> [local count: 214748371]:
  vect__4.20_34 = MEM <vector(4) short unsigned int> [(short unsigned int *)p1_15(D)];
  vect__5.21_35 = (vector(4) int) vect__4.20_34;
  vect__7.24_38 = MEM <vector(4) short unsigned int> [(short unsigned int *)p2_16(D)];
  vect__8.25_39 = (vector(4) int) vect__7.24_38;
  vect__11.26_40 = vect__5.21_35 + vect__8.25_39;
  MEM <vector(4) int> [(int *)p3_17(D)] = vect__11.26_40;
  return;
}

which for -O2 -ftree-vectorize is produced by the loop vectorizer rather than
SLP.
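For readers unfamiliar with these two target hooks: a backend can register the
modes the autovectorizer should try, and answer "related mode" queries so the
vectorizer can pair, say, a 64-bit V4HImode input with a 128-bit V4SImode
result. The sketch below is illustrative only, for a hypothetical target
"foo"; the hook signatures match the GCC internals documentation, but the
mode choices and the foo_* function names are invented, and the code only
compiles inside a GCC backend, not standalone.

/* Sketch only: a hypothetical "foo" backend wiring up the two hooks
   mentioned above.  The hook signatures follow targhooks.h; the mode
   selections are illustrative.  */

static unsigned int
foo_autovectorize_vector_modes (vector_modes *modes, bool)
{
  /* Offer full-width vectors first, then half-width ones, so the
     vectorizer can cost both and pick the cheaper variant.  */
  modes->safe_push (V4SImode);
  modes->safe_push (V2SImode);
  return 0;
}

static opt_machine_mode
foo_vectorize_related_mode (machine_mode vector_mode,
			    scalar_mode element_mode,
			    poly_uint64 nunits)
{
  /* If NUNITS is zero, keep the element count of VECTOR_MODE;
     otherwise look for a vector of exactly NUNITS ELEMENT_MODE
     elements (e.g. V4HImode when asked for 4 x HImode alongside
     V4SImode), falling back to the default handling.  */
  if (known_eq (nunits, 0U))
    nunits = GET_MODE_NUNITS (vector_mode);
  machine_mode mode;
  FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
    if (GET_MODE_INNER (mode) == element_mode
	&& known_eq (GET_MODE_NUNITS (mode), nunits))
      return mode;
  return default_vectorize_related_mode (vector_mode, element_mode, nunits);
}

#undef TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES
#define TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES \
  foo_autovectorize_vector_modes
#undef TARGET_VECTORIZE_RELATED_MODE
#define TARGET_VECTORIZE_RELATED_MODE foo_vectorize_related_mode

With hooks along these lines, the vectorizer can build mixed-size groups like
the V4HI loads feeding V4SI arithmetic in the GIMPLE dump above, instead of
being restricted to a single vector size.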