https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106081
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, I see we actually materialize a permute before the splat:

t.c:14:24: note: node 0x5b311c0 (max_nunits=1, refcnt=2) vector(2) double
t.c:14:24: note: op: VEC_PERM_EXPR
t.c:14:24: note: stmt 0 _1 = *k_50;
t.c:14:24: note: stmt 1 _1 = *k_50;
t.c:14:24: note: stmt 2 _1 = *k_50;
t.c:14:24: note: stmt 3 _1 = *k_50;
t.c:14:24: note: lane permutation { 0[3] 0[2] 0[1] 0[0] }
t.c:14:24: note: children 0x5b30fc0
t.c:14:24: note: node 0x5b30fc0 (max_nunits=2, refcnt=1) vector(2) double
t.c:14:24: note: op template: _1 = *k_50;
t.c:14:24: note: stmt 0 _1 = *k_50;
t.c:14:24: note: stmt 1 _1 = *k_50;
t.c:14:24: note: stmt 2 _1 = *k_50;
t.c:14:24: note: stmt 3 _1 = *k_50;
t.c:14:24: note: load permutation { 0 0 0 0 }

this is because vect_optimize_slp_pass::get_result_with_layout doesn't
seem to consider load permutations? It's the value-numbering we perform
that in the end elides the redundant permute.

I have the feeling something doesn't fit together exactly, during
materialize () we apply the chosen layout to the partitions, then
eliminate redundant permutes but we still end up with
get_result_with_layout adding the above permute. I can add the same
logic as I've added to change_layout_cost also to get_result_with_layout
but as said, it feels like I'm missing something ...
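
To illustrate why the permute in the dump is redundant, here is a minimal sketch in Python. It is not GCC code and does not model get_result_with_layout or change_layout_cost; the function names are hypothetical, and it assumes a lane permutation can be treated as a plain list of source lane indices applied to a single child. It only shows that a reversal applied to a splat load permutation leaves it unchanged.

```python
# Hypothetical helpers, not part of GCC: model permutations as index lists.

def apply_lane_permutation(load_perm, lane_perm):
    """Return the load permutation seen after a lane shuffle.

    lane_perm[i] is the input lane that feeds output lane i.
    """
    return [load_perm[src] for src in lane_perm]


def is_splat(load_perm):
    """True if every lane reads the same element."""
    return len(set(load_perm)) <= 1


def permute_is_redundant(load_perm, lane_perm):
    """A lane permute is a no-op if it leaves the load permutation unchanged."""
    return apply_lane_permutation(load_perm, lane_perm) == list(load_perm)


load_perm = [0, 0, 0, 0]  # load permutation { 0 0 0 0 } from the dump
lane_perm = [3, 2, 1, 0]  # lane permutation { 0[3] 0[2] 0[1] 0[0] }

print(permute_is_redundant(load_perm, lane_perm))     # True
print(permute_is_redundant([0, 1, 2, 3], lane_perm))  # False
```

Under these assumptions, any lane permutation of a splat load is redundant, which is consistent with value-numbering eliding it later.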