https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
--- Comment #22 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
For me, with `-fno-vect-cost-model` and without this commit, we generate

https://gist.github.com/Mistuke/d9252bfcb2aa766327c5f377e162f5b7

for the loop. With the commit, well... it doesn't fit on the screen, but the
codegen is pretty horrible, with

        smlal2  v24.4s, v13.8h, v5.8h
        smull   v31.4s, v30.4h, v17.4h
        add     v20.4s, v20.4s, v11.4s
        smlal2  v29.4s, v3.8h, v6.8h
        smull2  v25.4s, v25.8h, v15.8h
        add     v22.4s, v28.4s, v22.4s
        shrn    v21.4h, v21.4s, 15
        add     v20.4s, v20.4s, v26.4s
        add     v29.4s, v29.4s, v24.4s
        smlal2  v25.4s, v16.8h, v7.8h
        smlal   v31.4s, v18.4h, v8.4h
        smull2  v27.4s, v27.8h, v17.8h
        shrn2   v21.8h, v22.4s, 15
        add     v29.4s, v29.4s, v25.4s
        add     v31.4s, v31.4s, v20.4s
        smlal2  v27.4s, v18.8h, v8.8h
        str     h21, [x5, x9]
        add     x9, x9, 32
        add     x9, x5, x9
        shrn    v31.4h, v31.4s, 15
        st1     {v21.h}[1], [x10]
        add     v27.4s, v27.4s, v29.4s
        st1     {v21.h}[2], [x6]
        add     x6, x7, 20
        add     x10, x1, x21
        st1     {v21.h}[3], [x2]
        add     x2, x7, 24
        add     x7, x7, 28
        st1     {v21.h}[4], [x8]
        shrn2   v31.8h, v27.4s, 15
        st1     {v21.h}[5], [x6]
        lsl     x6, x10, 1
        add     x10, x5, x10, lsl 1
        st1     {v21.h}[6], [x2]
        add     x2, x10, 4
        st1     {v21.h}[7], [x7]
        add     x7, x10, 8
        str     h31, [x5, x6]
        add     x8, x10, 12
        lsl     x1, x1, 1
        add     x6, x6, 32
        st1     {v31.h}[1], [x2]
        add     x2, x10, 16
        st1     {v31.h}[2], [x7]
        add     x7, x10, 20
        st1     {v31.h}[3], [x8]
        add     x8, x10, 24
        add     x10, x10, 28
        st1     {v31.h}[4], [x2]
        st1     {v31.h}[5], [x7]
        add     x11, x1, 32
        st1     {v31.h}[6], [x8]
        add     x11, x0, x11
        st1     {v31.h}[7], [x10]
        add     x10, x1, x25
        ld1h    z31.s, p5/z, [x11]

going on for a while, i.e. single-element lane stores.

So with the cost model disabled, it definitely does get worse with that
commit. With the cost model on there's no difference.
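
For readers less familiar with the lane form of st1, here is a rough scalar
model of what one group of those stores amounts to. It is only an
illustration: store_lanes_h and its parameters are hypothetical names, and
this is not the testcase from the PR. Each 16-bit lane of a vector register
is written to its own separately computed address, which is why every store
in the dump comes with extra scalar address arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the PR: a scalar model of one
   "str hN" followed by seven "st1 {vN.h}[i], [xM]" instructions.
   Each 16-bit lane of the vector goes to its own element index.  */
static void
store_lanes_h (int16_t *base, const size_t index[8], const int16_t lanes[8])
{
  assert (base != NULL);
  for (int i = 0; i < 8; i++)
    base[index[i]] = lanes[i];	/* one store instruction per lane */
}
```

Calling it with indices 0, 2, 4, ..., 14 would mirror the 4-byte spacing
(x10 + 4, x10 + 8, ..., x10 + 28) visible in the v31 group above: eight
separate halfword stores instead of one contiguous 16-byte store.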