https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123190
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 16 (AVX2) cost model:

t.c:11:14: note:  Cost model analysis:
  Vector inside of loop cost: 1292
  Vector prologue cost: 316
  Vector epilogue cost: 1488
  Scalar iteration cost: 1488
  Scalar outside cost: 8
  Vector outside cost: 1804
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 2

GCC 16 (SSE2) cost model [-mprefer-vector-width=128]:

t.c:11:14: note:  Cost model analysis:
  Vector inside of loop cost: 956
  Vector prologue cost: 316
  Vector epilogue cost: 0
  Scalar iteration cost: 1488
  Scalar outside cost: 8
  Vector outside cost: 316
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
t.c:11:14: note:  Runtime profitability threshold = 1

Given the constant iteration count, the SSE2 loop costs 956 * 3 vs.
1292 + 1488 for AVX2, which makes the AVX2 loop cheaper on paper, with
the much higher outside cost turning the tide to the SSE2 vector width.
That is, if we were actually going to do cost comparison on x86.  It's
too late to do this for GCC 16 though.

With -mprefer-vector-width=128, GCC 16 is faster overall and the
difference in this function is much lower.
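
For reference, the arithmetic behind the comparison above can be redone in a
few lines.  This is only a sketch: the numbers are copied from the two dumps,
the iteration split (3 SSE2 vector iterations; 1 AVX2 vector iteration plus 1
scalar epilogue iteration) is taken from the "956 * 3 vs. 1292 + 1488"
expression, and the variable names are made up for illustration.  It does not
reproduce how the vectorizer itself combines these figures.

```python
# Figures copied from the two cost-model dumps in the comment.
avx2 = {"inside": 1292, "prologue": 316, "epilogue": 1488, "outside": 1804}
sse2 = {"inside": 956, "prologue": 316, "epilogue": 0, "outside": 316}
scalar_iteration_cost = 1488

# Iteration split as implied by the comment's own expression (assumption):
# SSE2: 3 vector iterations, no epilogue.
# AVX2: 1 vector iteration plus 1 scalar epilogue iteration.
sse2_body = sse2["inside"] * 3
avx2_body = avx2["inside"] + scalar_iteration_cost

# In both dumps the outside cost is prologue + epilogue.
assert avx2["outside"] == avx2["prologue"] + avx2["epilogue"]
assert sse2["outside"] == sse2["prologue"] + sse2["epilogue"]

body_advantage_avx2 = sse2_body - avx2_body
outside_penalty_avx2 = avx2["outside"] - sse2["outside"]

print("SSE2 loop body cost:  ", sse2_body)
print("AVX2 loop body cost:  ", avx2_body)
print("AVX2 body advantage:  ", body_advantage_avx2)
print("AVX2 outside penalty: ", outside_penalty_avx2)
```

The AVX2 advantage inside the loop is small next to the gap in outside cost,
which is the "turning the tide" effect described above.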
