https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049
--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
Just an update on costs:

t.c:1:35: note:  === vect_compute_single_scalar_iteration_cost ===
0x483e120 *_3 1 times scalar_load costs 12 in body
0x483e120 _4 + r_16 1 times scalar_stmt costs 12 in body

and the vector body cost:

0x492f9d0 *_3 1 times unaligned_load (misalign -1) costs 20 in body
0x492f9d0 _4 + r_16 8 times vec_to_scalar costs 32 in body
0x492f9d0 _4 + r_16 8 times scalar_stmt costs 96 in body

That results in the overall (and sensible)

t.c:1:35: note:  Cost model analysis:
  Vector inside of loop cost: 148
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 0

where for one vector iteration we have 8 scalar iterations, thus
24 * 8 = 192 for the scalar loop versus 148 for the vector body.

As mentioned elsewhere, the vectorizer cost model does not care about
pipeline latency or dependency issues, nor about competition for execution
resources.  It also does not care about loop size (the vector loop has one
stmt more than the unrolled scalar loop, for example).  I once played with
limiting the vectorization loop growth with the unroll parameters, but
we're far from hitting those here.

Btw, a microbenchmark shows the loop executes in about the same time
vectorized with -mavx2 as it does scalar and not unrolled.  When the scalar
loop is unrolled 8 times the runtime is the same again (this is all
benchmarked on a Haswell machine).  If you disregard noise, the scalar
unrolled loop is maybe a tad faster than the other cases.

I believe the limiting factor is the dependence chain of the adds; there
are plenty of parallel execution resources to cope with ugliness and
friends.

That leaves the code bloat as the regression, I think.
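
The cost comparison in the comment can be re-derived from the quoted dump lines. The following is only a minimal sketch for checking the arithmetic, not anything GCC itself runs: the helper name `body_cost` and the constant `VF` are made up for illustration, and it assumes that the number after "costs" on each line is already the total for that line (which is consistent with 20 + 32 + 96 = 148).

```python
import re

# Cost lines copied verbatim from the vectorizer dump quoted in the comment.
SCALAR_BODY = """\
0x483e120 *_3 1 times scalar_load costs 12 in body
0x483e120 _4 + r_16 1 times scalar_stmt costs 12 in body
"""

VECTOR_BODY = """\
0x492f9d0 *_3 1 times unaligned_load (misalign -1) costs 20 in body
0x492f9d0 _4 + r_16 8 times vec_to_scalar costs 32 in body
0x492f9d0 _4 + r_16 8 times scalar_stmt costs 96 in body
"""

# Assumption: "costs N" is the per-line total, so the "K times" count
# must not be multiplied in again.
COST_RE = re.compile(r"costs (\d+) in body")

# Hypothetical name for the "8 scalar iterations per vector iteration".
VF = 8


def body_cost(dump: str) -> int:
    """Sum the 'costs N in body' field over all lines of a dump excerpt."""
    return sum(int(m.group(1)) for m in COST_RE.finditer(dump))


scalar_iter = body_cost(SCALAR_BODY)   # "Scalar iteration cost"
vector_iter = body_cost(VECTOR_BODY)   # "Vector inside of loop cost"
scalar_equiv = scalar_iter * VF        # scalar cost of the same amount of work

print(f"scalar iteration cost:        {scalar_iter}")
print(f"vector inside of loop cost:   {vector_iter}")
print(f"scalar cost for {VF} iterations: {scalar_equiv}")
print(f"vector deemed cheaper:        {vector_iter < scalar_equiv}")
```

The sums only reproduce what the cost model reports; as the comment points out, they say nothing about latency or the dependence chain of the adds, which is what the microbenchmark suggests dominates here.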