https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
Just an update on costs:

t.c:1:35: note:   === vect_compute_single_scalar_iteration_cost ===
0x483e120 *_3 1 times scalar_load costs 12 in body
0x483e120 _4 + r_16 1 times scalar_stmt costs 12 in body

and the vector body cost:

0x492f9d0 *_3 1 times unaligned_load (misalign -1) costs 20 in body
0x492f9d0 _4 + r_16 8 times vec_to_scalar costs 32 in body
0x492f9d0 _4 + r_16 8 times scalar_stmt costs 96 in body

That results in the overall (and sensible)

t.c:1:35: note:  Cost model analysis:
  Vector inside of loop cost: 148
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 0

where for one vector iteration we have 8 scalar iterations, thus a scalar
cost of 24 * 8 = 192 against the vector body cost of 148.
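
To make the comparison explicit, here is a small sketch that only redoes the arithmetic from the dumps above. The function names are made up for illustration and this is not the vectorizer's actual profitability computation; prologue, epilogue and outside costs are all 0 here, so they are left out.

```c
#include <assert.h>

/* Costs are copied from the dumps above; all names are hypothetical. */

/* scalar_load (12) + scalar_stmt (12) */
int scalar_iteration_cost(void)
{
    return 12 + 12;
}

/* unaligned_load (20) + 8 x vec_to_scalar (32) + 8 x scalar_stmt (96) */
int vector_body_cost(void)
{
    return 20 + 32 + 96;
}

/* Cost of the vf scalar iterations that one vector iteration replaces. */
int scalar_cost_for_vf(int vf)
{
    assert(vf > 0);
    return scalar_iteration_cost() * vf;
}

/* Nonzero when the vector body is costed cheaper than vf scalar iterations. */
int vector_body_is_cheaper(int vf)
{
    return vector_body_cost() < scalar_cost_for_vf(vf);
}
```

With vf = 8 this compares 148 against 192, so the vector body is costed as cheaper.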

As mentioned elsewhere, the vectorizer cost model does not account for
pipeline latency, dependency issues, or competition for execution resources.
It also does not account for loop size (for example, the vector loop has one
stmt more than the unrolled scalar loop).  I once played with limiting the
vectorization loop growth with the unroll parameters, but we're far from
hitting those here.

Btw, a microbenchmark shows the loop executes in about the same time
when vectorized with -mavx2 as when scalar and not unrolled.  When
the scalar loop is unrolled 8 times the runtime is the same again
(this is all benchmarked on a Haswell machine).  If you disregard noise
then the scalar unrolled loop is maybe a tad faster than the other
cases.

I believe the limiting factor is the dependence chain of the adds;
there are plenty of parallel execution resources to cope with the ugliness
and friends.
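
A rough sketch of why the chain dominates, assuming a float sum reduction (the dump only shows a load `*_3` feeding `_4 + r_16`; the element type, function names and array layout here are assumptions, not the testcase from the PR). Both variants add the elements in the same order, so every add has to wait for the previous one, unrolled or not:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar loop: one load and one add per iteration. */
float sum_scalar(const float *a, size_t n)
{
    float r = 0.0f;
    for (size_t i = 0; i < n; i++)
        r += a[i];              /* each add depends on the previous r */
    return r;
}

/* Hypothetical 8x unrolled loop: fewer branches, but still a single
   serial chain of 8 adds per iteration, in the original order. */
float sum_unrolled8(const float *a, size_t n)
{
    assert(n % 8 == 0);         /* sketch only: no remainder handling */
    float r = 0.0f;
    for (size_t i = 0; i < n; i += 8) {
        r += a[i + 0];
        r += a[i + 1];
        r += a[i + 2];
        r += a[i + 3];
        r += a[i + 4];
        r += a[i + 5];
        r += a[i + 6];
        r += a[i + 7];
    }
    return r;
}
```

The vector body costed above has the same shape: per the dump it does one vector load, 8 vec_to_scalar extracts and 8 scalar adds, so the adds still form one chain.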

This leaves the code bloat as the regression, I think.
