https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123190
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
I have a patch that improves behavior with -O3, where we unroll the inner loop
of mult_su3_na. With -O2 -flto -fprofile-use we do not unroll, and for
this function vectorizing the epilogue isn't deemed profitable (despite the
fix). With -O3:
t.c:11:14: note: operating on full vectors for epilogue loop.
t.c:11:14: note: Cost model analysis:
Vector inside of loop cost: 956
Vector prologue cost: 288
Vector epilogue cost: 0
Scalar iteration cost: 1488
Scalar outside cost: 0
Vector outside cost: 288
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1
t.c:11:14: note: Runtime profitability threshold = 1
but with -O2:
t.c:12:16: note: operating on full vectors for epilogue loop.
t.c:12:16: note: Cost model analysis:
Vector inside of loop cost: 436
Vector prologue cost: 96
Vector epilogue cost: 0
Scalar iteration cost: 528
Scalar outside cost: 0
Vector outside cost: 96
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 2
t.c:12:16: note: Runtime profitability threshold = 2
Note that the number of epilogue iterations is 1 (but with VF == 1 as well).
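As a rough illustration of how the two "minimum iters" values above follow
from the dumped costs, here is a minimal sketch of the break-even test. It is
not the vectorizer's actual code: the function name and the brute-force search
are made up for illustration, VF == 1 is assumed as stated above, and prologue
and epilogue peeling are taken as 0 as in the dumps.

```c
/* Hypothetical helper, not GCC code.  Returns the smallest iteration
   count n for which the scalar loop is strictly more expensive than
   the vectorized one:

     sic * n + soc  >  vic * (n / vf) + sic * (n % vf) + voc

   vic = vector inside-of-loop cost, voc = vector outside cost,
   sic = scalar iteration cost,      soc = scalar outside cost.
   Leftover iterations (n % vf) are charged at scalar cost.
   Returns -1 if no n below the search bound qualifies.  */
static int
min_profitable_iters (int vic, int voc, int sic, int soc, int vf)
{
  for (int n = 1; n < 1000000; n++)
    {
      long scalar_cost = (long) sic * n + soc;
      long vector_cost = (long) vic * (n / vf) + (long) sic * (n % vf) + voc;
      if (scalar_cost > vector_cost)
        return n;
    }
  return -1;
}
```

Plugging in the -O3 numbers (956, 288, 1488, 0, VF 1) gives 1, since
1488 > 956 + 288. The -O2 numbers (436, 96, 528, 0, VF 1) give 2, since
528 <= 436 + 96 but 1056 > 872 + 96.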
In particular, load costing for splat permutes looks off to me, though it
matches what we code-generate:
t.c:11:14: note: ------>vectorizing SLP node starting from: ar_230 = a_7(D)->e[i_217][1].real;
t.c:11:14: note: transform load.
t.c:11:14: note: create vector_type-pointer variable to type: vector(2) double vectorizing a record based array ref: *a_7(D)
t.c:11:14: note: created vectp_a.125_546
t.c:11:14: note: add new stmt: vect_ar_230.126_549 = MEM <vector(2) double> [(double *)vectp_a.124_547];
t.c:11:14: note: add new stmt: vectp_a.124_550 = vectp_a.124_547 + 16;
t.c:11:14: note: add new stmt: vect_ar_230.127_551 = MEM <vector(2) double> [(double *)vectp_a.124_550];
t.c:11:14: note: add new stmt: vectp_a.124_552 = vectp_a.124_550 + 16;
gimple_simplified to vectp_a.124_552 = vectp_a.124_547 + 32;
t.c:11:14: note: add new stmt: vect_ar_230.128_553 = MEM <vector(2) double> [(double *)vectp_a.124_552];
t.c:11:14: note: add new stmt: vect_ar_220.129_554 = VEC_PERM_EXPR <vect_ar_230.127_551, vect_ar_230.127_551, { 0, 0 }>;
That's from the VMAT_CONTIGUOUS path. I have an improvement for this as well.
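
For readers not used to the dump format, the load sequence above corresponds
roughly to the following C, written with the GNU vector extension. This is
only a sketch: the type definitions are assumptions inferred from the access
a->e[i][1].real (a 3x3 array of real/imag double pairs), the function name is
made up, and the base pointer is assumed to address the start of row i.

```c
#include <string.h>

/* Assumed layout, inferred from a->e[i][1].real in the dump.  */
typedef struct { double real, imag; } complex_t;
typedef struct { complex_t e[3][3]; } su3_matrix_t;

/* Counterpart of "vector(2) double" in the dump.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Hypothetical function mirroring the dumped statements for one row:
   three contiguous vector(2) double loads at byte offsets 0, 16 and 32,
   then VEC_PERM_EXPR <v1, v1, { 0, 0 }>, which broadcasts lane 0 of the
   second load.  Only the second load feeds the permute.  */
static v2df
splat_real_of_column1 (const su3_matrix_t *a, int i)
{
  const char *row = (const char *) a->e[i];
  v2df v0, v1, v2;

  memcpy (&v0, row, sizeof v0);        /* MEM [vectp_a]       */
  memcpy (&v1, row + 16, sizeof v1);   /* MEM [vectp_a + 16]  */
  memcpy (&v2, row + 32, sizeof v2);   /* MEM [vectp_a + 32]  */
  (void) v0;
  (void) v2;

  /* Splat lane 0 of v1, i.e. e[i][1].real, into both lanes.  */
  return (v2df) { v1[0], v1[0] };
}
```

Under these assumptions, both lanes of the result hold a->e[i][1].real, so the
splat is fed by a full vector load plus a permute rather than a scalar load.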