Tamar Christina <tamar.christ...@arm.com> writes:
> To avoid double counting scalar instructions when doing inner-loop costing
> the vectorizer uses vect_prologue as the kind instead of vect_body.
>
> However doing this results in our throughput-based costing thinking that
> the scalar loop issues in 0 cycles (rounded up to 1.0).  The latency costs
> however are fine:
>
>   note: operating on full vectors.
>   note: Original vector body cost = 10
>   note: Vector loop iterates at most 25000 times
>   note: Scalar issue estimate:
>   note:   load operations = 0
>   note:   store operations = 0
>   note:   general operations = 0
>   note:   reduction latency = 0
>   note:   estimated min cycles per iteration = 1.000000
>   note:   estimated cycles per vector iteration (for VF 4) = 4.000000
>   note: Vector issue estimate:
>   note:   load operations = 1
>   note:   store operations = 0
>   note:   general operations = 2
>   note:   reduction latency = 0
>   note:   estimated min cycles per iteration = 1.000000
>   note: Cost model analysis:
>     Vector inside of loop cost: 10
>     Vector prologue cost: 4
>     Vector epilogue cost: 0
>     Scalar iteration cost: 6
>     Scalar outside cost: 0
>     Vector outside cost: 4
>     prologue iterations: 0
>     epilogue iterations: 0
>     Calculated minimum iters for profitability: 2
>   note:   Runtime profitability threshold = 4
>   note:   Static estimate profitability threshold = 4
>
> This patch changes it so that when doing inner-loop costing we consider all
> statements for throughput costing.  Typically the scalar code shouldn't
> have an epilogue cost, but this also covers any future tweaks to the
> vectorizer costing where the same trick might be used.  After this patch
> the throughputs for inner-loop scalar statements are now correct:
>
>   note: operating on full vectors.
>   note: Original vector body cost = 500
>   note: Vector loop iterates at most 25000 times
>   note: Scalar issue estimate:
>   note:   load operations = 50
>   note:   store operations = 0
>   note:   general operations = 50
>   note:   reduction latency = 0
>   note:   estimated min cycles per iteration = 16.666667
>   note:   estimated cycles per vector iteration (for VF 4) = 66.666667
>   note: Vector issue estimate:
>   note:   load operations = 1
>   note:   store operations = 0
>   note:   general operations = 2
>   note:   reduction latency = 0
>   note:   estimated min cycles per iteration = 1.000000
>
> The cost multiplier of 50 comes from a generic multiplier that the
> vectorizer applies to inner-loop costing.
I don't really understand why the scalar costing code is written the way
that it is.  It uses the count parameter to apply a factor of
LOOP_VINFO_INNER_LOOP_COST_FACTOR for the inner loop, but then uses
vect_prologue rather than vect_body to prevent adjust_cost_for_freq from
applying the same factor:

  /* FORNOW.  */
  innerloop_iters = 1;
  if (loop->inner)
    innerloop_iters = LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo);

  for (i = 0; i < nbbs; i++)
    {
      ...
      basic_block bb = bbs[i];

      if (bb->loop_father == loop->inner)
        factor = innerloop_iters;
      else
        factor = 1;
      ...
          /* We are using vect_prologue here to avoid scaling twice
             by the inner loop factor.  */
          record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
                            factor, kind, stmt_info, 0, vect_prologue);

Why not just use vect_body rather than vect_prologue for the inner loop,
and pass 1 instead of factor for all cases?  (There is a rough sketch of
what I mean at the end of this message, after the quoted patch.)  That
seems like it gives the target more information and would avoid both the
original problem and the *50 in the throughput calculations above.

I fear that if we go with the patch as-is, the scalar throughput costs
will be so inflated that we'll need to find excuses to inflate the vector
costs by the same amount.

Thanks,
Richard

>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>         PR target/121290
>         * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
>         Include prologue costing when costing inner loop scalar stmts.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/121290
>         * gcc.target/aarch64/pr121290.c: New test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index fb8311b655d7300ce22215789437da777b3779ed..1278d563d5efa08b6a9871f69b63f0786877bf99 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -18060,7 +18060,14 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>    fractional_cost stmt_cost
>      = aarch64_builtin_vectorization_cost (kind, vectype, misalign);
>
> -  bool in_inner_loop_p = (where == vect_body
> +  /* When costing for scalars the vectorizer passes them as a vect_prologue
> +     cost instead of vect_body to avoid double counting of scaled costs.
> +     But because they are passed as vect_prologue our inner loop detection
> +     thinks that all the loop statements are outside the inner loop.  As
> +     such the issue rate information for inner loops becomes zero for
> +     scalars.  Therefore accept any scalar statement, regardless of the
> +     where, as part of the inner loop body.  */
> +  bool in_inner_loop_p = ((where == vect_body || m_costing_for_scalar)
>                            && stmt_info
>                            && stmt_in_inner_loop_p (m_vinfo, stmt_info));
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr121290.c b/gcc/testsuite/gcc.target/aarch64/pr121290.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c88cb7e6979ef43ebaf14c3d3f3c4c561bff3e76
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr121290.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -mcpu=neoverse-v2 -fdump-tree-vect-all -std=c99" } */
> +
> +void
> +f (int *restrict x, int *restrict y, int *restrict z, int n)
> +{
> +  for (int i = 0; i < 4; ++i)
> +    {
> +      int res = 0;
> +      for (int j = 0; j < 100; ++j)
> +        res += y[j] * z[i];
> +      x[i] = res;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> +/* { dg-final { scan-tree-dump-not "load operations = 0" "vect" } } */
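
To make the suggestion above concrete, here is a rough, untested sketch of
the record_stmt_cost call I have in mind.  It only paraphrases the call
quoted from tree-vect-loop.cc above; the surrounding loop and the rest of
the context are assumed to stay as they are and aren't shown:

          /* Sketch only: always record against vect_body with a count
             of 1, so that the frequency-based scaling applies the
             inner-loop factor exactly once and the target sees the
             statement with the "where" that actually describes it.  */
          record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
                            1, kind, stmt_info, 0, vect_body);

That way the count no longer doubles up as a frequency factor, and hooks
like aarch64's add_stmt_cost can key off where == vect_body for scalar
statements in the same way they already do for the vector code.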