Richard Biener <rguent...@suse.de> writes: > On Wed, 20 Aug 2025, Richard Sandiford wrote: > >> Tamar Christina <tamar.christ...@arm.com> writes: >> > To avoid double counting scalar instructions when doing inner-loop costing >> > the >> > vectorizer uses vect_prologue as the kind instead of vect_body. >> > >> > However doing this results in our throughput based costing to think the >> > scalar >> > loop issues in 0 cycles (rounded up to 1.0). The latency costs however >> > are fine >> > >> > note: operating on full vectors. >> > note: Original vector body cost = 10 >> > note: Vector loop iterates at most 25000 times >> > note: Scalar issue estimate: >> > note: load operations = 0 >> > note: store operations = 0 >> > note: general operations = 0 >> > note: reduction latency = 0 >> > note: estimated min cycles per iteration = 1.000000 >> > note: estimated cycles per vector iteration (for VF 4) = 4.000000 >> > note: Vector issue estimate: >> > note: load operations = 1 >> > note: store operations = 0 >> > note: general operations = 2 >> > note: reduction latency = 0 >> > note: estimated min cycles per iteration = 1.000000 >> > note: Cost model analysis: >> > Vector inside of loop cost: 10 >> > Vector prologue cost: 4 >> > Vector epilogue cost: 0 >> > Scalar iteration cost: 6 >> > Scalar outside cost: 0 >> > Vector outside cost: 4 >> > prologue iterations: 0 >> > epilogue iterations: 0 >> > Calculated minimum iters for profitability: 2 >> > note: Runtime profitability threshold = 4 >> > note: Static estimate profitability threshold = 4 >> > >> > This patch, changes it so that when doing inner-loop costing consider all >> > statements for throughput costing. Typically scalar shouldn't have an >> > epilogue >> > cost but this also covers any future tweaks to the vectorizer costing >> > where it >> > might be used as a similar trick for costing. After this patch the >> > throughputs >> > for scalar innner-loop vect are now correct: >> > >> > note: operating on full vectors. >> > note: Original vector body cost = 500 >> > note: Vector loop iterates at most 25000 times >> > note: Scalar issue estimate: >> > note: load operations = 50 >> > note: store operations = 0 >> > note: general operations = 50 >> > note: reduction latency = 0 >> > note: estimated min cycles per iteration = 16.666667 >> > note: estimated cycles per vector iteration (for VF 4) = 66.666667 >> > note: Vector issue estimate: >> > note: load operations = 1 >> > note: store operations = 0 >> > note: general operations = 2 >> > note: reduction latency = 0 >> > note: estimated min cycles per iteration = 1.000000 >> > >> > The cost multiplier of 50 is due to a generic multiplier the vectorizer >> > applies >> > to inner loop costing. >> >> I don't really understand why the scalar costing code is written >> the way that it is. It uses the count parameter to apply a factor of >> LOOP_VINFO_INNER_LOOP_COST_FACTOR for the inner loop, but then uses >> vect_prologue rather than vect_body to prevent adjust_cost_for_freq from >> applying the same factor: >> >> /* FORNOW. */ >> innerloop_iters = 1; >> if (loop->inner) >> innerloop_iters = LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo); >> >> for (i = 0; i < nbbs; i++) >> { >> ... >> basic_block bb = bbs[i]; >> >> if (bb->loop_father == loop->inner) >> factor = innerloop_iters; >> else >> factor = 1; >> >> ... >> /* We are using vect_prologue here to avoid scaling twice >> by the inner loop factor. */ >> record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), >> factor, kind, stmt_info, 0, vect_prologue); >> >> Why not just use vect_body rather than vect_prologue for the inner loop, >> and pass 1 instead of factor for all cases? That seems like it gives the >> target more information and would avoid both the original problem and >> the *50 in the throughput calculations above. >> >> I fear that if we go with the patch as-is, the scalar througput costs >> will be so inflated that we'll need to find excuses to inflate the >> vector code by the same amount. > > I guess this is a left-over hack for not wanting to adjust targets. > x86 does unconditional > > /* Statements in an inner loop relative to the loop being > vectorized are weighted more heavily. The value here is > arbitrary and could potentially be improved with analysis. */ > retval = adjust_cost_for_freq (stmt_info, where, count * stmt_cost); > > and adjust_cost_for_freq only applies to vect_body costs.
Yeah. It looks like longarch and rs6000 also call this directly, whereas aarch64 and riscv use the record_stmt_cost helper, which does the same thing internally. I suppose that itself a partial transition, in that all ports could probably use record_stmt_cost, to make things consistent. But that's just cosmetic. > I suppose > the inner loop adjustment should be simply removed from scalar > loop costing above, use vect_body and then leave the scale applying > to add_stmt_cost? Yeah, I think that would be better. A simplification and a costing fix :) Richard