Richard Biener <rguent...@suse.de> writes:
> On Wed, 20 Aug 2025, Richard Sandiford wrote:
>
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> > To avoid double counting scalar instructions when doing inner-loop costing 
>> > the
>> > vectorizer uses vect_prologue as the kind instead of vect_body.
>> >
>> > However doing this results in our throughput based costing to think the 
>> > scalar
>> > loop issues in 0 cycles (rounded up to 1.0).  The latency costs however 
>> > are fine
>> >
>> > note:  operating on full vectors.
>> > note:  Original vector body cost = 10
>> > note:  Vector loop iterates at most 25000 times
>> > note:  Scalar issue estimate:
>> > note:    load operations = 0
>> > note:    store operations = 0
>> > note:    general operations = 0
>> > note:    reduction latency = 0
>> > note:    estimated min cycles per iteration = 1.000000
>> > note:    estimated cycles per vector iteration (for VF 4) = 4.000000
>> > note:  Vector issue estimate:
>> > note:    load operations = 1
>> > note:    store operations = 0
>> > note:    general operations = 2
>> > note:    reduction latency = 0
>> > note:    estimated min cycles per iteration = 1.000000
>> > note:  Cost model analysis:
>> >   Vector inside of loop cost: 10
>> >   Vector prologue cost: 4
>> >   Vector epilogue cost: 0
>> >   Scalar iteration cost: 6
>> >   Scalar outside cost: 0
>> >   Vector outside cost: 4
>> >   prologue iterations: 0
>> >   epilogue iterations: 0
>> >   Calculated minimum iters for profitability: 2
>> > note:    Runtime profitability threshold = 4
>> > note:    Static estimate profitability threshold = 4
>> >
>> > This patch, changes it so that when doing inner-loop costing consider all
>> > statements for throughput costing.  Typically scalar shouldn't have an 
>> > epilogue
>> > cost but this also covers any future tweaks to the vectorizer costing 
>> > where it
>> > might be used as a similar trick for costing.  After this patch the 
>> > throughputs
>> > for scalar innner-loop vect are now correct:
>> >
>> > note:  operating on full vectors.
>> > note:  Original vector body cost = 500
>> > note:  Vector loop iterates at most 25000 times
>> > note:  Scalar issue estimate:
>> > note:    load operations = 50
>> > note:    store operations = 0
>> > note:    general operations = 50
>> > note:    reduction latency = 0
>> > note:    estimated min cycles per iteration = 16.666667
>> > note:    estimated cycles per vector iteration (for VF 4) = 66.666667
>> > note:  Vector issue estimate:
>> > note:    load operations = 1
>> > note:    store operations = 0
>> > note:    general operations = 2
>> > note:    reduction latency = 0
>> > note:    estimated min cycles per iteration = 1.000000
>> >
>> > The cost multiplier of 50 is due to a generic multiplier the vectorizer 
>> > applies
>> > to inner loop costing.
>> 
>> I don't really understand why the scalar costing code is written
>> the way that it is.  It uses the count parameter to apply a factor of
>> LOOP_VINFO_INNER_LOOP_COST_FACTOR for the inner loop, but then uses
>> vect_prologue rather than vect_body to prevent adjust_cost_for_freq from
>> applying the same factor:
>> 
>>   /* FORNOW.  */
>>   innerloop_iters = 1;
>>   if (loop->inner)
>>     innerloop_iters = LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo);
>> 
>>   for (i = 0; i < nbbs; i++)
>>     {
>>       ...
>>       basic_block bb = bbs[i];
>> 
>>       if (bb->loop_father == loop->inner)
>>         factor = innerloop_iters;
>>       else
>>         factor = 1;
>> 
>>           ...
>>        /* We are using vect_prologue here to avoid scaling twice
>>           by the inner loop factor.  */
>>        record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
>>                          factor, kind, stmt_info, 0, vect_prologue);
>> 
>> Why not just use vect_body rather than vect_prologue for the inner loop,
>> and pass 1 instead of factor for all cases?  That seems like it gives the
>> target more information and would avoid both the original problem and
>> the *50 in the throughput calculations above.
>> 
>> I fear that if we go with the patch as-is, the scalar througput costs
>> will be so inflated that we'll need to find excuses to inflate the
>> vector code by the same amount.
>
> I guess this is a left-over hack for not wanting to adjust targets.
> x86 does unconditional
>
>   /* Statements in an inner loop relative to the loop being
>      vectorized are weighted more heavily.  The value here is
>      arbitrary and could potentially be improved with analysis.  */
>   retval = adjust_cost_for_freq (stmt_info, where, count * stmt_cost);
>
> and adjust_cost_for_freq only applies to vect_body costs.

Yeah.  It looks like longarch and rs6000 also call this directly,
whereas aarch64 and riscv use the record_stmt_cost helper, which does
the same thing internally.

I suppose that itself a partial transition, in that all ports could
probably use record_stmt_cost, to make things consistent.  But that's
just cosmetic.

> I suppose
> the inner loop adjustment should be simply removed from scalar
> loop costing above, use vect_body and then leave the scale applying
> to add_stmt_cost?

Yeah, I think that would be better.  A simplification and a costing fix :)

Richard

Reply via email to