Richard Biener via Gcc-patches <[email protected]> writes:
> On Mon, Nov 8, 2021 at 2:06 PM Richard Sandiford
> <[email protected]> wrote:
>>
>> Richard Biener <[email protected]> writes:
>> > On Mon, Nov 8, 2021 at 11:45 AM Richard Sandiford via Gcc-patches
>> > <[email protected]> wrote:
>> >>
>> >> One of the things we want to do on AArch64 is compare vector loops
>> >> side-by-side and pick the best one. For some targets, we want this
>> >> to be based on issue rates as well as the usual latency-based costs
>> >> (at least for loops with relatively high iteration counts).
>> >>
>> >> The current approach to doing this is: when costing vectorisation
>> >> candidate A, try to guess what the other main candidate B will look
>> >> like and adjust A's latency-based cost up or down based on the likely
>> >> difference between A and B's issue rates. This effectively means
>> >> that we try to cost parts of B at the same time as A, without actually
>> >> being able to see B.
>> >
>> > I think the idea of the current costing is that compares are always
>> > to the original scalar loop (with comparing the resulting magic
>> > cost numbers) so that by means of transitivity comparing two
>> > vector variants work as well. So I'm not sure where 'variant B'
>> > comes into play here?
>>
>> E.g. A could be SVE and B could be Advanced SIMD, or A could be
>> SVE with fully-packed vectors and B could be SVE with partially
>> unpacked vectors.
>>
>> One motivating case is Neoverse V1, which is a 256-bit SVE target.
>> There, scalar code, Advanced SIMD code and SVE code have different issue
>> characteristics. SVE vector ops generally have the same latencies as
>> the corresponding Advanced SIMD ops. Advanced SIMD ops generally
>> have double the throughput of SVE ones, so that the overall bit-for-bit
>> throughputs are roughly equal. However, there are differences due to
>> predication, load/store handling, and so on, and those differences
>> should be decisive in some cases.
>>
>> For SLP, latency-based costs are fine. But for loop costs, they hide
>> too many details. What works best in practice, both for vector vs.
>> vector and vector vs. scalar, is:
>>
>> (1) compare issue rates between two loop bodies (vector vs. vector
>> or vector vs. scalar)
>> (2) if issue rates are equal for a given number of scalar iterations,
>> compare the latencies of the loop bodies, as we do now
>> (3) if both the above are equal, compare the latencies of the prologue
>> and epilogue, as we do now
>>
>> It's very difficult to combine latency and issue rate information
>> into a single per-statement integer, in such a way that the integers
>> can just be summed and compared.
>
> Yeah, so the idea I had when proposing the init_cost/add_stmt/finish_cost
> was that finish_cost would determine the issue rate cost part, for
> example if the uarch can issue 2 ops with ISA A and 4 ops with ISA B
> then finish_cost would compute the number of cycles required to
> issue all vector stmts. That yields sth like IPC the latency cost could
> be divided by. Doing that for both ISA A and ISA B would produce
> weighted latency values that can be compared? Alternatively
> you can of course simulate issue with the actual instruction latencies
> in mind and produce an overall iteration latency number.
>
> Comparing just latency or issue rate in isolation is likely not good
> enough?
In practice, comparing issue rates in isolation does seem to be what we
want as the first level of comparison. I don't think total latency /
issue rate is really a meaningful combination. It makes sense to the
extent that “lower latency == good” and “higher issue rate == good”,
but I don't think it's the case that halving the total latency is
equally valuable as doubling the issue rate. In practice the latter is
a significant win while the former might or might not be.
total latency / issue rate also runs the risk of double-counting:
reducing the number of ops decreases the total latency *and* increases
the issue rate, which could have a quadratic effect. (This is why we
don't decrease SVE costs when we think that the SVE code will issue
more quickly than the Advanced SIMD code.)
For a long-running loop, the only latencies that really matter are
for loop-carried dependencies. The issue rate information takes
those into account, in that it tracks reduction and (with later
patches) induction limiters.
Thanks,
Richard
>
>> So returning to the above, when costing an SVE loop A, we currently
>> collect on-the-side issue information about both A and the likely
>> Advanced SIMD implementation B. If B would have a higher throughput
>> than A then we multiply A's latency-based costs by:
>>
>> ceil(B throughput/A throughput)
>>
>> The problem is, we don't actually know whether B exists or what it
>> would look like (given that Advanced SIMD and SVE have different
>> features).
>>
>> In principle, we should do the same in reverse, but since SVE needs
>> fewer ops than Advanced SIMD, the latencies are usually smaller already.
>>
>> We also do something similar for the scalar code. When costing both
>> Advanced SIMD and SVE, we try to estimate the issue rate of the
>> original scalar code. If the scalar code could issue more quickly
>> than the equivalent vector code, we increase the latency-based cost
>> of the vector code to account for that.
>>
>> The idea of the patch (and others) is that we can do the (1), (2),
>> (3) comparison above directly rather than indirectly. We can also
>> use the issue information calculated while costing the scalar code,
>> rather than having to estimate the scalar issue information while
>> costing the vector code.
>>
>> >> This is needlessly indirect and complex. It was a compromise due
>> >> to the code being added (too) late in the GCC 11 cycle, so that
>> >> target-independent changes weren't possible.
>> >>
>> >> The target-independent code already compares two candidate loop_vec_infos
>> >> side-by-side, so that information about A and B above are available
>> >> directly. This patch creates a way for targets to hook into this
>> >> comparison.
>> >
>> > You mean it has both loop_vec_infos. But as said I'm not sure it's a good
>> > idea to compare those side-by-side - will that not lead to _more_
>> > special-casing
>> > since you need to have a working compare to the scalar variant as well
>> > to determine the vectorization threshold?
>>
>> A later patch allows the code to do the same comparison for vector
>> vs. scalar. It makes the target code significantly simpler compared
>> to now, since add_stmt_cost only needs to consider the latency and
>> issue rate of the code it's actually supposed to be costing, rather than
>> trying also to estimate the issue rate of the scalar code and the issue
>> rate of the Advanced SIMD code.
>>
>> Thanks,
>> Richard
>>
>> >
>> >>
>> >> The AArch64 code can therefore hook into better_main_loop_than_p to
>> >> compare issue rates. If the issue rate comparison isn't decisive,
>> >> the code can fall back to the normal latency-based comparison instead.
>> >>
>> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu. OK to install?
>> >>
>> >> Richard
>> >>
>> >>
>> >> gcc/
>> >> * tree-vectorizer.h (vector_costs::better_main_loop_than_p)
>> >> (vector_costs::better_epilogue_loop_than_p)
>> >> (vector_costs::compare_inside_loop_cost)
>> >> (vector_costs::compare_outside_loop_cost): Likewise.
>> >> * tree-vectorizer.c (vector_costs::better_main_loop_than_p)
>> >> (vector_costs::better_epilogue_loop_than_p)
>> >> (vector_costs::compare_inside_loop_cost)
>> >> (vector_costs::compare_outside_loop_cost): New functions,
>> >> containing code moved from...
>> >> * tree-vect-loop.c (vect_better_loop_vinfo_p): ...here.
>> >> ---
>> >> gcc/tree-vect-loop.c | 142 ++---------------------------
>> >> gcc/tree-vectorizer.c | 204 ++++++++++++++++++++++++++++++++++++++++++
>> >> gcc/tree-vectorizer.h | 17 ++++
>> >> 3 files changed, 226 insertions(+), 137 deletions(-)
>> >>
>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> >> index dd4a363fee5..c9ee2e15e35 100644
>> >> --- a/gcc/tree-vect-loop.c
>> >> +++ b/gcc/tree-vect-loop.c
>> >> @@ -2784,144 +2784,12 @@ vect_better_loop_vinfo_p (loop_vec_info
>> >> new_loop_vinfo,
>> >> return new_simdlen_p;
>> >> }
>> >>
>> >> - loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo);
>> >> - if (main_loop)
>> >> - {
>> >> - poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
>> >> - unsigned HOST_WIDE_INT main_vf;
>> >> - unsigned HOST_WIDE_INT old_factor, new_factor, old_cost, new_cost;
>> >> - /* If we can determine how many iterations are left for the
>> >> epilogue
>> >> - loop, that is if both the main loop's vectorization factor and
>> >> number
>> >> - of iterations are constant, then we use them to calculate the
>> >> cost of
>> >> - the epilogue loop together with a 'likely value' for the
>> >> epilogues
>> >> - vectorization factor. Otherwise we use the main loop's
>> >> vectorization
>> >> - factor and the maximum poly value for the epilogue's. If the
>> >> target
>> >> - has not provided with a sensible upper bound poly vectorization
>> >> - factors are likely to be favored over constant ones. */
>> >> - if (main_poly_vf.is_constant (&main_vf)
>> >> - && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
>> >> - {
>> >> - unsigned HOST_WIDE_INT niters
>> >> - = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
>> >> - HOST_WIDE_INT old_likely_vf
>> >> - = estimated_poly_value (old_vf, POLY_VALUE_LIKELY);
>> >> - HOST_WIDE_INT new_likely_vf
>> >> - = estimated_poly_value (new_vf, POLY_VALUE_LIKELY);
>> >> -
>> >> - /* If the epilogue is using partial vectors we account for the
>> >> - partial iteration here too. */
>> >> - old_factor = niters / old_likely_vf;
>> >> - if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo)
>> >> - && niters % old_likely_vf != 0)
>> >> - old_factor++;
>> >> -
>> >> - new_factor = niters / new_likely_vf;
>> >> - if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo)
>> >> - && niters % new_likely_vf != 0)
>> >> - new_factor++;
>> >> - }
>> >> - else
>> >> - {
>> >> - unsigned HOST_WIDE_INT main_vf_max
>> >> - = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
>> >> -
>> >> - old_factor = main_vf_max / estimated_poly_value (old_vf,
>> >> -
>> >> POLY_VALUE_MAX);
>> >> - new_factor = main_vf_max / estimated_poly_value (new_vf,
>> >> -
>> >> POLY_VALUE_MAX);
>> >> -
>> >> - /* If the loop is not using partial vectors then it will
>> >> iterate one
>> >> - time less than one that does. It is safe to subtract one
>> >> here,
>> >> - because the main loop's vf is always at least 2x bigger than
>> >> that
>> >> - of an epilogue. */
>> >> - if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo))
>> >> - old_factor -= 1;
>> >> - if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo))
>> >> - new_factor -= 1;
>> >> - }
>> >> -
>> >> - /* Compute the costs by multiplying the inside costs with the
>> >> factor and
>> >> - add the outside costs for a more complete picture. The factor
>> >> is the
>> >> - amount of times we are expecting to iterate this epilogue. */
>> >> - old_cost = old_loop_vinfo->vector_costs->body_cost () * old_factor;
>> >> - new_cost = new_loop_vinfo->vector_costs->body_cost () * new_factor;
>> >> - old_cost += old_loop_vinfo->vector_costs->outside_cost ();
>> >> - new_cost += new_loop_vinfo->vector_costs->outside_cost ();
>> >> - return new_cost < old_cost;
>> >> - }
>> >> -
>> >> - /* Limit the VFs to what is likely to be the maximum number of
>> >> iterations,
>> >> - to handle cases in which at least one loop_vinfo is fully-masked.
>> >> */
>> >> - HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int
>> >> (loop);
>> >> - if (estimated_max_niter != -1)
>> >> - {
>> >> - if (known_le (estimated_max_niter, new_vf))
>> >> - new_vf = estimated_max_niter;
>> >> - if (known_le (estimated_max_niter, old_vf))
>> >> - old_vf = estimated_max_niter;
>> >> - }
>> >> -
>> >> - /* Check whether the (fractional) cost per scalar iteration is lower
>> >> - or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.
>> >> */
>> >> - poly_int64 rel_new = new_loop_vinfo->vector_costs->body_cost () *
>> >> old_vf;
>> >> - poly_int64 rel_old = old_loop_vinfo->vector_costs->body_cost () *
>> >> new_vf;
>> >> -
>> >> - HOST_WIDE_INT est_rel_new_min
>> >> - = estimated_poly_value (rel_new, POLY_VALUE_MIN);
>> >> - HOST_WIDE_INT est_rel_new_max
>> >> - = estimated_poly_value (rel_new, POLY_VALUE_MAX);
>> >> -
>> >> - HOST_WIDE_INT est_rel_old_min
>> >> - = estimated_poly_value (rel_old, POLY_VALUE_MIN);
>> >> - HOST_WIDE_INT est_rel_old_max
>> >> - = estimated_poly_value (rel_old, POLY_VALUE_MAX);
>> >> -
>> >> - /* Check first if we can make out an unambigous total order from the
>> >> minimum
>> >> - and maximum estimates. */
>> >> - if (est_rel_new_min < est_rel_old_min
>> >> - && est_rel_new_max < est_rel_old_max)
>> >> - return true;
>> >> - else if (est_rel_old_min < est_rel_new_min
>> >> - && est_rel_old_max < est_rel_new_max)
>> >> - return false;
>> >> - /* When old_loop_vinfo uses a variable vectorization factor,
>> >> - we know that it has a lower cost for at least one runtime VF.
>> >> - However, we don't know how likely that VF is.
>> >> -
>> >> - One option would be to compare the costs for the estimated VFs.
>> >> - The problem is that that can put too much pressure on the cost
>> >> - model. E.g. if the estimated VF is also the lowest possible VF,
>> >> - and if old_loop_vinfo is 1 unit worse than new_loop_vinfo
>> >> - for the estimated VF, we'd then choose new_loop_vinfo even
>> >> - though (a) new_loop_vinfo might not actually be better than
>> >> - old_loop_vinfo for that VF and (b) it would be significantly
>> >> - worse at larger VFs.
>> >> -
>> >> - Here we go for a hacky compromise: pick new_loop_vinfo if it is
>> >> - no more expensive than old_loop_vinfo even after doubling the
>> >> - estimated old_loop_vinfo VF. For all but trivial loops, this
>> >> - ensures that we only pick new_loop_vinfo if it is significantly
>> >> - better than old_loop_vinfo at the estimated VF. */
>> >> -
>> >> - if (est_rel_old_min != est_rel_new_min
>> >> - || est_rel_old_max != est_rel_new_max)
>> >> - {
>> >> - HOST_WIDE_INT est_rel_new_likely
>> >> - = estimated_poly_value (rel_new, POLY_VALUE_LIKELY);
>> >> - HOST_WIDE_INT est_rel_old_likely
>> >> - = estimated_poly_value (rel_old, POLY_VALUE_LIKELY);
>> >> -
>> >> - return est_rel_new_likely * 2 <= est_rel_old_likely;
>> >> - }
>> >> -
>> >> - /* If there's nothing to choose between the loop bodies, see whether
>> >> - there's a difference in the prologue and epilogue costs. */
>> >> - auto old_outside_cost = old_loop_vinfo->vector_costs->outside_cost ();
>> >> - auto new_outside_cost = new_loop_vinfo->vector_costs->outside_cost ();
>> >> - if (new_outside_cost != old_outside_cost)
>> >> - return new_outside_cost < old_outside_cost;
>> >> + const auto *old_costs = old_loop_vinfo->vector_costs;
>> >> + const auto *new_costs = new_loop_vinfo->vector_costs;
>> >> + if (loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO
>> >> (old_loop_vinfo))
>> >> + return new_costs->better_epilogue_loop_than_p (old_costs, main_loop);
>> >>
>> >> - return false;
>> >> + return new_costs->better_main_loop_than_p (old_costs);
>> >> }
>> >>
>> >> /* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO. Return
>> >> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
>> >> index 9ef76ce654b..dcbb2a3f13a 100644
>> >> --- a/gcc/tree-vectorizer.c
>> >> +++ b/gcc/tree-vectorizer.c
>> >> @@ -1744,3 +1744,207 @@ vector_costs::adjust_cost_for_freq (stmt_vec_info
>> >> stmt_info,
>> >> }
>> >> return cost;
>> >> }
>> >> +
>> >> +/* See the comment above the declaration for details. */
>> >> +
>> >> +bool
>> >> +vector_costs::better_main_loop_than_p (const vector_costs *other) const
>> >> +{
>> >> + int diff = compare_inside_loop_cost (other);
>> >> + if (diff != 0)
>> >> + return diff < 0;
>> >> +
>> >> + /* If there's nothing to choose between the loop bodies, see whether
>> >> + there's a difference in the prologue and epilogue costs. */
>> >> + diff = compare_outside_loop_cost (other);
>> >> + if (diff != 0)
>> >> + return diff < 0;
>> >> +
>> >> + return false;
>> >> +}
>> >> +
>> >> +
>> >> +/* See the comment above the declaration for details. */
>> >> +
>> >> +bool
>> >> +vector_costs::better_epilogue_loop_than_p (const vector_costs *other,
>> >> + loop_vec_info main_loop) const
>> >> +{
>> >> + loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
>> >> + loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
>> >> +
>> >> + poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
>> >> + poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
>> >> +
>> >> + poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
>> >> + unsigned HOST_WIDE_INT main_vf;
>> >> + unsigned HOST_WIDE_INT other_factor, this_factor, other_cost,
>> >> this_cost;
>> >> + /* If we can determine how many iterations are left for the epilogue
>> >> + loop, that is if both the main loop's vectorization factor and
>> >> number
>> >> + of iterations are constant, then we use them to calculate the cost
>> >> of
>> >> + the epilogue loop together with a 'likely value' for the epilogues
>> >> + vectorization factor. Otherwise we use the main loop's
>> >> vectorization
>> >> + factor and the maximum poly value for the epilogue's. If the target
>> >> + has not provided with a sensible upper bound poly vectorization
>> >> + factors are likely to be favored over constant ones. */
>> >> + if (main_poly_vf.is_constant (&main_vf)
>> >> + && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
>> >> + {
>> >> + unsigned HOST_WIDE_INT niters
>> >> + = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
>> >> + HOST_WIDE_INT other_likely_vf
>> >> + = estimated_poly_value (other_vf, POLY_VALUE_LIKELY);
>> >> + HOST_WIDE_INT this_likely_vf
>> >> + = estimated_poly_value (this_vf, POLY_VALUE_LIKELY);
>> >> +
>> >> + /* If the epilogue is using partial vectors we account for the
>> >> + partial iteration here too. */
>> >> + other_factor = niters / other_likely_vf;
>> >> + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo)
>> >> + && niters % other_likely_vf != 0)
>> >> + other_factor++;
>> >> +
>> >> + this_factor = niters / this_likely_vf;
>> >> + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo)
>> >> + && niters % this_likely_vf != 0)
>> >> + this_factor++;
>> >> + }
>> >> + else
>> >> + {
>> >> + unsigned HOST_WIDE_INT main_vf_max
>> >> + = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
>> >> +
>> >> + other_factor = main_vf_max / estimated_poly_value (other_vf,
>> >> + POLY_VALUE_MAX);
>> >> + this_factor = main_vf_max / estimated_poly_value (this_vf,
>> >> + POLY_VALUE_MAX);
>> >> +
>> >> + /* If the loop is not using partial vectors then it will iterate
>> >> one
>> >> + time less than one that does. It is safe to subtract one here,
>> >> + because the main loop's vf is always at least 2x bigger than that
>> >> + of an epilogue. */
>> >> + if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo))
>> >> + other_factor -= 1;
>> >> + if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo))
>> >> + this_factor -= 1;
>> >> + }
>> >> +
>> >> + /* Compute the costs by multiplying the inside costs with the factor
>> >> and
>> >> + add the outside costs for a more complete picture. The factor is
>> >> the
>> >> + amount of times we are expecting to iterate this epilogue. */
>> >> + other_cost = other->body_cost () * other_factor;
>> >> + this_cost = this->body_cost () * this_factor;
>> >> + other_cost += other->outside_cost ();
>> >> + this_cost += this->outside_cost ();
>> >> + return this_cost < other_cost;
>> >> +}
>> >> +
>> >> +/* A <=>-style subroutine of better_main_loop_than_p. Check whether we
>> >> can
>> >> + determine the return value of better_main_loop_than_p by comparing the
>> >> + inside (loop body) costs of THIS and OTHER. Return:
>> >> +
>> >> + * -1 if better_main_loop_than_p should return true.
>> >> + * 1 if better_main_loop_than_p should return false.
>> >> + * 0 if we can't decide. */
>> >> +
>> >> +int
>> >> +vector_costs::compare_inside_loop_cost (const vector_costs *other) const
>> >> +{
>> >> + loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
>> >> + loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
>> >> +
>> >> + struct loop *loop = LOOP_VINFO_LOOP (this_loop_vinfo);
>> >> + gcc_assert (LOOP_VINFO_LOOP (other_loop_vinfo) == loop);
>> >> +
>> >> + poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
>> >> + poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
>> >> +
>> >> + /* Limit the VFs to what is likely to be the maximum number of
>> >> iterations,
>> >> + to handle cases in which at least one loop_vinfo is fully-masked.
>> >> */
>> >> + HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int
>> >> (loop);
>> >> + if (estimated_max_niter != -1)
>> >> + {
>> >> + if (known_le (estimated_max_niter, this_vf))
>> >> + this_vf = estimated_max_niter;
>> >> + if (known_le (estimated_max_niter, other_vf))
>> >> + other_vf = estimated_max_niter;
>> >> + }
>> >> +
>> >> + /* Check whether the (fractional) cost per scalar iteration is lower or
>> >> + higher: this_inside_cost / this_vf vs. other_inside_cost /
>> >> other_vf. */
>> >> + poly_int64 rel_this = this_loop_vinfo->vector_costs->body_cost () *
>> >> other_vf;
>> >> + poly_int64 rel_other
>> >> + = other_loop_vinfo->vector_costs->body_cost () * this_vf;
>> >> +
>> >> + HOST_WIDE_INT est_rel_this_min
>> >> + = estimated_poly_value (rel_this, POLY_VALUE_MIN);
>> >> + HOST_WIDE_INT est_rel_this_max
>> >> + = estimated_poly_value (rel_this, POLY_VALUE_MAX);
>> >> +
>> >> + HOST_WIDE_INT est_rel_other_min
>> >> + = estimated_poly_value (rel_other, POLY_VALUE_MIN);
>> >> + HOST_WIDE_INT est_rel_other_max
>> >> + = estimated_poly_value (rel_other, POLY_VALUE_MAX);
>> >> +
>> >> + /* Check first if we can make out an unambigous total order from the
>> >> minimum
>> >> + and maximum estimates. */
>> >> + if (est_rel_this_min < est_rel_other_min
>> >> + && est_rel_this_max < est_rel_other_max)
>> >> + return -1;
>> >> +
>> >> + if (est_rel_other_min < est_rel_this_min
>> >> + && est_rel_other_max < est_rel_this_max)
>> >> + return 1;
>> >> +
>> >> + /* When other_loop_vinfo uses a variable vectorization factor,
>> >> + we know that it has a lower cost for at least one runtime VF.
>> >> + However, we don't know how likely that VF is.
>> >> +
>> >> + One option would be to compare the costs for the estimated VFs.
>> >> + The problem is that that can put too much pressure on the cost
>> >> + model. E.g. if the estimated VF is also the lowest possible VF,
>> >> + and if other_loop_vinfo is 1 unit worse than this_loop_vinfo
>> >> + for the estimated VF, we'd then choose this_loop_vinfo even
>> >> + though (a) this_loop_vinfo might not actually be better than
>> >> + other_loop_vinfo for that VF and (b) it would be significantly
>> >> + worse at larger VFs.
>> >> +
>> >> + Here we go for a hacky compromise: pick this_loop_vinfo if it is
>> >> + no more expensive than other_loop_vinfo even after doubling the
>> >> + estimated other_loop_vinfo VF. For all but trivial loops, this
>> >> + ensures that we only pick this_loop_vinfo if it is significantly
>> >> + better than other_loop_vinfo at the estimated VF. */
>> >> + if (est_rel_other_min != est_rel_this_min
>> >> + || est_rel_other_max != est_rel_this_max)
>> >> + {
>> >> + HOST_WIDE_INT est_rel_this_likely
>> >> + = estimated_poly_value (rel_this, POLY_VALUE_LIKELY);
>> >> + HOST_WIDE_INT est_rel_other_likely
>> >> + = estimated_poly_value (rel_other, POLY_VALUE_LIKELY);
>> >> +
>> >> + return est_rel_this_likely * 2 <= est_rel_other_likely ? -1 : 1;
>> >> + }
>> >> +
>> >> + return 0;
>> >> +}
>> >> +
>> >> +/* A <=>-style subroutine of better_main_loop_than_p, used when there is
>> >> + nothing to choose between the inside (loop body) costs of THIS and
>> >> OTHER.
>> >> + Check whether we can determine the return value of
>> >> better_main_loop_than_p
>> >> + by comparing the outside (prologue and epilogue) costs of THIS and
>> >> OTHER.
>> >> + Return:
>> >> +
>> >> + * -1 if better_main_loop_than_p should return true.
>> >> + * 1 if better_main_loop_than_p should return false.
>> >> + * 0 if we can't decide. */
>> >> +
>> >> +int
>> >> +vector_costs::compare_outside_loop_cost (const vector_costs *other) const
>> >> +{
>> >> + auto this_outside_cost = this->outside_cost ();
>> >> + auto other_outside_cost = other->outside_cost ();
>> >> + if (this_outside_cost != other_outside_cost)
>> >> + return this_outside_cost < other_outside_cost ? -1 : 1;
>> >> +
>> >> + return 0;
>> >> +}
>> >> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>> >> index 87d3f211a2e..0e3aad590e8 100644
>> >> --- a/gcc/tree-vectorizer.h
>> >> +++ b/gcc/tree-vectorizer.h
>> >> @@ -1419,6 +1419,21 @@ public:
>> >> read back using the functions below. */
>> >> virtual void finish_cost ();
>> >>
>> >> + /* The costs in THIS and OTHER both describe ways of vectorizing
>> >> + a main loop. Return true if the costs described by THIS are
>> >> + cheaper than the costs described by OTHER. Return false if any
>> >> + of the following are true:
>> >> +
>> >> + - THIS and OTHER are of equal cost
>> >> + - OTHER is better than THIS
>> >> + - we can't be sure about the relative costs of THIS and OTHER. */
>> >> + virtual bool better_main_loop_than_p (const vector_costs *other) const;
>> >> +
>> >> + /* Likewise, but the costs in THIS and OTHER both describe ways of
>> >> + vectorizing an epilogue loop of MAIN_LOOP. */
>> >> + virtual bool better_epilogue_loop_than_p (const vector_costs *other,
>> >> + loop_vec_info main_loop)
>> >> const;
>> >> +
>> >> unsigned int prologue_cost () const;
>> >> unsigned int body_cost () const;
>> >> unsigned int epilogue_cost () const;
>> >> @@ -1429,6 +1444,8 @@ protected:
>> >> unsigned int);
>> >> unsigned int adjust_cost_for_freq (stmt_vec_info,
>> >> vect_cost_model_location,
>> >> unsigned int);
>> >> + int compare_inside_loop_cost (const vector_costs *) const;
>> >> + int compare_outside_loop_cost (const vector_costs *) const;
>> >>
>> >> /* The region of code that we're considering vectorizing. */
>> >> vec_info *m_vinfo;
>> >> --
>> >> 2.25.1
>> >>