> On 23 Apr 2025, at 08:37, Tamar Christina <tamar.christ...@arm.com> wrote:
>
> Hi All,
>
> This patch proposes a new vector cost model called "max". The cost model is an
> intersection between two of our existing cost models. Like `unlimited` it
> disables the costing vs scalar and assumes all vectorization to be profitable.
>
> But unlike unlimited it does not fully disable the vector cost model. That
> means that we still perform comparisons between vector modes.
>
> As an example, the following:
>
> void
> foo (char *restrict a, int *restrict b, int *restrict c,
>      int *restrict d, int stride)
> {
>   if (stride <= 1)
>     return;
>
>   for (int i = 0; i < 3; i++)
>     {
>       int res = c[i];
>       int t = b[i * stride];
>       if (a[i] != 0)
>         res = t * d[i];
>       c[i] = res;
>     }
> }
>
> compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
> vectorize as it assumes scalar would be faster, and with
> -fvect-cost-model=unlimited it picks a vector type that's so big that the large
> sequence generated is working on mostly inactive lanes:
>
> ...
> and p3.b, p3/z, p4.b, p4.b
> whilelo p0.s, wzr, w7
> ld1w z23.s, p3/z, [x3, #3, mul vl]
> ld1w z28.s, p0/z, [x5, z31.s, sxtw 2]
> add x0, x5, x0
> punpklo p6.h, p6.b
> ld1w z27.s, p4/z, [x0, z31.s, sxtw 2]
> and p6.b, p6/z, p0.b, p0.b
> punpklo p4.h, p7.b
> ld1w z24.s, p6/z, [x3, #2, mul vl]
> and p4.b, p4/z, p2.b, p2.b
> uqdecw w6
> ld1w z26.s, p4/z, [x3]
> whilelo p1.s, wzr, w6
> mul z27.s, p5/m, z27.s, z23.s
> ld1w z29.s, p1/z, [x4, z31.s, sxtw 2]
> punpkhi p7.h, p7.b
> mul z24.s, p5/m, z24.s, z28.s
> and p7.b, p7/z, p1.b, p1.b
> mul z26.s, p5/m, z26.s, z30.s
> ld1w z25.s, p7/z, [x3, #1, mul vl]
> st1w z27.s, p3, [x2, #3, mul vl]
> mul z25.s, p5/m, z25.s, z29.s
> st1w z24.s, p6, [x2, #2, mul vl]
> st1w z25.s, p7, [x2, #1, mul vl]
> st1w z26.s, p4, [x2]
> ...
>
> With -fvect-cost-model=max you get more reasonable code:
>
> foo:
> cmp w4, 1
> ble .L1
> ptrue p7.s, vl3
> index z0.s, #0, w4
> ld1b z29.s, p7/z, [x0]
> ld1w z30.s, p7/z, [x1, z0.s, sxtw 2]
> ptrue p6.b, all
> cmpne p7.b, p7/z, z29.b, #0
> ld1w z31.s, p7/z, [x3]
> mul z31.s, p6/m, z31.s, z30.s
> st1w z31.s, p7, [x2]
> .L1:
> ret
>
> This model has been useful internally for performance exploration and cost-model
> validation. It allows us to force realistic vectorization, overriding the cost
> model, to be able to tell whether it's correct wrt profitability.
Thanks for this, it looks really useful.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu,
> arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> -m32, -m64 with no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> * common.opt (vect-cost-model, simd-cost-model): Add max cost model.
> * doc/invoke.texi: Document it.
> * flag-types.h (enum vect_cost_model): Add VECT_COST_MODEL_MAX.
> * tree-vect-data-refs.cc (vect_peeling_hash_insert,
> vect_peeling_hash_choose_best_peeling,
> vect_enhance_data_refs_alignment): Use it.
> * tree-vect-loop.cc (vect_analyze_loop_costing,
> vect_estimate_min_profitable_iters): Likewise.
>
> ---
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 88d987e6ab14d9f8df7aa686efffc43418dbb42d..bd5e2e951f9388b12206d9addc736e336cd0e4ee 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -3442,11 +3442,11 @@ Enable basic block vectorization (SLP) on trees.
>
> fvect-cost-model=
> Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) Init(VECT_COST_MODEL_DEFAULT) Optimization
> --fvect-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the cost model for vectorization.
> +-fvect-cost-model=[unlimited|max|dynamic|cheap|very-cheap] Specifies the cost model for vectorization.
>
> fsimd-cost-model=
> Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) Init(VECT_COST_MODEL_UNLIMITED) Optimization
> --fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the vectorization cost model for code marked with a simd directive.
> +-fsimd-cost-model=[unlimited|max|dynamic|cheap|very-cheap] Specifies the vectorization cost model for code marked with a simd directive.
>
> Enum
> Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer cost model %qs)
> @@ -3454,6 +3454,9 @@ Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer
> EnumValue
> Enum(vect_cost_model) String(unlimited) Value(VECT_COST_MODEL_UNLIMITED)
>
> +EnumValue
> +Enum(vect_cost_model) String(max) Value(VECT_COST_MODEL_MAX)
> +
> EnumValue
> Enum(vect_cost_model) String(dynamic) Value(VECT_COST_MODEL_DYNAMIC)
>
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 14a78fd236f64185fc129f18b52b20692d49305c..e7b242c9134ff17022c92f81c8b24762cfd59c6c 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -14449,9 +14449,11 @@ With the @samp{unlimited} model the vectorized code-path is assumed
> to be profitable while with the @samp{dynamic} model a runtime check
> guards the vectorized code-path to enable it only for iteration
> counts that will likely execute faster than when executing the original
> -scalar loop. The @samp{cheap} model disables vectorization of
> -loops where doing so would be cost prohibitive for example due to
> -required runtime checks for data dependence or alignment but otherwise
> +scalar loop. The @samp{max} model similarly to the @samp{unlimited} model
> +assumes all vector code is profitable over scalar within loops but does not
> +disable the vector to vector costing. The @samp{cheap} model disables
> +vectorization of loops where doing so would be cost prohibitive for example due
> +to required runtime checks for data dependence or alignment but otherwise
> is equal to the @samp{dynamic} model. The @samp{very-cheap} model disables
> vectorization of loops when any runtime check for data dependence or alignment
> is required, it also disables vectorization of epilogue loops but otherwise is
> diff --git a/gcc/flag-types.h b/gcc/flag-types.h
> index db573768c23d9f6809ae115e71370960314f16ce..1c941c295a2e608eae58c3e3fb0eba1284f731ca 100644
> --- a/gcc/flag-types.h
> +++ b/gcc/flag-types.h
> @@ -277,9 +277,10 @@ enum scalar_storage_order_kind {
> /* Vectorizer cost-model. Except for DEFAULT, the values are ordered from
> the most conservative to the least conservative. */
> enum vect_cost_model {
> - VECT_COST_MODEL_VERY_CHEAP = -3,
> - VECT_COST_MODEL_CHEAP = -2,
> - VECT_COST_MODEL_DYNAMIC = -1,
> + VECT_COST_MODEL_VERY_CHEAP = -4,
> + VECT_COST_MODEL_CHEAP = -3,
> + VECT_COST_MODEL_DYNAMIC = -2,
> + VECT_COST_MODEL_MAX = -1,
> VECT_COST_MODEL_UNLIMITED = 0,
> VECT_COST_MODEL_DEFAULT = 1
> };
> diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> index c9395e33fcdfc7deedd979c764daae93b15abace..5c56956c2edcb76210c36b60526f031011c8e0c7 100644
> --- a/gcc/tree-vect-data-refs.cc
> +++ b/gcc/tree-vect-data-refs.cc
> @@ -1847,7 +1847,9 @@ vect_peeling_hash_insert (hash_table<peel_info_hasher> *peeling_htab,
> /* If this DR is not supported with unknown misalignment then bias
> this slot when the cost model is disabled. */
> if (!supportable_if_not_aligned
> - && unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> + && (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> + || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> + == VECT_COST_MODEL_MAX))
Nit: would it make sense to have a max_cost_model helper, similar to
unlimited_cost_model above, for consistency?
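
To make the suggestion concrete, here is a rough standalone sketch rather
than a drop-in change. The enum values are copied from the flag-types.h hunk
above; loop_stub and the simplified loop_cost_model are stand-ins invented so
the snippet compiles outside GCC, max_cost_model is the hypothetical helper,
and scalar_costing_skipped_p is a purely illustrative name for the combined
test used in the peeling hunks. The body of unlimited_cost_model is assumed
to be a plain comparison against VECT_COST_MODEL_UNLIMITED.

```cpp
// Standalone sketch, not GCC code.  The enum mirrors the patched
// flag-types.h; everything operating on loop_stub is a simplified stand-in.
enum vect_cost_model {
  VECT_COST_MODEL_VERY_CHEAP = -4,
  VECT_COST_MODEL_CHEAP = -3,
  VECT_COST_MODEL_DYNAMIC = -2,
  VECT_COST_MODEL_MAX = -1,
  VECT_COST_MODEL_UNLIMITED = 0,
  VECT_COST_MODEL_DEFAULT = 1
};

// Stand-in for GCC's loop structure; it only carries the selected model.
struct loop_stub { vect_cost_model model; };

// Simplified stand-in: the real function also consults per-loop and
// command-line state, which is omitted here.
static inline vect_cost_model
loop_cost_model (const loop_stub *l)
{
  return l->model;
}

// Assumed shape of the existing helper.
static inline bool
unlimited_cost_model (const loop_stub *l)
{
  return loop_cost_model (l) == VECT_COST_MODEL_UNLIMITED;
}

// Hypothetical helper suggested in this review.
static inline bool
max_cost_model (const loop_stub *l)
{
  return loop_cost_model (l) == VECT_COST_MODEL_MAX;
}

// Illustrative only: the combined test used by the peeling hunks, where
// both models treat vector code as profitable over scalar.
static inline bool
scalar_costing_skipped_p (const loop_stub *l)
{
  return unlimited_cost_model (l) || max_cost_model (l);
}
```

With something like this, the open-coded comparisons against
VECT_COST_MODEL_MAX in the hunks could read max_cost_model (loop) instead.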
Thanks,
Kyrill
> slot->count += VECT_MAX_COST;
> }
>
> @@ -2002,7 +2004,8 @@ vect_peeling_hash_choose_best_peeling (hash_table<peel_info_hasher> *peeling_hta
> res.peel_info.dr_info = NULL;
> res.vinfo = loop_vinfo;
>
> - if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> + if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> + && loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) != VECT_COST_MODEL_MAX)
> {
> res.inside_cost = INT_MAX;
> res.outside_cost = INT_MAX;
> @@ -2348,7 +2351,8 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
> We do this automatically for cost model, since we calculate
> cost for every peeling option. */
> poly_uint64 nscalars = npeel_tmp;
> - if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> + if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> + || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) == VECT_COST_MODEL_MAX)
> {
> poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> unsigned group_size = 1;
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 958b829fa8d1ad267fbde3be915719f3a51e6a38..5f3adc257f6581850f901c7747771f5931df942a 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -2407,7 +2407,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> &min_profitable_estimate,
> suggested_unroll_factor);
>
> - if (min_profitable_iters < 0)
> + if (min_profitable_iters < 0
> + && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
> {
> if (dump_enabled_p ())
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -2430,7 +2431,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = th;
>
> if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> - && LOOP_VINFO_INT_NITERS (loop_vinfo) < th)
> + && LOOP_VINFO_INT_NITERS (loop_vinfo) < th
> + && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
> {
> if (dump_enabled_p ())
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -2490,6 +2492,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> estimated_niter = likely_max_stmt_executions_int (loop);
> }
> if (estimated_niter != -1
> + && loop_cost_model (loop) != VECT_COST_MODEL_MAX
> && ((unsigned HOST_WIDE_INT) estimated_niter
> < MAX (th, (unsigned) min_profitable_estimate)))
> {
> @@ -4638,7 +4641,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> vector_costs *target_cost_data = loop_vinfo->vector_costs;
>
> /* Cost model disabled. */
> - if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> + if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> {
> if (dump_enabled_p ())
> dump_printf_loc (MSG_NOTE, vect_location, "cost model disabled.\n");
>
>
> --
> <rb19390.patch>