> On 23 Apr 2025, at 08:37, Tamar Christina <tamar.christ...@arm.com> wrote:
> 
> Hi All,
> 
> This patch proposes a new vector cost model called "max".  The cost model is an
> intersection between two of our existing cost models.  Like `unlimited` it
> disables the costing vs scalar and assumes all vectorization to be profitable.
> 
> But unlike unlimited it does not fully disable the vector cost model.  That
> means that we still perform comparisons between vector modes.
> 
> As an example, the following:
> 
> void
> foo (char *restrict a, int *restrict b, int *restrict c,
>     int *restrict d, int stride)
> {
>    if (stride <= 1)
>        return;
> 
>    for (int i = 0; i < 3; i++)
>        {
>            int res = c[i];
>            int t = b[i * stride];
>            if (a[i] != 0)
>                res = t * d[i];
>            c[i] = res;
>        }
> }
> 
> compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
> vectorize as it assumes scalar would be faster, and with
> -fvect-cost-model=unlimited it picks a vector type that's so big that the
> large sequence generated is working on mostly inactive lanes:
> 
>        ...
>        and     p3.b, p3/z, p4.b, p4.b
>        whilelo p0.s, wzr, w7
>        ld1w    z23.s, p3/z, [x3, #3, mul vl]
>        ld1w    z28.s, p0/z, [x5, z31.s, sxtw 2]
>        add     x0, x5, x0
>        punpklo p6.h, p6.b
>        ld1w    z27.s, p4/z, [x0, z31.s, sxtw 2]
>        and     p6.b, p6/z, p0.b, p0.b
>        punpklo p4.h, p7.b
>        ld1w    z24.s, p6/z, [x3, #2, mul vl]
>        and     p4.b, p4/z, p2.b, p2.b
>        uqdecw  w6
>        ld1w    z26.s, p4/z, [x3]
>        whilelo p1.s, wzr, w6
>        mul     z27.s, p5/m, z27.s, z23.s
>        ld1w    z29.s, p1/z, [x4, z31.s, sxtw 2]
>        punpkhi p7.h, p7.b
>        mul     z24.s, p5/m, z24.s, z28.s
>        and     p7.b, p7/z, p1.b, p1.b
>        mul     z26.s, p5/m, z26.s, z30.s
>        ld1w    z25.s, p7/z, [x3, #1, mul vl]
>        st1w    z27.s, p3, [x2, #3, mul vl]
>        mul     z25.s, p5/m, z25.s, z29.s
>        st1w    z24.s, p6, [x2, #2, mul vl]
>        st1w    z25.s, p7, [x2, #1, mul vl]
>        st1w    z26.s, p4, [x2]
>        ...
> 
> With -fvect-cost-model=max you get more reasonable code:
> 
> foo:
>        cmp     w4, 1
>        ble     .L1
>        ptrue   p7.s, vl3
>        index   z0.s, #0, w4
>        ld1b    z29.s, p7/z, [x0]
>        ld1w    z30.s, p7/z, [x1, z0.s, sxtw 2]
>        ptrue   p6.b, all
>        cmpne   p7.b, p7/z, z29.b, #0
>        ld1w    z31.s, p7/z, [x3]
>        mul     z31.s, p6/m, z31.s, z30.s
>        st1w    z31.s, p7, [x2]
> .L1:
>        ret
> 
> This model has been useful internally for performance exploration and
> cost-model validation.  It allows us to force realistic vectorization,
> overriding the cost model, to be able to tell whether the model is correct
> wrt profitability.

Thanks for this, it looks really useful.


> 
> Bootstrapped and regtested on aarch64-none-linux-gnu,
> arm-none-linux-gnueabihf and x86_64-pc-linux-gnu (-m32 and -m64)
> with no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> * common.opt (vect-cost-model, simd-cost-model): Add max cost model.
> * doc/invoke.texi: Document it.
> * flag-types.h (enum vect_cost_model): Add VECT_COST_MODEL_MAX.
> * tree-vect-data-refs.cc (vect_peeling_hash_insert,
> vect_peeling_hash_choose_best_peeling,
> vect_enhance_data_refs_alignment): Use it.
> * tree-vect-loop.cc (vect_analyze_loop_costing,
> vect_estimate_min_profitable_iters): Likewise.
> 
> ---
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 88d987e6ab14d9f8df7aa686efffc43418dbb42d..bd5e2e951f9388b12206d9addc736e336cd0e4ee 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -3442,11 +3442,11 @@ Enable basic block vectorization (SLP) on trees.
> 
> fvect-cost-model=
> Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) Init(VECT_COST_MODEL_DEFAULT) Optimization
> --fvect-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the cost model for vectorization.
> +-fvect-cost-model=[unlimited|max|dynamic|cheap|very-cheap] Specifies the cost model for vectorization.
> 
> fsimd-cost-model=
> Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) Init(VECT_COST_MODEL_UNLIMITED) Optimization
> --fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the vectorization cost model for code marked with a simd directive.
> +-fsimd-cost-model=[unlimited|max|dynamic|cheap|very-cheap] Specifies the vectorization cost model for code marked with a simd directive.
> 
> Enum
> Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer cost model %qs)
> @@ -3454,6 +3454,9 @@ Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer
> EnumValue
> Enum(vect_cost_model) String(unlimited) Value(VECT_COST_MODEL_UNLIMITED)
> 
> +EnumValue
> +Enum(vect_cost_model) String(max) Value(VECT_COST_MODEL_MAX)
> +
> EnumValue
> Enum(vect_cost_model) String(dynamic) Value(VECT_COST_MODEL_DYNAMIC)
> 
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 14a78fd236f64185fc129f18b52b20692d49305c..e7b242c9134ff17022c92f81c8b24762cfd59c6c 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -14449,9 +14449,11 @@ With the @samp{unlimited} model the vectorized code-path is assumed
> to be profitable while with the @samp{dynamic} model a runtime check
> guards the vectorized code-path to enable it only for iteration
> counts that will likely execute faster than when executing the original
> -scalar loop.  The @samp{cheap} model disables vectorization of
> -loops where doing so would be cost prohibitive for example due to
> -required runtime checks for data dependence or alignment but otherwise
> +scalar loop.  The @samp{max} model similarly to the @samp{unlimited} model
> +assumes all vector code is profitable over scalar within loops but does not
> +disable the vector to vector costing.  The @samp{cheap} model disables
> +vectorization of loops where doing so would be cost prohibitive for example due
> +to required runtime checks for data dependence or alignment but otherwise
> is equal to the @samp{dynamic} model.  The @samp{very-cheap} model disables
> vectorization of loops when any runtime check for data dependence or alignment
> is required, it also disables vectorization of epilogue loops but otherwise is
> diff --git a/gcc/flag-types.h b/gcc/flag-types.h
> index db573768c23d9f6809ae115e71370960314f16ce..1c941c295a2e608eae58c3e3fb0eba1284f731ca 100644
> --- a/gcc/flag-types.h
> +++ b/gcc/flag-types.h
> @@ -277,9 +277,10 @@ enum scalar_storage_order_kind {
> /* Vectorizer cost-model.  Except for DEFAULT, the values are ordered from
>    the most conservative to the least conservative.  */
> enum vect_cost_model {
> -  VECT_COST_MODEL_VERY_CHEAP = -3,
> -  VECT_COST_MODEL_CHEAP = -2,
> -  VECT_COST_MODEL_DYNAMIC = -1,
> +  VECT_COST_MODEL_VERY_CHEAP = -4,
> +  VECT_COST_MODEL_CHEAP = -3,
> +  VECT_COST_MODEL_DYNAMIC = -2,
> +  VECT_COST_MODEL_MAX = -1,
>   VECT_COST_MODEL_UNLIMITED = 0,
>   VECT_COST_MODEL_DEFAULT = 1
> };
> diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> index c9395e33fcdfc7deedd979c764daae93b15abace..5c56956c2edcb76210c36b60526f031011c8e0c7 100644
> --- a/gcc/tree-vect-data-refs.cc
> +++ b/gcc/tree-vect-data-refs.cc
> @@ -1847,7 +1847,9 @@ vect_peeling_hash_insert (hash_table<peel_info_hasher> *peeling_htab,
>   /* If this DR is not supported with unknown misalignment then bias
>      this slot when the cost model is disabled.  */
>   if (!supportable_if_not_aligned
> -      && unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> +      && (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> +  || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> + == VECT_COST_MODEL_MAX))

Nit: would it make sense to have a max_cost_model helper, similar to
unlimited_cost_model above, for consistency?
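
To make the suggestion concrete, here is a rough, standalone sketch of what
such a helper could look like.  It is not taken from the patch: `mock_loop`
and the simplified `loop_cost_model` are stand-ins so the snippet compiles
outside of GCC (the real `loop_cost_model` works on GCC's loop type and also
consults the simd cost-model flag), and `max_cost_model` and
`vector_assumed_profitable` are hypothetical names.  Only the enum values are
copied from the patched flag-types.h.

```cpp
// Enum values as they appear in the patched flag-types.h.
enum vect_cost_model {
  VECT_COST_MODEL_VERY_CHEAP = -4,
  VECT_COST_MODEL_CHEAP = -3,
  VECT_COST_MODEL_DYNAMIC = -2,
  VECT_COST_MODEL_MAX = -1,
  VECT_COST_MODEL_UNLIMITED = 0,
  VECT_COST_MODEL_DEFAULT = 1
};

// Assumption: a minimal stand-in for GCC's loop structure, carrying only
// the cost model that applies to the loop.
struct mock_loop {
  vect_cost_model model;
};

// Assumption: simplified stand-in for GCC's loop_cost_model.
static inline vect_cost_model
loop_cost_model (const mock_loop *l)
{
  return l->model;
}

// Same shape as the existing helper referenced in the patch.
static inline bool
unlimited_cost_model (const mock_loop *l)
{
  return loop_cost_model (l) == VECT_COST_MODEL_UNLIMITED;
}

// Hypothetical helper: true when the loop uses the new "max" model.
static inline bool
max_cost_model (const mock_loop *l)
{
  return loop_cost_model (l) == VECT_COST_MODEL_MAX;
}

// Hypothetical convenience predicate covering the condition that the
// tree-vect-data-refs.cc hunks spell out by hand: vector code is assumed
// profitable over scalar under both "unlimited" and "max".
static inline bool
vector_assumed_profitable (const mock_loop *l)
{
  return unlimited_cost_model (l) || max_cost_model (l);
}
```

With something along these lines the condition in this hunk would read
`unlimited_cost_model (...) || max_cost_model (...)`, and the `!=
VECT_COST_MODEL_MAX` checks in tree-vect-loop.cc could become
`!max_cost_model (loop)`.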

Thanks,
Kyrill


>     slot->count += VECT_MAX_COST;
> }
> 
> @@ -2002,7 +2004,8 @@ vect_peeling_hash_choose_best_peeling (hash_table<peel_info_hasher> *peeling_hta
>    res.peel_info.dr_info = NULL;
>    res.vinfo = loop_vinfo;
> 
> -   if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> +   if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> + && loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) != VECT_COST_MODEL_MAX)
>      {
>        res.inside_cost = INT_MAX;
>        res.outside_cost = INT_MAX;
> @@ -2348,7 +2351,8 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
>                  We do this automatically for cost model, since we calculate
> cost for every peeling option.  */
>      poly_uint64 nscalars = npeel_tmp;
> -              if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> +              if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> +  || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) == VECT_COST_MODEL_MAX)
> {
>  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>  unsigned group_size = 1;
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 958b829fa8d1ad267fbde3be915719f3a51e6a38..5f3adc257f6581850f901c7747771f5931df942a 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -2407,7 +2407,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
>      &min_profitable_estimate,
>      suggested_unroll_factor);
> 
> -  if (min_profitable_iters < 0)
> +  if (min_profitable_iters < 0
> +      && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
>     {
>       if (dump_enabled_p ())
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -2430,7 +2431,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
>   LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = th;
> 
>   if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> -      && LOOP_VINFO_INT_NITERS (loop_vinfo) < th)
> +      && LOOP_VINFO_INT_NITERS (loop_vinfo) < th
> +      && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
>     {
>       if (dump_enabled_p ())
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -2490,6 +2492,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> estimated_niter = likely_max_stmt_executions_int (loop);
>     }
>   if (estimated_niter != -1
> +      && loop_cost_model (loop) != VECT_COST_MODEL_MAX
>       && ((unsigned HOST_WIDE_INT) estimated_niter
>  < MAX (th, (unsigned) min_profitable_estimate)))
>     {
> @@ -4638,7 +4641,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>   vector_costs *target_cost_data = loop_vinfo->vector_costs;
> 
>   /* Cost model disabled.  */
> -  if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> +   if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
>     {
>       if (dump_enabled_p ())
> dump_printf_loc (MSG_NOTE, vect_location, "cost model disabled.\n");
> 
> 
> -- 
> <rb19390.patch>
