On Wed, 23 Apr 2025, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Wednesday, April 23, 2025 9:46 AM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Sandiford
> > <richard.sandif...@arm.com>
> > Subject: RE: [PATCH]middle-end: Add new "max" vector cost model
> > 
> > On Wed, 23 Apr 2025, Tamar Christina wrote:
> > 
> > > > -----Original Message-----
> > > > From: Richard Biener <rguent...@suse.de>
> > > > Sent: Wednesday, April 23, 2025 9:37 AM
> > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Sandiford
> > > > <richard.sandif...@arm.com>
> > > > Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > > >
> > > > On Wed, 23 Apr 2025, Richard Biener wrote:
> > > >
> > > > > On Wed, 23 Apr 2025, Tamar Christina wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > This patch proposes a new vector cost model called "max".  The cost
> > > > > > model is an intersection between two of our existing cost models.
> > > > > > Like `unlimited` it disables the vector versus scalar costing and
> > > > > > assumes all vectorization to be profitable.
> > > > > >
> > > > > > But unlike `unlimited` it does not fully disable the vector cost
> > > > > > model.  That means that we still perform comparisons between vector
> > > > > > modes.
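A minimal, self-contained C++ sketch of that distinction may help picture it.  Everything here is illustrative: the names (cost_model, pick, the candidate modes) and the costs are made up and only model the description above, not the vectorizer's actual code paths.

// Illustrative model only -- hypothetical names and costs, not GCC code.
#include <cstdio>
#include <initializer_list>
#include <vector>

struct candidate { const char *mode; int vector_cost; };
enum cost_model { DYNAMIC, UNLIMITED, MAX };

// Pick a vectorization candidate under the given cost model, where
// scalar_cost is the estimated cost of the original scalar loop and
// cands lists the enabled vector modes, widest first.
static const candidate *
pick (cost_model model, int scalar_cost, const std::vector<candidate> &cands)
{
  const candidate *best = nullptr;
  for (const candidate &c : cands)
    {
      // 'unlimited' also drops the vector-vs-vector comparison, so the
      // first (widest) enabled candidate wins.
      if (model == UNLIMITED)
        return &c;
      if (!best || c.vector_cost < best->vector_cost)
        best = &c;
    }
  // 'dynamic' rejects the loop when no candidate beats scalar;
  // 'max' skips that rejection but keeps the best vector candidate.
  if (model == DYNAMIC && best && best->vector_cost >= scalar_cost)
    return nullptr;
  return best;
}

int
main ()
{
  std::vector<candidate> cands = { { "VNx4SI (SVE, widest)", 40 },
                                   { "V4SI (Advanced SIMD)", 12 } };
  for (cost_model m : { DYNAMIC, UNLIMITED, MAX })
    {
      const candidate *c = pick (m, /*scalar_cost=*/10, cands);
      std::printf ("%s\n", c ? c->mode : "not vectorized");
    }
}

With these made-up costs, dynamic declines to vectorize, unlimited takes the widest mode, and max keeps the cheaper vector candidate - mirroring the three outcomes in the example below.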
> > > > > >
> > > > > > As an example, the following:
> > > > > >
> > > > > > void
> > > > > > foo (char *restrict a, int *restrict b, int *restrict c,
> > > > > >      int *restrict d, int stride)
> > > > > > {
> > > > > >     if (stride <= 1)
> > > > > >         return;
> > > > > >
> > > > > >     for (int i = 0; i < 3; i++)
> > > > > >         {
> > > > > >             int res = c[i];
> > > > > >             int t = b[i * stride];
> > > > > >             if (a[i] != 0)
> > > > > >                 res = t * d[i];
> > > > > >             c[i] = res;
> > > > > >         }
> > > > > > }
> > > > > >
> > > > > > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails
> > > > > > to vectorize as it assumes scalar would be faster, and with
> > > > > > -fvect-cost-model=unlimited it picks a vector type that's so big that
> > > > > > the large sequence generated is working on mostly inactive lanes:
> > > > > >
> > > > > >         ...
> > > > > >         and     p3.b, p3/z, p4.b, p4.b
> > > > > >         whilelo p0.s, wzr, w7
> > > > > >         ld1w    z23.s, p3/z, [x3, #3, mul vl]
> > > > > >         ld1w    z28.s, p0/z, [x5, z31.s, sxtw 2]
> > > > > >         add     x0, x5, x0
> > > > > >         punpklo p6.h, p6.b
> > > > > >         ld1w    z27.s, p4/z, [x0, z31.s, sxtw 2]
> > > > > >         and     p6.b, p6/z, p0.b, p0.b
> > > > > >         punpklo p4.h, p7.b
> > > > > >         ld1w    z24.s, p6/z, [x3, #2, mul vl]
> > > > > >         and     p4.b, p4/z, p2.b, p2.b
> > > > > >         uqdecw  w6
> > > > > >         ld1w    z26.s, p4/z, [x3]
> > > > > >         whilelo p1.s, wzr, w6
> > > > > >         mul     z27.s, p5/m, z27.s, z23.s
> > > > > >         ld1w    z29.s, p1/z, [x4, z31.s, sxtw 2]
> > > > > >         punpkhi p7.h, p7.b
> > > > > >         mul     z24.s, p5/m, z24.s, z28.s
> > > > > >         and     p7.b, p7/z, p1.b, p1.b
> > > > > >         mul     z26.s, p5/m, z26.s, z30.s
> > > > > >         ld1w    z25.s, p7/z, [x3, #1, mul vl]
> > > > > >         st1w    z27.s, p3, [x2, #3, mul vl]
> > > > > >         mul     z25.s, p5/m, z25.s, z29.s
> > > > > >         st1w    z24.s, p6, [x2, #2, mul vl]
> > > > > >         st1w    z25.s, p7, [x2, #1, mul vl]
> > > > > >         st1w    z26.s, p4, [x2]
> > > > > >         ...
> > > > > >
> > > > > > With -fvect-cost-model=max you get more reasonable code:
> > > > > >
> > > > > > foo:
> > > > > >         cmp     w4, 1
> > > > > >         ble     .L1
> > > > > >         ptrue   p7.s, vl3
> > > > > >         index   z0.s, #0, w4
> > > > > >         ld1b    z29.s, p7/z, [x0]
> > > > > >         ld1w    z30.s, p7/z, [x1, z0.s, sxtw 2]
> > > > > >         ptrue   p6.b, all
> > > > > >         cmpne   p7.b, p7/z, z29.b, #0
> > > > > >         ld1w    z31.s, p7/z, [x3]
> > > > > >         mul     z31.s, p6/m, z31.s, z30.s
> > > > > >         st1w    z31.s, p7, [x2]
> > > > > > .L1:
> > > > > >         ret
> > > > > >
> > > > > > This model has been useful internally for performance exploration and
> > > > > > cost-model validation.  It allows us to force realistic vectorization,
> > > > > > overriding the cost model, so that we can tell whether the cost model
> > > > > > is correct with respect to profitability.
> > > > > >
> > > > > > Bootstrapped and regtested on aarch64-none-linux-gnu,
> > > > > > arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu (-m32 and -m64)
> > > > > > with no issues.
> > > > > >
> > > > > > Ok for master?
> > > > >
> > > > > Hmm.  I don't like another cost model.  Instead how about changing
> > > > > 'unlimited' to still iterate through vector sizes?  Cost modeling
> > > > > is really about vector vs. scalar, not vector vs. vector which is
> > > > > completely under target control.  Targets should provide a way
> > > > > to limit iteration, like aarch64 has with the aarch64-autovec-preference
> > > > > --param or x86 has with -mprefer-vector-width.
> > > > >
> > > > > Of course changing 'unlimited' might result in somewhat of a testsuite
> > > > > churn, but still the fix there would be to inject a proper -mXYZ
> > > > > or --param to get the old behavior back (or even consider cycling
> > > > > through the different aarch64-autovec-preference settings for the
> > > > > testsuite).
> > > >
> > > > Note this will completely remove the ability to reject never-profitable
> > > > vectorizations, so I'm not sure that this is what you'd want in practice.
> > > > You want to fix cost modeling instead.
> > > >
> > > > So why does it consider the scalar code to be faster with =dynamic
> > > > and why do you think that's not possible to fix?  Don't we have
> > > > per-loop #pragma control to force vectorization here (but maybe that
> > > > has the 'unlimited' cost modeling issue)?
> > > >
> > >
> > > The addition wasn't for GCC testsuite usage specifically.  This is about
> > > testing real-world code with respect to our cost models.  In these
> > > instances it's not feasible to sprinkle pragmas over every loop in every
> > > program.
> > 
> > Sure, but cost modeling should still be fixed then - with unlimited (or max)
> > you'd still have to sprinkle novector on the loops that would otherwise be
> > slower.
> 
> Ack; that said, we compare performance at the BB level as well, so we are able
> to tell improvements/regressions at a finer granularity than loop boundaries.
> 
> > 
> > > We also use this during uarch design validation, as it gives, e.g., someone
> > > working on a CPU the ability to generate vector code for design purposes
> > > regardless of what the compiler thinks is profitable on current designs.
> > 
> > For the latter I believe the target should provide ways to force a
> > specific mode with =unlimited then; otherwise you can't reliably get
> > all variants anyway but would depend on costing to pick the correct
> > one out of a set of enabled modes.
> > 
> 
> This doesn't quite work, though.  I do believe a target param to pick a mode
> or VF is useful, and at some point Andre was working on one but never finished
> it.
> 
> However, such a parameter is a global option, and if vectorization is not
> possible with the specified mode it will fail.
> 
> Such precise control is useful for small testcases, but not for whole programs.
> 
> But if I'm not misunderstanding you, you're saying you're OK with changing
> unlimited, and to fix the testsuite fallout we can add params?  That said,
> isn't the ability to control the vector mode useful for writing testcases
> for all targets?

I'm OK with changing unlimited to give the target control over how exactly
it iterates over modes (or whether it iterates at all) - on a similar note,
we might want to change that iteration so that the target picks the next
mode to iterate to, rather than having the up-front vector of modes and the
vectorizer somehow figuring out which modes are useful or not.
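A small, self-contained C++ sketch of that kind of interface follows; the hook name, the mode type, and the fixed order are all hypothetical (this is not an existing target hook), and it only contrasts "the target hands back the next mode to try" with the current up-front vector of modes.

// Illustrative sketch, not GCC code: the vectorizer asks the target for the
// next mode to try rather than receiving all candidate modes up front.
#include <cstddef>
#include <cstdio>
#include <optional>
#include <string>
#include <vector>

// Stand-in for machine_mode.
using mode = std::string;

// Hypothetical per-target callback: given the mode just tried (empty string
// for the first query), return the next mode worth trying, or nothing to stop.
using next_mode_fn = std::optional<mode> (*) (const mode &prev);

static std::optional<mode>
aarch64_like_next_mode (const mode &prev)
{
  // A target could use what it learned from 'prev' (e.g. which operations
  // were unsupported) to prune the search; here the order is simply fixed.
  static const std::vector<mode> order = { "VNx4SI", "V4SI", "V2SI" };
  if (prev.empty ())
    return order.front ();
  for (std::size_t i = 0; i + 1 < order.size (); ++i)
    if (order[i] == prev)
      return order[i + 1];
  return std::nullopt;
}

int
main ()
{
  next_mode_fn next = aarch64_like_next_mode;
  // The analysis loop: keep trying modes until the target says stop.
  for (std::optional<mode> m = next (mode ()); m; m = next (*m))
    std::printf ("analyzing loop with mode %s\n", m->c_str ());
}

The point of such an interface would be that the target can take what it learned from the previous attempt into account when choosing the next mode, instead of enumerating everything up front.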

Richard.

> Thanks,
> Tamar
> 
> > Richard.
> > 
> > > Thanks,
> > > Tamar
> > >
> > > > Richard.
> > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Tamar
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >     * common.opt (vect-cost-model, simd-cost-model): Add max cost model.
> > > > > >     * doc/invoke.texi: Document it.
> > > > > >     * flag-types.h (enum vect_cost_model): Add VECT_COST_MODEL_MAX.
> > > > > >     * tree-vect-data-refs.cc (vect_peeling_hash_insert,
> > > > > >     vect_peeling_hash_choose_best_peeling,
> > > > > >     vect_enhance_data_refs_alignment): Use it.
> > > > > >     * tree-vect-loop.cc (vect_analyze_loop_costing,
> > > > > >     vect_estimate_min_profitable_iters): Likewise.
> > > > > >
> > > > > > ---
> > > > > > diff --git a/gcc/common.opt b/gcc/common.opt
> > > > > > index 88d987e6ab14d9f8df7aa686efffc43418dbb42d..bd5e2e951f9388b12206d9addc736e336cd0e4ee 100644
> > > > > > --- a/gcc/common.opt
> > > > > > +++ b/gcc/common.opt
> > > > > > @@ -3442,11 +3442,11 @@ Enable basic block vectorization (SLP) on trees.
> > > > > >
> > > > > >  fvect-cost-model=
> > > > > >  Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) Init(VECT_COST_MODEL_DEFAULT) Optimization
> > > > > > --fvect-cost-model=[unlimited|dynamic|cheap|very-cheap]     Specifies the cost model for vectorization.
> > > > > > +-fvect-cost-model=[unlimited|max|dynamic|cheap|very-cheap]     Specifies the cost model for vectorization.
> > > > > >
> > > > > >  fsimd-cost-model=
> > > > > >  Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) Init(VECT_COST_MODEL_UNLIMITED) Optimization
> > > > > > --fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap]     Specifies the vectorization cost model for code marked with a simd directive.
> > > > > > +-fsimd-cost-model=[unlimited|max|dynamic|cheap|very-cheap]     Specifies the vectorization cost model for code marked with a simd directive.
> > > > > >
> > > > > >  Enum
> > > > > >  Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer cost model %qs)
> > > > > > @@ -3454,6 +3454,9 @@ Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer
> > > > > >  EnumValue
> > > > > >  Enum(vect_cost_model) String(unlimited) Value(VECT_COST_MODEL_UNLIMITED)
> > > > > >
> > > > > > +EnumValue
> > > > > > +Enum(vect_cost_model) String(max) Value(VECT_COST_MODEL_MAX)
> > > > > > +
> > > > > >  EnumValue
> > > > > >  Enum(vect_cost_model) String(dynamic) Value(VECT_COST_MODEL_DYNAMIC)
> > > > > >
> > > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > > > index 14a78fd236f64185fc129f18b52b20692d49305c..e7b242c9134ff17022c92f81c8b24762cfd59c6c 100644
> > > > > > --- a/gcc/doc/invoke.texi
> > > > > > +++ b/gcc/doc/invoke.texi
> > > > > > @@ -14449,9 +14449,11 @@ With the @samp{unlimited} model the vectorized code-path is assumed
> > > > > >  to be profitable while with the @samp{dynamic} model a runtime check
> > > > > >  guards the vectorized code-path to enable it only for iteration
> > > > > >  counts that will likely execute faster than when executing the original
> > > > > > -scalar loop.  The @samp{cheap} model disables vectorization of
> > > > > > -loops where doing so would be cost prohibitive for example due to
> > > > > > -required runtime checks for data dependence or alignment but otherwise
> > > > > > +scalar loop.  The @samp{max} model similarly to the @samp{unlimited} model
> > > > > > +assumes all vector code is profitable over scalar within loops but does not
> > > > > > +disable the vector to vector costing.  The @samp{cheap} model disables
> > > > > > +vectorization of loops where doing so would be cost prohibitive for example due
> > > > > > +to required runtime checks for data dependence or alignment but otherwise
> > > > > >  is equal to the @samp{dynamic} model.  The @samp{very-cheap} model disables
> > > > > >  vectorization of loops when any runtime check for data dependence or alignment
> > > > > >  is required, it also disables vectorization of epilogue loops but otherwise is
> > > > > > diff --git a/gcc/flag-types.h b/gcc/flag-types.h
> > > > > > index db573768c23d9f6809ae115e71370960314f16ce..1c941c295a2e608eae58c3e3fb0eba1284f731ca 100644
> > > > > > --- a/gcc/flag-types.h
> > > > > > +++ b/gcc/flag-types.h
> > > > > > @@ -277,9 +277,10 @@ enum scalar_storage_order_kind {
> > > > > >  /* Vectorizer cost-model.  Except for DEFAULT, the values are ordered from
> > > > > >     the most conservative to the least conservative.  */
> > > > > >  enum vect_cost_model {
> > > > > > -  VECT_COST_MODEL_VERY_CHEAP = -3,
> > > > > > -  VECT_COST_MODEL_CHEAP = -2,
> > > > > > -  VECT_COST_MODEL_DYNAMIC = -1,
> > > > > > +  VECT_COST_MODEL_VERY_CHEAP = -4,
> > > > > > +  VECT_COST_MODEL_CHEAP = -3,
> > > > > > +  VECT_COST_MODEL_DYNAMIC = -2,
> > > > > > +  VECT_COST_MODEL_MAX = -1,
> > > > > >    VECT_COST_MODEL_UNLIMITED = 0,
> > > > > >    VECT_COST_MODEL_DEFAULT = 1
> > > > > >  };
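As a quick standalone check (not part of the patch, just C++ for a scratch file) that the renumbering keeps the property stated in the comment above - values ordered from most to least conservative, with the new MAX sitting between DYNAMIC and UNLIMITED:

/* Standalone sanity check of the renumbered enum; it mirrors the values in
   the hunk above, with DEFAULT excluded from the ordering as the comment
   says.  */
enum vect_cost_model {
  VECT_COST_MODEL_VERY_CHEAP = -4,
  VECT_COST_MODEL_CHEAP = -3,
  VECT_COST_MODEL_DYNAMIC = -2,
  VECT_COST_MODEL_MAX = -1,
  VECT_COST_MODEL_UNLIMITED = 0,
  VECT_COST_MODEL_DEFAULT = 1
};

static_assert (VECT_COST_MODEL_VERY_CHEAP < VECT_COST_MODEL_CHEAP
               && VECT_COST_MODEL_CHEAP < VECT_COST_MODEL_DYNAMIC
               && VECT_COST_MODEL_DYNAMIC < VECT_COST_MODEL_MAX
               && VECT_COST_MODEL_MAX < VECT_COST_MODEL_UNLIMITED,
               "ordered from most conservative to least conservative");

int main () { return 0; }

The checks added in the rest of the patch test for equality with VECT_COST_MODEL_MAX, so only this relative placement matters for any relational uses of the enum.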
> > > > > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> > > > > > index c9395e33fcdfc7deedd979c764daae93b15abace..5c56956c2edcb76210c36b60526f031011c8e0c7 100644
> > > > > > --- a/gcc/tree-vect-data-refs.cc
> > > > > > +++ b/gcc/tree-vect-data-refs.cc
> > > > > > @@ -1847,7 +1847,9 @@ vect_peeling_hash_insert (hash_table<peel_info_hasher> *peeling_htab,
> > > > > >    /* If this DR is not supported with unknown misalignment then bias
> > > > > >       this slot when the cost model is disabled.  */
> > > > > >    if (!supportable_if_not_aligned
> > > > > > -      && unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > > +      && (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > > > +     || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > > > +           == VECT_COST_MODEL_MAX))
> > > > > >      slot->count += VECT_MAX_COST;
> > > > > >  }
> > > > > >
> > > > > > @@ -2002,7 +2004,8 @@ vect_peeling_hash_choose_best_peeling (hash_table<peel_info_hasher> *peeling_hta
> > > > > >     res.peel_info.dr_info = NULL;
> > > > > >     res.vinfo = loop_vinfo;
> > > > > >
> > > > > > -   if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > > +   if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > > > +   && loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) != VECT_COST_MODEL_MAX)
> > > > > >       {
> > > > > >         res.inside_cost = INT_MAX;
> > > > > >         res.outside_cost = INT_MAX;
> > > > > > @@ -2348,7 +2351,8 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
> > > > > >                   We do this automatically for cost model, since we calculate
> > > > > >              cost for every peeling option.  */
> > > > > >           poly_uint64 nscalars = npeel_tmp;
> > > > > > -              if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > > +              if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > > > +             || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) == VECT_COST_MODEL_MAX)
> > > > > >             {
> > > > > >               poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > > > > >               unsigned group_size = 1;
> > > > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > > > index 958b829fa8d1ad267fbde3be915719f3a51e6a38..5f3adc257f6581850f901c7747771f5931df942a 100644
> > > > > > --- a/gcc/tree-vect-loop.cc
> > > > > > +++ b/gcc/tree-vect-loop.cc
> > > > > > @@ -2407,7 +2407,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > > > >                                   &min_profitable_estimate,
> > > > > >                                   suggested_unroll_factor);
> > > > > >
> > > > > > -  if (min_profitable_iters < 0)
> > > > > > +  if (min_profitable_iters < 0
> > > > > > +      && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
> > > > > >      {
> > > > > >        if (dump_enabled_p ())
> > > > > >     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > > > > @@ -2430,7 +2431,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > > > >    LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = th;
> > > > > >
> > > > > >    if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > > > > -      && LOOP_VINFO_INT_NITERS (loop_vinfo) < th)
> > > > > > +      && LOOP_VINFO_INT_NITERS (loop_vinfo) < th
> > > > > > +      && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
> > > > > >      {
> > > > > >        if (dump_enabled_p ())
> > > > > >     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > > > > @@ -2490,6 +2492,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > > > >     estimated_niter = likely_max_stmt_executions_int (loop);
> > > > > >      }
> > > > > >    if (estimated_niter != -1
> > > > > > +      && loop_cost_model (loop) != VECT_COST_MODEL_MAX
> > > > > >        && ((unsigned HOST_WIDE_INT) estimated_niter
> > > > > >       < MAX (th, (unsigned) min_profitable_estimate)))
> > > > > >      {
> > > > > > @@ -4638,7 +4641,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> > > > > >    vector_costs *target_cost_data = loop_vinfo->vector_costs;
> > > > > >
> > > > > >    /* Cost model disabled.  */
> > > > > > -  if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > > +   if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > >      {
> > > > > >        if (dump_enabled_p ())
> > > > > >     dump_printf_loc (MSG_NOTE, vect_location, "cost model disabled.\n");
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > > Richard Biener <rguent...@suse.de>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > Nuernberg)
> > >
> > 
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
