On Wed, 23 Apr 2025, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Wednesday, April 23, 2025 10:14 AM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: Richard Sandiford <richard.sandif...@arm.com>; gcc-patches@gcc.gnu.org;
> > nd <n...@arm.com>
> > Subject: RE: [PATCH]middle-end: Add new "max" vector cost model
> > 
> > On Wed, 23 Apr 2025, Tamar Christina wrote:
> > 
> > > > -----Original Message-----
> > > > From: Richard Sandiford <richard.sandif...@arm.com>
> > > > Sent: Wednesday, April 23, 2025 9:45 AM
> > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > Cc: Richard Biener <rguent...@suse.de>; gcc-patches@gcc.gnu.org; nd
> > > > <n...@arm.com>
> > > > Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > > >
> > > > Tamar Christina <tamar.christ...@arm.com> writes:
> > > > >> -----Original Message-----
> > > > >> From: Richard Biener <rguent...@suse.de>
> > > > >> Sent: Wednesday, April 23, 2025 9:31 AM
> > > > >> To: Tamar Christina <tamar.christ...@arm.com>
> > > > >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Sandiford
> > > > >> <richard.sandif...@arm.com>
> > > > >> Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > > > >>
> > > > >> On Wed, 23 Apr 2025, Tamar Christina wrote:
> > > > >>
> > > > >> > Hi All,
> > > > >> >
> > > > >> > This patch proposes a new vector cost model called "max".  The 
> > > > >> > cost model
> > is
> > > > an
> > > > >> > intersection between two of our existing cost models.  Like 
> > > > >> > `unlimited` it
> > > > >> > disables the costing vs scalar and assumes all vectorization to be 
> > > > >> > profitable.
> > > > >> >
> > > > >> > But unlike unlimited it does not fully disable the vector cost 
> > > > >> > model.  That
> > > > >> > means that we still perform comparisons between vector modes.
> > > > >> >
> > > > >> > As an example, the following:
> > > > >> >
> > > > >> > void
> > > > >> > foo (char *restrict a, int *restrict b, int *restrict c,
> > > > >> >      int *restrict d, int stride)
> > > > >> > {
> > > > >> >     if (stride <= 1)
> > > > >> >         return;
> > > > >> >
> > > > >> >     for (int i = 0; i < 3; i++)
> > > > >> >         {
> > > > >> >             int res = c[i];
> > > > >> >             int t = b[i * stride];
> > > > >> >             if (a[i] != 0)
> > > > >> >                 res = t * d[i];
> > > > >> >             c[i] = res;
> > > > >> >         }
> > > > >> > }
> > > > >> >
> > > > >> > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic 
> > > > >> > fails
> > to
> > > > >> > vectorize as it assumes scalar would be faster, and with
> > > > >> > -fvect-cost-model=unlimited it picks a vector type that's so big 
> > > > >> > that the
> > large
> > > > >> > sequence generated is working on mostly inactive lanes:
> > > > >> >
> > > > >> >         ...
> > > > >> >         and     p3.b, p3/z, p4.b, p4.b
> > > > >> >         whilelo p0.s, wzr, w7
> > > > >> >         ld1w    z23.s, p3/z, [x3, #3, mul vl]
> > > > >> >         ld1w    z28.s, p0/z, [x5, z31.s, sxtw 2]
> > > > >> >         add     x0, x5, x0
> > > > >> >         punpklo p6.h, p6.b
> > > > >> >         ld1w    z27.s, p4/z, [x0, z31.s, sxtw 2]
> > > > >> >         and     p6.b, p6/z, p0.b, p0.b
> > > > >> >         punpklo p4.h, p7.b
> > > > >> >         ld1w    z24.s, p6/z, [x3, #2, mul vl]
> > > > >> >         and     p4.b, p4/z, p2.b, p2.b
> > > > >> >         uqdecw  w6
> > > > >> >         ld1w    z26.s, p4/z, [x3]
> > > > >> >         whilelo p1.s, wzr, w6
> > > > >> >         mul     z27.s, p5/m, z27.s, z23.s
> > > > >> >         ld1w    z29.s, p1/z, [x4, z31.s, sxtw 2]
> > > > >> >         punpkhi p7.h, p7.b
> > > > >> >         mul     z24.s, p5/m, z24.s, z28.s
> > > > >> >         and     p7.b, p7/z, p1.b, p1.b
> > > > >> >         mul     z26.s, p5/m, z26.s, z30.s
> > > > >> >         ld1w    z25.s, p7/z, [x3, #1, mul vl]
> > > > >> >         st1w    z27.s, p3, [x2, #3, mul vl]
> > > > >> >         mul     z25.s, p5/m, z25.s, z29.s
> > > > >> >         st1w    z24.s, p6, [x2, #2, mul vl]
> > > > >> >         st1w    z25.s, p7, [x2, #1, mul vl]
> > > > >> >         st1w    z26.s, p4, [x2]
> > > > >> >         ...
> > > > >> >
> > > > >> > With -fvect-cost-model=max you get more reasonable code:
> > > > >> >
> > > > >> > foo:
> > > > >> >         cmp     w4, 1
> > > > >> >         ble     .L1
> > > > >> >         ptrue   p7.s, vl3
> > > > >> >         index   z0.s, #0, w4
> > > > >> >         ld1b    z29.s, p7/z, [x0]
> > > > >> >         ld1w    z30.s, p7/z, [x1, z0.s, sxtw 2]
> > > > >> >    ptrue   p6.b, all
> > > > >> >         cmpne   p7.b, p7/z, z29.b, #0
> > > > >> >         ld1w    z31.s, p7/z, [x3]
> > > > >> >    mul     z31.s, p6/m, z31.s, z30.s
> > > > >> >         st1w    z31.s, p7, [x2]
> > > > >> > .L1:
> > > > >> >         ret
> > > > >> >
> > > > >> > This model has been useful internally for performance exploration 
> > > > >> > and
> > cost-
> > > > >> model
> > > > >> > validation.  It allows us to force realistic vectorization 
> > > > >> > overriding the cost
> > > > >> > model to be able to tell whether it's correct wrt to profitability.
> > > > >> >
> > > > >> > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > > >> > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > > >> > -m32, -m64 and no issues.
> > > > >> >
> > > > >> > Ok for master?
> > > > >>
> > > > >> Hmm.  I don't like another cost model.  Instead how about changing
> > > > >> 'unlimited' to still iterate through vector sizes?  Cost modeling
> > > > >> is really about vector vs. scalar, not vector vs. vector which is
> > > > >> completely under target control.  Targets should provide a way
> > > > >> to limit iteration, like aarch64 has with the 
> > > > >> aarch64-autovec-preference
> > > > >> --param or x86 has with -mprefer-vector-width.
> > > > >>
> > > > >
> > > > > I'm ok with changing 'unlimited' if that's preferred, but I do want 
> > > > > to point
> > > > > out that we don't have enough control with current --param or -m 
> > > > > options
> > > > > to simulate all cases.
> > > > >
> > > > > For instance for SVE there's way for us to force a smaller type to be 
> > > > > used
> > > > > and thus force an unpacking to happen.  Or there's no way to force an
> > > > > unrolling with Adv. SIMD.
> > > > >
> > > > > Basically there's not enough control over the VF to exercise some 
> > > > > tests
> > > > > reliably.  Some tests explicitly relied on unlimited just picking the 
> > > > > first
> > > > > mode.
> > > >
> > > > FWIW, adding extra AArch64 --params sounds ok to me.  The ones we have
> > > > were just added on an as-needed/as-wanted basis, rather than as an 
> > > > attempt
> > > > to be complete.
> > > >
> > > > After the aarch64-autovec-preference backward-compatibility controversy,
> > > > we should consider whether what we add is something that is intended for
> > > > developers and can be taken away at any time (--param), or whether it's
> > > > something that we promise to support going forward (-m).
> > >
> > > Fair enough, the target doesn't have enough control over the vector 
> > > costing
> > > strategy here.  So I guess the implementation for this would be to raise 
> > > scalar
> > > costing to an extreme degree such that 'dynamic' will always vectorize and
> > > still do the inner vector mode comparisons?
> > 
> > I'd think a --param to "scale" the scalar (or vector) cost so one could
> > gradually get "more" vectorization might sound useful.
> > 
> > That said, we do want to get the chance to look at the cases where
> > vector costs are larger than scalar costs - for x86 it's most of the
> > time cases that show "fixes" we put in for the lack of proper modeling
> > a CPU pipeline bite back in some cases.
> 
> Yeah, that's one of the reason for this patch :)
> 
> So joining the three threads, it sounds like the following conclusion:
> 
> - no new cost model
> - provide parameter to scale scalar costing (I'd scale scalar costing up 
> rather
>   than vector down, otherwise the fractional costing can become iffy if the
>   values collapse to 0.)
> - provide a parameter to select VF or mode
> 
> The only remaining question now is, whether these should be backend
> parameters or generic parameters usable for all targets.

The cost scaling could be generic, the iteration thing is necessarily
target specific.

Richard.

> Thanks,
> Tamar
> 
> > 
> > Richard.
> > 
> > > Still I'm hoping we could do something generic here as I think it's 
> > > useful for
> > > everyone but will prepare a backend patch if this isn't the case.
> > >
> > > If we do, my vote would be for a `-m` option as the use cases for this 
> > > would
> > > be used in infrastructures for a long time.
> > >
> > > Thanks,
> > > Tamar
> > >
> > > >
> > > > Thanks,
> > > > Richard
> > >
> > 
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Reply via email to