> -----Original Message-----
> From: Richard Sandiford <[email protected]>
> Sent: 06 May 2026 17:01
> To: Richard Biener <[email protected]>
> Cc: [email protected]; Tamar Christina <[email protected]>;
> [email protected]
> Subject: Re: [PATCH 2/2] [x86] adjust OMP SIMD call cost
>
> Richard Biener <[email protected]> writes:
> > The following adds special handling to OMP SIMD vector call costs
> > which were not costed at all and for which a single simple vector
> > stmt isn't appropriate. PR125174 shows that even when AVX imposes
> > more overhead (from also slightly bogus costing) than SSE, when
> > there's two OMP SIMD calls involved doing less of those should trump
> > that.
> >
> > Bootstrap & regtest ongoing on x86_64-unknown-linux-gnu.
> >
> > I've verified this resolves the observed 465.tonto regression. I
> > thought about catching all OMP SIMD vectorized stmts but then
> > realized scalar costing doesn't see this yet so we'll make all
> > vectorizations unprofitable. We cannot handle all calls this
> > way either, as some directly expand to native insns (popcount, etc.).
> > So I fear we have to maintain a positive list. There's 52
> > 'notinbranch' SIMD declarations in glibc 2.38 on x86_64, probably
> > different ones on ARM.
> >
> > Also we of course have no idea about actual cost of the call
> > (but it's expensive). Nor do we have an idea of the scalar
> > vs. vector cost.
> >
> > But as the PR shows, doing "nothing" isn't an option, at least
> > when, like on x86 there's both SSE and AVX variants and the
> > surrounding code would make the SSE variant (appear) cheaper.
> >
> > Any good ideas?
> >
> > Otherwise I'll try to extensively cover all libm builtins
> > (anticipating future SIMD-ification) in the same way, with
> > same costs.
>
> Probably a daft question, but: if the assumption is that OMP SIMD calls
> are very expensive (which I agree is reasonable!), then would there be
> any important cases in which we'd want to pick an SSE loop with OMP SIMD
> calls over an AVX loop with OMP SIMD calls?

I think this is the same as the Adv.SIMD vs SVE example in my reply:
https://godbolt.org/z/cbGM5K8fn

SVE will always have an additional overhead because of the predicates.
So when ncopies > 1 on the same VL, Adv.SIMD should always be faster.
On different VLs it's more complicated...

>
> I just wonder whether comparing "number of OMP SIMD CALLs / estimated
> VF"
> would get us most of the way there, falling back to the current cost
> comparison when the ratios are equal.

I guess the problem here is that it also depends on the lifetime of the
calls. In my examples above, one of the reasons it's more expensive is
how the calls are materialized.

I think the ncopies > 1 cases are more of a problem than the raw count of
OMP SIMD calls, because they artificially keep values live across the
second call. For example, in https://godbolt.org/z/cbGM5K8fn there's no
reason for `z23` to be kept live, because the OMP calls are marked pure
and const (from what I remember) and so couldn't affect memory anyway.

Thanks,
Tamar

>
> Thanks,
> Richard
>
> >
> > Thanks,
> > Richard.
> >
> >         PR target/125174
> >         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> >         Cost calls as 10 times FMA.
> > ---
> >  gcc/config/i386/i386.cc | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index e73c2d7f7d0..6b271ac3fca 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -26602,6 +26602,13 @@ ix86_vector_costs::add_stmt_cost (int count,
> vect_cost_for_stmt kind,
> >             case CFN_MULH:
> >               stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
> >               break;
> > +           CASE_CFN_SIN:
> > +           CASE_CFN_COS:
> > +           CASE_CFN_EXP:
> > +             stmt_cost = 10 * ix86_vec_cost (mode,
> > +                                             mode == SFmode ? ix86_cost->fmass
> > +                                             : ix86_cost->fmasd);
> > +             break;
> >             default:
> >               break;
> >             }
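
To make the quoted "number of OMP SIMD CALLs / estimated VF" idea concrete, here is a minimal sketch of how such a comparison could look: compare the call ratios first and fall back to the ordinary cost comparison only on a tie. This is standalone C++, not vectorizer code. `loop_candidate`, its fields and `better_candidate_p` are hypothetical names made up for illustration, and the fallback (body cost per scalar iteration) is only a stand-in for the real cost comparison.

```cpp
#include <cstdint>

// Hypothetical summary of one candidate vectorization of a loop.
// None of these names exist in GCC; they are for illustration only.
struct loop_candidate
{
  unsigned simd_calls;   // OMP SIMD calls emitted per vector iteration
  unsigned estimated_vf; // estimated vectorization factor (assumed nonzero)
  unsigned body_cost;    // body cost as reported by the existing cost model
};

// Return true if candidate A should be preferred over candidate B.
bool
better_candidate_p (const loop_candidate &a, const loop_candidate &b)
{
  // Compare a.simd_calls / a.estimated_vf against
  // b.simd_calls / b.estimated_vf by cross-multiplying, which avoids
  // both division and floating point.
  uint64_t a_calls = uint64_t (a.simd_calls) * b.estimated_vf;
  uint64_t b_calls = uint64_t (b.simd_calls) * a.estimated_vf;
  if (a_calls != b_calls)
    return a_calls < b_calls;

  // Ratios are equal: fall back to comparing the body cost per scalar
  // iteration, again by cross-multiplying.
  uint64_t a_cost = uint64_t (a.body_cost) * b.estimated_vf;
  uint64_t b_cost = uint64_t (b.body_cost) * a.estimated_vf;
  return a_cost < b_cost;
}
```

With invented numbers, an SSE candidate `{2, 2, 20}` and an AVX candidate `{2, 4, 50}` have ratios of 1 and 0.5 calls per scalar iteration, so the AVX loop is preferred even though its body cost per scalar iteration is higher (12.5 vs. 10). The sketch does not model the ncopies > 1 concern raised above, where values are kept live across the second call.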
