> -----Original Message-----
> From: Richard Biener <[email protected]>
> Sent: 06 May 2026 12:59
> To: [email protected]
> Cc: Tamar Christina <[email protected]>; [email protected];
> [email protected]
> Subject: [PATCH 2/2] [x86] adjust OMP SIMD call cost
> 
> The following adds special handling to OMP SIMD vector call costs
> which were not costed at all and for which a single simple vector
> stmt isn't appropriate.  PR125174 shows that even when AVX imposes
> more overhead (from also slightly bogus costing) than SSE, when
> there are two OMP SIMD calls involved, doing fewer of those should trump
> that.
> 
> Bootstrap & regtest ongoing on x86_64-unknown-linux-gnu.
> 
> I've verified this resolves the observed 465.tonto regression.  I
> thought about catching all OMP SIMD vectorized stmts but then
> realized scalar costing doesn't see this yet so we'll make all
> vectorizations unprofitable.  We cannot handle all calls this
> way either, as some directly expand to native insns (popcount, etc.).
> So I fear we have to maintain a positive list.  There are 52
> 'notinbranch' SIMD declarations in glibc 2.38 on x86_64, probably
> different ones on ARM.
> 
> Also we of course have no idea about actual cost of the call
> (but it's expensive).  Nor do we have an idea of the scalar
> vs. vector cost.
> 
> But as the PR shows, doing "nothing" isn't an option, at least
> when, like on x86 there's both SSE and AVX variants and the
> surrounding code would make the SSE variant (appear) cheaper.
> 
> Any good ideas?
> 
> Otherwise I'll try to extensively cover all libm builtins
> (anticipating future SIMD-ification) in the same way, with
> same costs.

I hadn't even noticed that we weren't costing these, which would explain
some funky unrolling we've seen.

The AArch64 vector call PCS is particularly annoying with SVE, since the
predicate is passed in p0, so there's a bunch of shuffling of the arguments
and results, as in https://godbolt.org/z/cbGM5K8fn.  Calls with ncopies > 1
should therefore be made more expensive, and passing ncopies makes complete
sense.

This would allow us to cost the additional instructions for PCS moves.

It's somewhere on the backlog to pass the vectors of such ncopies > 1 calls
as arguments to a single function.  E.g. emitting the above as
cosf (vec1, vec2) would be really beneficial, because you share the overhead
of loading the coeff tables and don't have the PCS movement.
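
To illustrate the ncopies argument, here is a rough sketch and not GCC code.
The struct, the function names and the cost values are all hypothetical.  It
assumes each separate call pays for the call body, the PCS moves and the
coefficient table load, while a merged call pays the moves and the table load
only once.

```cpp
// Hypothetical cost parameters; none of these names or values come from GCC.
struct simd_call_costs
{
  int call_body;   // assumed cost of the math in one vector call
  int pcs_moves;   // assumed cost of shuffling arguments/results for the PCS
  int table_load;  // assumed cost of loading the coefficient tables
};

// ncopies separate vector calls: every copy pays the full overhead.
int
cost_separate_calls (const simd_call_costs &c, int ncopies)
{
  return ncopies * (c.call_body + c.pcs_moves + c.table_load);
}

// One call taking ncopies vectors: the per-vector math still scales with
// ncopies, but (by assumption) the moves and table load are paid once.
int
cost_merged_call (const simd_call_costs &c, int ncopies)
{
  return ncopies * c.call_body + c.pcs_moves + c.table_load;
}
```

With made-up costs of 40, 6 and 4 and ncopies == 2, the separate calls come
to 100 and the merged call to 90.  For ncopies == 1 the two are equal.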

I'll make an AArch64 version once patch 1 is committed.

Thanks,
Tamar

> 
> Thanks,
> Richard.
> 
>       PR target/125174
>       * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
>       Cost calls as 10 times FMA.
> ---
>  gcc/config/i386/i386.cc | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index e73c2d7f7d0..6b271ac3fca 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -26602,6 +26602,13 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>        case CFN_MULH:
>       stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
>       break;
> +      CASE_CFN_SIN:
> +      CASE_CFN_COS:
> +      CASE_CFN_EXP:
> +     stmt_cost = 10 * ix86_vec_cost (mode,
> +                                     mode == SFmode ? ix86_cost->fmass
> +                                     : ix86_cost->fmasd);
> +     break;
>        default:
>       break;
>        }
> --
> 2.51.0
