Tamar Christina <[email protected]> writes:
>> -----Original Message-----
>> From: Richard Sandiford <[email protected]>
>> Sent: 06 May 2026 17:01
>> To: Richard Biener <[email protected]>
>> Cc: [email protected]; Tamar Christina <[email protected]>;
>> [email protected]
>> Subject: Re: [PATCH 2/2] [x86] adjust OMP SIMD call cost
>>
>> Richard Biener <[email protected]> writes:
>> > The following adds special handling to OMP SIMD vector call costs
>> > which were not costed at all and for which a single simple vector
>> > stmt isn't appropriate.  PR125174 shows that even when AVX imposes
>> > more overhead (from also slightly bogus costing) than SSE, when
>> > there's two OMP SIMD calls involved doing less of those should trump
>> > that.
>> >
>> > Bootstrap & regtest ongoing on x86_64-unknown-linux-gnu.
>> >
>> > I've verified this resolves the observed 465.tonto regression.  I
>> > thought about catching all OMP SIMD vectorized stmts but then
>> > realized scalar costing doesn't see this yet so we'll make all
>> > vectorizations unprofitable.  We cannot handle all calls this
>> > way either, as some directly expand to native insns (popcount, etc.).
>> > So I fear we have to maintain a positive list.  There's 52
>> > 'notinbranch' SIMD declarations in glibc 2.38 on x86_64, probably
>> > different ones on ARM.
>> >
>> > Also we of course have no idea about actual cost of the call
>> > (but it's expensive).  Nor do we have an idea of the scalar
>> > vs. vector cost.
>> >
>> > But as the PR shows, doing "nothing" isn't an option, at least
>> > when, like on x86 there's both SSE and AVX variants and the
>> > surrounding code would make the SSE variant (appear) cheaper.
>> >
>> > Any good ideas?
>> >
>> > Otherwise I'll try to extensively cover all libm builtins
>> > (anticipating future SIMD-ification) in the same way, with
>> > same costs.
>>
>> Probably a daft question, but: if the assumption is that OMP SIMD calls
>> are very expensive (which I agree is reasonable!), then would there be
>> any important cases in which we'd want to pick an SSE loop with OMP SIMD
>> calls over an AVX loop with OMP SIMD calls?
>
> I think this is the same as the Adv.SIMD vs SVE example in my reply
> https://godbolt.org/z/cbGM5K8fn
>
> SVE will always have an additional overhead because of the predicates.
> So when ncopies > 1 on same VL Adv.SIMD should always be faster.
>
> On different VLs it's more complicated...
>
>>
>> I just wonder whether comparing "number of OMP SIMD CALLs / estimated VF"
>> would get us most of the way there, falling back to the current cost
>> comparison when the ratios are equal.
>
> I guess the problem here is it also depends on the lifetime of the calls.
>
> In my examples above one of the reasons it's more expensive is because
> of how the calls are materialized.  I think the ncopies > 1 ones are
> more problematic over counting OMP SIMD CALLS because they
> artificially keep values live across the second call.
>
> i.e. in https://godbolt.org/z/cbGM5K8fn there's no reason for `z23` to
> be kept live because the OMP calls are marked pure and const (from
> what I remember) so couldn't affect memory anyway.

Yeah, it sounds like there are two aspects to it: the cost of the
scaffolding needed to make the call and the cost of the call(ee) itself.

I agree that the cost of the scaffolding should be part of the normal
costing process and should be used to distinguish Adv SIMD and like-sized
SVE.  But it sounded like the x86 example was more about the cost of the
call(ee): if the OMP SIMD call is assumed to do a lot of work, doing it
twice for half-sized vectors is a clear loss.  (Might have misunderstood
though :) )

The problem with trying to cost the call(ee) based on (say) the likely
number of instructions is that it falls down when costing user-defined
functions.  I think we'd need a fallback approach for that case, even if
we can do better for well-known functions.  That's why I was thinking of
having "number of OMP SIMD CALLs / estimated VF" as a first-level
comparison.

But maybe adding a magic param in the normal costing process is good
enough in practice.  I suppose either way is going to have corner cases
that do the wrong thing...

Thanks,
Richard
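
For concreteness, the "number of OMP SIMD CALLs / estimated VF"
first-level comparison could look roughly like the sketch below.  This is
not GCC code: LoopCandidate, prefer_candidate and the fallback_cost field
are hypothetical names made up for illustration, and fallback_cost is only
a stand-in for whatever the existing cost comparison computes.

```cpp
#include <cstdint>

// Hypothetical summary of one vectorization candidate (not a GCC type).
struct LoopCandidate {
  std::uint64_t omp_simd_calls;  // OMP SIMD calls per vector iteration
  std::uint64_t estimated_vf;    // estimated vectorization factor, > 0
  std::uint64_t fallback_cost;   // stand-in for the existing cost result
};

// Returns true if candidate A should be preferred over candidate B.
inline bool prefer_candidate(const LoopCandidate &a, const LoopCandidate &b) {
  // First level: compare calls_a / vf_a against calls_b / vf_b.
  // Cross-multiplying avoids integer-division rounding.
  std::uint64_t lhs = a.omp_simd_calls * b.estimated_vf;
  std::uint64_t rhs = b.omp_simd_calls * a.estimated_vf;
  if (lhs != rhs)
    return lhs < rhs;

  // Equal ratios: fall back to the existing cost comparison.
  return a.fallback_cost < b.fallback_cost;
}
```

Taking illustrative numbers (assumed, not from the PR): with two calls
per vector iteration, an SSE candidate at VF 2 makes one call per scalar
iteration and an AVX candidate at VF 4 makes half a call, so the AVX loop
wins the first-level check even if its fallback cost is higher.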
