On Wed, 6 May 2026, Richard Sandiford wrote:

> Richard Biener <[email protected]> writes:
> >> On 06.05.2026 at 18:02, Richard Sandiford
> >> <[email protected]> wrote:
> >>
> >> Richard Biener <[email protected]> writes:
> >>> The following adds special handling to OMP SIMD vector call costs
> >>> which were not costed at all and for which a single simple vector
> >>> stmt isn't appropriate.  PR125174 shows that even when AVX imposes
> >>> more overhead (from also slightly bogus costing) than SSE, when
> >>> there's two OMP SIMD calls involved doing less of those should trump
> >>> that.
> >>>
> >>> Bootstrap & regtest ongoing on x86_64-unknown-linux-gnu.
> >>>
> >>> I've verified this resolves the observed 465.tonto regression.  I
> >>> thought about catching all OMP SIMD vectorized stmts but then
> >>> realized scalar costing doesn't see this yet so we'll make all
> >>> vectorizations unprofitable.  We cannot handle all calls this
> >>> way either, as some directly expand to native insns (popcount, etc.).
> >>> So I fear we have to maintain a positive list.  There's 52
> >>> 'notinbranch' SIMD declarations in glibc 2.38 on x86_64, probably
> >>> different ones on ARM.
> >>>
> >>> Also we of course have no idea about actual cost of the call
> >>> (but it's expensive).  Nor do we have an idea of the scalar
> >>> vs. vector cost.
> >>>
> >>> But as the PR shows, doing "nothing" isn't an option, at least
> >>> when, like on x86 there's both SSE and AVX variants and the
> >>> surrounding code would make the SSE variant (appear) cheaper.
> >>>
> >>> Any good ideas?
> >>>
> >>> Otherwise I'll try to extensively cover all libm builtins
> >>> (anticipating future SIMD-ification) in the same way, with
> >>> same costs.
> >>
> >> Probably a daft question, but: if the assumption is that OMP SIMD calls
> >> are very expensive (which I agree is reasonable!), then would there be
> >> any important cases in which we'd want to pick an SSE loop with OMP SIMD
> >> calls over an AVX loop with OMP SIMD calls?
> >>
> >> I just wonder whether comparing "number of OMP SIMD calls / estimated VF"
> >> would get us most of the way there, falling back to the current cost
> >> comparison when the ratios are equal.
> >
> > I think that would work for the case in question, but how does it work for
> > comparing against the scalar loop cost?
> > We could effectively ignore it or simply make it number of calls per
> > iteration.
>
> Guess I'm being maximalist about it, but: if the vector function isn't
> at least as fast as looping over the scalar function, the vector function
> shouldn't be advertised to the vectoriser.  Ignoring the callee cost
> for scalar vs. vector should then be conservatively correct.
True.  On x86 at least a function call (scalar or vector) also causes
all live vector and scalar FP regs to be spilled.  That's an argument
for counting the absolute number of calls in the vector loop.

> But I suppose that doesn't help with the other direction, where the
> vector call is such a big win that it can compensate for inline vector
> code that is worse than the corresponding scalar code.  Not sure how
> often that would happen in practice.

The case in question is where the AVX call helps compensate for the
surrounding AVX code being worse than if vectorized with SSE.
Costing all of them (also scalar) the same fixes the relative importance
compared to other stmts.  I agree having a tunable for this would be useful.

> > The quite arbitrary cost for SIMD calls is equally
> > problematic.  I guess there’s simply no perfect way.  I do somewhat like
> > the patch for the reason we’ll keep one thing to compare rather than
> > making the compare effectively “two dimensional”
>
> Yeah, I can see the argument for simplicity.
>
> But if we did fold it all into a single cost value, and had a fallback
> --param to specify the cost of user-defined functions, it'd be nice if
> we would find a good value for all targets, rather than each target
> having its own go at pinning the tail on the donkey.

Yeah, which is why I was asking for good ideas ... we do not cost
the scalar side in a reasonable way either.  And currently it's the
target's job to do something.  But the vectorizer could provide
a helper to classify functions at least.

Thanks,
Richard.
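
For reference, the two ways of comparing candidate loops discussed in this
thread can be sketched roughly as below.  This is only an illustration under
stated assumptions, not vectorizer code: LoopCandidate, prefer_by_call_ratio,
prefer_by_folded_cost and call_cost are hypothetical names, both comparisons
are assumed to work per scalar iteration, and the numbers are made up to
mimic the shape of the PR (same number of SIMD calls per vector iteration,
surrounding AVX code costlier per scalar iteration than SSE).

```cpp
#include <cstdint>

// Hypothetical summary of one candidate vector loop body (not a GCC type).
struct LoopCandidate {
  unsigned simd_calls;    // OMP SIMD calls per vector iteration
  unsigned estimated_vf;  // scalar iterations covered per vector iteration
  unsigned other_cost;    // cost of all non-call stmts per vector iteration
};

// Scale a per-vector-iteration quantity by the *other* candidate's VF so
// that two ratios x/vf can be compared without division.
constexpr std::uint64_t scaled(std::uint64_t x, unsigned other_vf) {
  return x * other_vf;
}

// "Two dimensional" compare: fewer SIMD calls per scalar iteration wins;
// when the ratios are equal, fall back to the remaining cost per scalar
// iteration.  Returns true if A is preferred over B.
constexpr bool prefer_by_call_ratio(LoopCandidate a, LoopCandidate b) {
  return scaled(a.simd_calls, b.estimated_vf) != scaled(b.simd_calls, a.estimated_vf)
             ? scaled(a.simd_calls, b.estimated_vf) < scaled(b.simd_calls, a.estimated_vf)
             : scaled(a.other_cost, b.estimated_vf) < scaled(b.other_cost, a.estimated_vf);
}

// Body cost with every SIMD call charged the same (assumed) call_cost.
constexpr std::uint64_t total(LoopCandidate c, unsigned call_cost) {
  return std::uint64_t(c.simd_calls) * call_cost + c.other_cost;
}

// Single-value compare: fold the calls into one cost, then compare the
// cost per scalar iteration.  Returns true if A is preferred over B.
constexpr bool prefer_by_folded_cost(LoopCandidate a, LoopCandidate b,
                                     unsigned call_cost) {
  return scaled(total(a, call_cost), b.estimated_vf)
       < scaled(total(b, call_cost), a.estimated_vf);
}

// Made-up numbers: two calls in each loop, AVX covers twice the iterations.
constexpr LoopCandidate sse{2, 4, 10};
constexpr LoopCandidate avx{2, 8, 24};

// Ratio compare picks AVX (2/8 calls per iteration versus 2/4).
static_assert(prefer_by_call_ratio(avx, sse), "AVX has fewer calls per iteration");
// With calls not costed at all, SSE appears cheaper (10/4 versus 24/8).
static_assert(prefer_by_folded_cost(sse, avx, 0), "uncosted calls favour SSE");
// With a (hypothetical) call cost of 10, the folded cost favours AVX.
static_assert(prefer_by_folded_cost(avx, sse, 10), "costed calls favour AVX");
```

The folded variant keeps a single value to compare, at the price of having
to pick call_cost; the ratio variant needs no such number but makes the
comparison two-level.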
