On Thu, 1 Feb 2024, Andre Vieira (lists) wrote:

> 
> 
> On 01/02/2024 07:19, Richard Biener wrote:
> > On Wed, 31 Jan 2024, Andre Vieira (lists) wrote:
> > 
> > 
> > The patch didn't come with a testcase so it's really hard to tell
> > what goes wrong now and how it is fixed ...
> 
> My bad!  I had a testcase locally but never added it...
> 
> However... now that I look at it and have run it past Richard S, the
> codegen isn't 'wrong', but it does have the potential to lead to some
> pretty slow codegen, especially for inbranch simdclones, where it
> transforms the SVE predicate into an Advanced SIMD vector by inserting
> the elements one at a time...
> 
> An example of this can be seen if you do:
> 
> gcc -O3 -march=armv8-a+sve -msve-vector-bits=128 -fopenmp-simd t.c -S
> 
> with the following t.c:
> 
> #pragma omp declare simd simdlen(4) inbranch
> int __attribute__ ((const)) fn5(int);
> 
> void fn4 (int *a, int *b, int n)
> {
>   for (int i = 0; i < n; ++i)
>     b[i] = fn5(a[i]);
> }
> 
> Now I do have to say, for our main use case of libmvec we won't have
> any 'inbranch' Advanced SIMD clones, so we avoid that issue... but of
> course that doesn't mean user code will avoid it.
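(Spelling out the "inserting the elements one at a time" part of the
example above: the lowered IL for the mask transfer ends up roughly of
the shape below.  This is a hand-written sketch of the shape of the
code, not an actual dump, and the SSA names are made up:

  _1 = loop_mask[0] ? 1 : 0;
  _2 = loop_mask[1] ? 1 : 0;
  _3 = loop_mask[2] ? 1 : 0;
  _4 = loop_mask[3] ? 1 : 0;
  abi_mask = {_1, _2, _3, _4};  /* vector(4) int argument for _ZGVnM4v_fn5 */

i.e. each SVE predicate lane is extracted and re-inserted into an
Advanced SIMD vector individually before the clone is called.)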
It seems to use SVE masks with vector(4) <signed-boolean:4> while the
ABI says the mask is vector(4) int.  You say that's because we choose
an Advanced SIMD clone for the SVE VLS vector code (it calls
_ZGVnM4v_fn5).

The vectorizer creates

  _44 = VEC_COND_EXPR <loop_mask_41, { 1, 1, 1, 1 }, { 0, 0, 0, 0 }>;

and then vector lowering decomposes this.  That means the vectorizer
lacks a check that the target can handle this VEC_COND_EXPR.

Of course I would expect that SVE with VLS vectors is able to
code-generate this operation, so in the end it's simply missing
patterns.

Richard.

> I'm going to remove this patch and run another regression test to see
> whether it catches anything weird, but if not then I guess we do have
> the option to not use this patch and aim to solve the costing or
> codegen issue in GCC 15.  We don't currently do any simdclone costing,
> and I don't have a clear suggestion for how to do it: OpenMP has no
> mechanism that I know of to expose the speedup of a simdclone over its
> scalar variant, so how would we 'compare' a simdclone call, with its
> extra overhead of argument preparation, against the scalar version?
> Though at least we could prefer a call to a different simdclone with
> less argument preparation.  Anyway, I digress.
> 
> Other tests: these require aarch64-autovec-preference=2, so they also
> have me less worried...
> 
> gcc -O3 -march=armv8-a+sve -msve-vector-bits=128 --param
> aarch64-autovec-preference=2 -fopenmp-simd t.c -S
> 
> t.c:
> 
> #pragma omp declare simd simdlen(2) notinbranch
> float __attribute__ ((const)) fn1(double);
> 
> void fn0 (float *a, float *b, int n)
> {
>   for (int i = 0; i < n; ++i)
>     b[i] = fn1((double) a[i]);
> }
> 
> #pragma omp declare simd simdlen(2) notinbranch
> float __attribute__ ((const)) fn3(float);
> 
> void fn2 (float *a, double *b, int n)
> {
>   for (int i = 0; i < n; ++i)
>     b[i] = (double) fn3(a[i]);
> }
> 
> 
> Richard.
> 
> >>>
> >>> That said, I wonder how we end up mixing things up in the first place.
> >>>
> >>> Richard.
> >>
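P.S.  For reference, the kind of check I mean would sit in the
simd-clone analysis (vectorizable_simd_clone_call) and reject the
clone when the mask materialization isn't directly supported, instead
of letting vector lowering decompose it.  A rough sketch only: the
expand_vec_cond_expr_p signature is from memory and the variable
names are made up for illustration, this is not a tested patch:

  /* The clone's ABI wants a vector(4) int mask but the loop produces
     a vector(4) <signed-boolean:4> SVE predicate, so the ABI mask is
     materialized via
       _44 = VEC_COND_EXPR <loop_mask_41, { 1, 1, 1, 1 }, { 0, 0, 0, 0 }>;
     Verify the target can expand that VEC_COND_EXPR before choosing
     this clone.  */
  if (!expand_vec_cond_expr_p (clone_mask_vectype /* vector(4) int */,
			       loop_mask_vectype  /* vector(4) <signed-boolean:4> */,
			       SSA_NAME /* mask operand, no embedded comparison */))
    return false;  /* Fail analysis; fall back to another clone or scalar.  */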