On Thu, 1 Feb 2024, Andre Vieira (lists) wrote:

> 
> 
> On 01/02/2024 07:19, Richard Biener wrote:
> > On Wed, 31 Jan 2024, Andre Vieira (lists) wrote:
> > 
> > 
> > The patch didn't come with a testcase so it's really hard to tell
> > what goes wrong now and how it is fixed ...
> 
> My bad! I had a testcase locally but never added it...
> 
> However... now that I've looked at it again and run it past Richard S, the
> codegen isn't 'wrong', but it does have the potential to be pretty slow,
> especially for inbranch simdclones, where it transforms the SVE predicate
> into an Advanced SIMD vector by inserting the elements one at a time (see
> the sketch after the testcase below)...
> 
> An example of this can be seen if you do:
> 
> gcc -O3 -march=armv8-a+sve -msve-vector-bits=128  -fopenmp-simd t.c -S
> 
> with the following t.c:
> #pragma omp declare simd simdlen(4) inbranch
> int __attribute__ ((const)) fn5(int);
> 
> void fn4 (int *a, int *b, int n)
> {
>     for (int i = 0; i < n; ++i)
>         b[i] = fn5(a[i]);
> }
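> 
> To give a rough idea of what that argument preparation amounts to, here is
> a hypothetical GNU C sketch (not the actual generated code) of the
> conversion the inbranch clone call ends up needing: each lane of the
> predicate is tested and inserted into the vector(4) int mask individually.
> 
> typedef int v4si __attribute__ ((vector_size (16)));
> 
> /* Hypothetical sketch: build the vector(4) int mask expected by the
>    Advanced SIMD clone from the per-lane values of an SVE predicate,
>    one insert at a time.  */
> static v4si
> mask_from_pred_slow (const _Bool pred_lane[4])
> {
>   v4si mask = { 0, 0, 0, 0 };
>   for (int lane = 0; lane < 4; ++lane)
>     mask[lane] = pred_lane[lane] ? 1 : 0;   /* one insert per lane */
>   return mask;
> }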
> 
> Now I do have to say, for our main use case of libmvec we won't have any
> 'inbranch' Advanced SIMD clones, so we avoid that issue... But of course
> that doesn't mean user code will avoid it.

It seems to use SVE masks with vector(4) <signed-boolean:4> while the
ABI says the mask is vector(4) int.  You say that's because we choose an
Advanced SIMD clone for the SVE VLS vector code (it calls _ZGVnM4v_fn5).

The vectorizer creates

  _44 = VEC_COND_EXPR <loop_mask_41, { 1, 1, 1, 1 }, { 0, 0, 0, 0 }>;

and then vector lowering decomposes this.  That means the vectorizer
lacks a check that the target handles this VEC_COND_EXPR.

Of course I would expect that SVE with VLS vectors is able to
code-generate this operation, so in the end it's a case of missing patterns.
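
For reference, with 128-bit VLS SVE this should be a single predicated
select; a rough ACLE sketch of the operation (an illustration of what the
target can do, not the missing pattern itself) would be:

#include <arm_sve.h>

/* Rough illustration: the VEC_COND_EXPR above corresponds to one
   predicated SEL between splatted 1s and 0s.  */
static svint32_t
mask_from_pred_fast (svbool_t loop_mask)
{
  return svsel_s32 (loop_mask, svdup_s32 (1), svdup_s32 (0));
}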

Richard.

> I'm going to remove this patch and run another regression test to see if it
> catches anything weird, but if not then I guess we do have the option to not
> use this patch and aim to solve the costing or codegen issue in GCC 15.  We
> don't currently do any simdclone costing, and I don't have a clear
> suggestion for how to do it, given that OpenMP has no mechanism (that I know
> of) to expose the speedup of a simdclone over its scalar variant.  So how
> would we 'compare' a simdclone call, with its extra overhead of argument
> preparation, against the scalar version?  At least we could prefer a call to
> a different simdclone with less argument preparation.  Anyway, I digress.
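> 
> Purely as an aside, here is a made-up sketch of the kind of comparison I
> mean (nothing like this exists in GCC today, and all the names below are
> invented for illustration):
> 
> /* Hypothetical sketch of comparing simd clone candidates.  */
> struct simdclone_candidate
> {
>   int call_cost;      /* cost of the call itself */
>   int arg_prep_cost;  /* e.g. predicate -> vector(4) int conversion */
>   int simdlen;        /* lanes handled per call */
> };
> 
> /* Without a way to know a clone's speedup over the scalar function, we
>    could at least prefer the candidate with the cheapest per-lane cost.  */
> static int
> per_lane_cost (const struct simdclone_candidate *c)
> {
>   return (c->call_cost + c->arg_prep_cost) / c->simdlen;
> }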
> 
> The other tests require --param aarch64-autovec-preference=2, so they worry
> me less...
> 
> gcc -O3 -march=armv8-a+sve -msve-vector-bits=128 --param
> aarch64-autovec-preference=2 -fopenmp-simd t.c -S
> 
> t.c:
> #pragma omp declare simd simdlen(2) notinbranch
> float __attribute__ ((const)) fn1(double);
> 
> void fn0 (float *a, float *b, int n)
> {
>     for (int i = 0; i < n; ++i)
>         b[i] = fn1((double) a[i]);
> }
> 
> #pragma omp declare simd simdlen(2) notinbranch
> float __attribute__ ((const)) fn3(float);
> 
> void fn2 (float *a, double *b, int n)
> {
>     for (int i = 0; i < n; ++i)
>         b[i] = (double) fn3(a[i]);
> }
> 
> > Richard.
> > 
> >>>
> >>> That said, I wonder how we end up mixing things up in the first place.
> >>>
> >>> Richard.
> >>
> > 
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
