On Sat, Sep 26, 2020 at 12:41 AM Segher Boessenkool
<seg...@kernel.crashing.org> wrote:
>
> On Fri, Sep 25, 2020 at 08:58:35AM +0200, Richard Biener wrote:
> > On Thu, Sep 24, 2020 at 9:38 PM Segher Boessenkool
> > <seg...@kernel.crashing.org> wrote:
> > > after which I get (-march=znver2)
> > >
> > > setg:
> > >         vmovd   %edi, %xmm1
> > >         vmovd   %esi, %xmm2
> > >         vpbroadcastd    %xmm1, %ymm1
> > >         vpbroadcastd    %xmm2, %ymm2
> > >         vpcmpeqd        .LC0(%rip), %ymm1, %ymm1
> > >         vpandn  %ymm0, %ymm1, %ymm0
> > >         vpand   %ymm2, %ymm1, %ymm1
> > >         vpor    %ymm0, %ymm1, %ymm0
> > >         ret
> >
> > I get with -march=znver2 -O2
> >
> >         vmovd   %edi, %xmm1
> >         vmovd   %esi, %xmm2
> >         vpbroadcastd    %xmm1, %ymm1
> >         vpbroadcastd    %xmm2, %ymm2
> >         vpcmpeqd        .LC0(%rip), %ymm1, %ymm1
> >         vpblendvb       %ymm1, %ymm2, %ymm0, %ymm0
>
> Ah, maybe my x86 compiler is too old...
>   x86_64-linux-gcc (GCC) 10.0.0 20190919 (experimental)
> not exactly old, huh.  I wonder what I do wrong then.
>
> > Now, with SSE4.2 the 16byte case compiles to
> >
> > setg:
> > .LFB0:
> >         .cfi_startproc
> >         movd    %edi, %xmm3
> >         movdqa  %xmm0, %xmm1
> >         movd    %esi, %xmm4
> >         pshufd  $0, %xmm3, %xmm0
> >         pcmpeqd .LC0(%rip), %xmm0
> >         movdqa  %xmm0, %xmm2
> >         pandn   %xmm1, %xmm2
> >         pshufd  $0, %xmm4, %xmm1
> >         pand    %xmm1, %xmm0
> >         por     %xmm2, %xmm0
> >         ret
> >
> > since there's no blend with a variable mask IIRC.
>
> PowerPC got at least *that* right since time immemorial :-)
>
> > with aarch64 and SVE it doesn't handle the 32byte case at all,
> > the 16byte case compiles to
> >
> > setg:
> > .LFB0:
> >         .cfi_startproc
> >         adrp    x2, .LC0
> >         dup     v1.4s, w0
> >         dup     v2.4s, w1
> >         ldr     q3, [x2, #:lo12:.LC0]
> >         cmeq    v1.4s, v1.4s, v3.4s
> >         bit     v0.16b, v2.16b, v1.16b
> >
> > which looks equivalent to the AVX2 code.
>
> Yes, and we can do pretty much the same on Power, too.
>
> > For all of those varying the vector element type may also
> > cause "issues" I guess.
>
> For us, as long as it stays 16B vectors, all should be fine.  There may
> be issues in the compiler, but at least the hardware has no problem with
> it ;-)
>
> > > and for powerpc (changing it to 16B vectors, -mcpu=power9) it is
> > >
> > > setg:
> > >         addis 9,2,.LC0@toc@ha
> > >         mtvsrws 32,5
> > >         mtvsrws 33,6
> > >         addi 9,9,.LC0@toc@l
> > >         lxv 45,0(9)
> > >         vcmpequw 0,0,13
> > >         xxsel 34,34,33,32
> > >         blr
>
> The -mcpu=power10 code right now is just
>
>         plxv 45,.LC0@pcrel
>         mtvsrws 32,5
>         mtvsrws 33,6
>         vcmpequw 0,0,13
>         xxsel 34,34,33,32
>         blr
>
> (exactly the same, but less memory address setup cost), so doing
> something like this as a generic version would work quite well pretty
> much everywhere I think!

Given we don't have a good way to query for variable blend support,
the only change would be to use the bit-and/or form, which would
probably be fine (and eventually be combined into a blend).  I guess
that could be done in ISEL as well (given support for generating the
mask, of course).

Of course that makes it difficult for targets to opt out (and
generate the in-memory variant)...

Richard.

>
> Segher
