On Sat, Sep 26, 2020 at 12:41 AM Segher Boessenkool
<seg...@kernel.crashing.org> wrote:
>
> On Fri, Sep 25, 2020 at 08:58:35AM +0200, Richard Biener wrote:
> > On Thu, Sep 24, 2020 at 9:38 PM Segher Boessenkool
> > <seg...@kernel.crashing.org> wrote:
> > > after which I get (-march=znver2)
> > >
> > > setg:
> > >         vmovd   %edi, %xmm1
> > >         vmovd   %esi, %xmm2
> > >         vpbroadcastd    %xmm1, %ymm1
> > >         vpbroadcastd    %xmm2, %ymm2
> > >         vpcmpeqd        .LC0(%rip), %ymm1, %ymm1
> > >         vpandn  %ymm0, %ymm1, %ymm0
> > >         vpand   %ymm2, %ymm1, %ymm1
> > >         vpor    %ymm0, %ymm1, %ymm0
> > >         ret
> >
> > I get with -march=znver2 -O2
> >
> >         vmovd   %edi, %xmm1
> >         vmovd   %esi, %xmm2
> >         vpbroadcastd    %xmm1, %ymm1
> >         vpbroadcastd    %xmm2, %ymm2
> >         vpcmpeqd        .LC0(%rip), %ymm1, %ymm1
> >         vpblendvb       %ymm1, %ymm2, %ymm0, %ymm0
>
> Ah, maybe my x86 compiler is too old...
>
> x86_64-linux-gcc (GCC) 10.0.0 20190919 (experimental)
>
> not exactly old, huh.  I wonder what I do wrong then.
>
> > Now, with SSE4.2 the 16byte case compiles to
> >
> > setg:
> > .LFB0:
> >         .cfi_startproc
> >         movd    %edi, %xmm3
> >         movdqa  %xmm0, %xmm1
> >         movd    %esi, %xmm4
> >         pshufd  $0, %xmm3, %xmm0
> >         pcmpeqd .LC0(%rip), %xmm0
> >         movdqa  %xmm0, %xmm2
> >         pandn   %xmm1, %xmm2
> >         pshufd  $0, %xmm4, %xmm1
> >         pand    %xmm1, %xmm0
> >         por     %xmm2, %xmm0
> >         ret
> >
> > since there's no blend with a variable mask IIRC.
>
> PowerPC got at least *that* right since time immemorial :-)
>
> > with aarch64 and SVE it doesn't handle the 32byte case at all,
> > the 16byte case compiles to
> >
> > setg:
> > .LFB0:
> >         .cfi_startproc
> >         adrp    x2, .LC0
> >         dup     v1.4s, w0
> >         dup     v2.4s, w1
> >         ldr     q3, [x2, #:lo12:.LC0]
> >         cmeq    v1.4s, v1.4s, v3.4s
> >         bit     v0.16b, v2.16b, v1.16b
> >
> > which looks equivalent to the AVX2 code.
>
> Yes, and we can do pretty much the same on Power, too.
>
> > For all of those varying the vector element type may also
> > cause "issues" I guess.
>
> For us, as long as it stays 16B vectors, all should be fine.
> There may be issues in the compiler, but at least the hardware has no
> problem with it ;-)
>
> > > and for powerpc (changing it to 16B vectors, -mcpu=power9) it is
> > >
> > > setg:
> > >         addis 9,2,.LC0@toc@ha
> > >         mtvsrws 32,5
> > >         mtvsrws 33,6
> > >         addi 9,9,.LC0@toc@l
> > >         lxv 45,0(9)
> > >         vcmpequw 0,0,13
> > >         xxsel 34,34,33,32
> > >         blr
>
> The -mcpu=power10 code right now is just
>
>         plxv 45,.LC0@pcrel
>         mtvsrws 32,5
>         mtvsrws 33,6
>         vcmpequw 0,0,13
>         xxsel 34,34,33,32
>         blr
>
> (exactly the same, but less memory address setup cost), so doing
> something like this as a generic version would work quite well pretty
> much everywhere I think!
Given we don't have a good way to query for variable-blend support,
the only change would be to use the bit-and/or variant, which would
probably be fine (and eventually combined back into a blend where the
target has one).  I guess that could be done in ISEL as well (given
support for generating the mask, of course).  Of course that makes it
difficult for targets to opt out (generate the in-memory variant)...

Richard.

>
> Segher