Issue 165964
Summary [X86] Vector 8-bit shifts by variable amounts should use power of two multiply
Labels new issue
Assignees
Reporter WalterKruger
Vector 8-bit shifts by a variable amount are currently implemented as three selections, each choosing between the unshifted value and a shift by a power of two (4, 2, then 1) based on the corresponding bit of the shift amount:

```asm
shiftVarU8_sse41:
        movdqa  xmm2, xmm0
        movdqa  xmm3, xmm0
        psllw   xmm3, 4
        pand    xmm3, xmmword ptr [rip + .LCPI1_0]
        psllw   xmm1, 5
        movdqa  xmm0, xmm1
        pblendvb        xmm2, xmm3, xmm0
        movdqa  xmm3, xmm2
        psllw   xmm3, 2
        pand    xmm3, xmmword ptr [rip + .LCPI1_1]
        paddb   xmm1, xmm1
        movdqa  xmm0, xmm1
        pblendvb        xmm2, xmm3, xmm0
        movdqa  xmm3, xmm2
        paddb   xmm3, xmm2
        paddb   xmm1, xmm1
        movdqa  xmm0, xmm1
        pblendvb        xmm2, xmm3, xmm0
        movdqa  xmm0, xmm2
        ret
```
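As a rough illustration of what the sequence above computes, here is a scalar C model of a single byte lane. It is only a sketch: the function names are hypothetical, and it assumes the two `pand` constants (not shown in the listing) simply clear the bits that `psllw` carries in from the neighbouring byte, which the model emulates by truncating to `uint8_t`.

```c
#include <stdint.h>

/* Hypothetical scalar model of one byte lane of the blend ladder. */
uint8_t shl_u8_ladder(uint8_t value, uint8_t amount)
{
    /* psllw xmm1, 5: amount bit 2 lands in bit 7, the bit pblendvb tests. */
    uint8_t sel = (uint8_t)(amount << 5);

    if (sel & 0x80)                       /* pblendvb on amount bit 2 */
        value = (uint8_t)(value << 4);    /* psllw 4 + pand (assumed mask) */

    sel = (uint8_t)(sel + sel);           /* paddb: amount bit 1 -> bit 7 */
    if (sel & 0x80)                       /* pblendvb on amount bit 1 */
        value = (uint8_t)(value << 2);    /* psllw 2 + pand (assumed mask) */

    sel = (uint8_t)(sel + sel);           /* paddb: amount bit 0 -> bit 7 */
    if (sel & 0x80)                       /* pblendvb on amount bit 0 */
        value = (uint8_t)(value + value); /* paddb doubles the value */

    return value;
}

/* Counts disagreements with a plain truncating shift for amounts 0..7. */
int ladder_mismatches(void)
{
    int bad = 0;
    for (unsigned v = 0; v < 256; ++v)
        for (unsigned n = 0; n < 8; ++n)
            if (shl_u8_ladder((uint8_t)v, (uint8_t)n) != (uint8_t)(v << n))
                ++bad;
    return bad;
}
```

Each of the three steps costs a shift (or add), a mask, and a blend in the vector code, which is where the length of the sequence comes from.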

If SSSE3 is available, it is cheaper to multiply by a power of two instead, which can be obtained by using a shuffle (`pshufb`) as a lookup table (although clang tends to implement multiplies less efficiently):

```asm
shiftVarU8_ideal:
        pand    xmm1, xmmword ptr [rip + .LCPI2_0]
        movq    xmm2, qword ptr [rip + .LCPI2_1]
        pshufb  xmm2, xmm1
        movdqa  xmm1, xmm2
        pmullw  xmm1, xmm0
        pand    xmm1, xmmword ptr [rip + .LCPI2_2]
        pand    xmm2, xmmword ptr [rip + .LCPI2_3]
        psrlw   xmm0, 8
        pmullw  xmm0, xmm2
        por     xmm0, xmm1
        ret
```

The first `pand` may be skipped if amounts greater than 7 are considered undefined. This method appears to be optimal until AVX512BW.
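
To make the multiply trick easier to follow, here is a scalar C model of one 16-bit lane (two packed bytes) of `shiftVarU8_ideal`. It is a sketch under assumptions: the function and table names are hypothetical, and the constant-pool values are guessed from how they are used (`0x07` per byte for the amount mask, a table of `1 << n` for the `pshufb`, and `0x00FF` / `0xFF00` for the two byte masks); the listing itself does not show them.

```c
#include <stdint.h>

/* Assumed pshufb table: entry n holds 1 << n. */
const uint8_t POW2_LUT[8] = {1, 2, 4, 8, 16, 32, 64, 128};

/* Hypothetical model of one 16-bit lane; `amounts` packs two per-byte counts. */
uint16_t shl_u8x2_mul(uint16_t values, uint16_t amounts)
{
    uint16_t masked = amounts & 0x0707;                    /* first pand */

    /* pshufb: replace each amount byte with its power of two. */
    uint16_t pow2 = (uint16_t)(POW2_LUT[masked & 0xFF] |
                               (POW2_LUT[masked >> 8] << 8));

    /* pmullw + pand: the low byte of the 16-bit product depends only on
       the two low bytes, so it equals (v_lo << n_lo) & 0xFF. */
    uint16_t lo = (uint16_t)((unsigned)pow2 * values) & 0x00FF;

    /* psrlw 8, pand, pmullw: v_hi * (p_hi << 8), truncated to 16 bits,
       leaves (v_hi << n_hi) & 0xFF in the high byte and zero below it. */
    uint16_t hi = (uint16_t)((unsigned)(values >> 8) * (pow2 & 0xFF00u));

    return (uint16_t)(lo | hi);                            /* por */
}

/* Counts disagreements with per-byte truncating shifts for amounts 0..7. */
int mul_mismatches(void)
{
    int bad = 0;
    for (unsigned v = 0; v < 0x10000; ++v)
        for (unsigned n_lo = 0; n_lo < 8; ++n_lo)
            for (unsigned n_hi = 0; n_hi < 8; ++n_hi) {
                unsigned want_lo = ((v & 0xFF) << n_lo) & 0xFF;
                unsigned want_hi = ((v >> 8) << n_hi) & 0xFF;
                uint16_t want = (uint16_t)(want_lo | (want_hi << 8));
                uint16_t amt = (uint16_t)(n_lo | (n_hi << 8));
                if (shl_u8x2_mul((uint16_t)v, amt) != want)
                    ++bad;
            }
    return bad;
}
```

In this model, dropping the `& 0x0707` step is only safe when every amount is already in the range 0 to 7, which matches the condition stated above for skipping the first `pand`.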


https://godbolt.org/z/GabW4h6TG
