| Issue | 165964 |
| Summary | [X86] Vector 8-bit shifts by variable amounts should use power of two multiply |
| Labels | new issue |
| Assignees | |
| Reporter | WalterKruger |

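Vector 8-bit left shifts by a variable amount are currently implemented as three selection steps. Each step uses `pblendvb` to choose between the value shifted by a power of two (4, then 2, then 1) and the unshifted value, based on the corresponding bit of the shift amount: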
```asm
shiftVarU8_sse41:
movdqa xmm2, xmm0
movdqa xmm3, xmm0
psllw xmm3, 4
pand xmm3, xmmword ptr [rip + .LCPI1_0]
psllw xmm1, 5
movdqa xmm0, xmm1
pblendvb xmm2, xmm3, xmm0
movdqa xmm3, xmm2
psllw xmm3, 2
pand xmm3, xmmword ptr [rip + .LCPI1_1]
paddb xmm1, xmm1
movdqa xmm0, xmm1
pblendvb xmm2, xmm3, xmm0
movdqa xmm3, xmm2
paddb xmm3, xmm2
paddb xmm1, xmm1
movdqa xmm0, xmm1
pblendvb xmm2, xmm3, xmm0
movdqa xmm0, xmm2
ret
```
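
For reference, the per-byte effect of the sequence above can be modelled in scalar C. This is only a sketch of one byte lane, not LLVM code: the names `shl_u8_blend` and `check_shl_u8_blend` are hypothetical, and the model assumes that the two `pand` constants (whose values are not shown in the listing) clear the bits that `psllw` drags in from the neighbouring byte.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one byte lane of the blend-based sequence.
   Hypothetical helper, written only to illustrate the listing above. */
static uint8_t shl_u8_blend(uint8_t x, uint8_t amt)
{
    /* psllw xmm1, 5: amount bit 2 lands in the byte's top bit,
       which is the bit pblendvb tests. */
    uint8_t sel = (uint8_t)(amt << 5);

    /* psllw 4 + pand (assumed mask), selected by amount bit 2. */
    if (sel & 0x80)
        x = (uint8_t)(x << 4);

    /* paddb xmm1, xmm1: amount bit 1 moves into the top bit. */
    sel = (uint8_t)(sel << 1);
    if (sel & 0x80)
        x = (uint8_t)(x << 2);

    /* paddb xmm1, xmm1 again: amount bit 0 moves into the top bit. */
    sel = (uint8_t)(sel << 1);
    if (sel & 0x80)
        x = (uint8_t)(x << 1); /* paddb xmm3, xmm2 doubles the value */

    return x;
}

/* Exhaustive comparison against a plain byte shift for amounts 0..7. */
static void check_shl_u8_blend(void)
{
    for (int x = 0; x < 256; ++x)
        for (int a = 0; a < 8; ++a)
            assert(shl_u8_blend((uint8_t)x, (uint8_t)a) == (uint8_t)(x << a));
}
```

Calling `check_shl_u8_blend()` from a test driver exercises every value and every in-range amount.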
If SSSE3 is available, it is cheaper to multiply by a power of two instead. The power of two can be obtained by using `pshufb` as a lookup table (although clang tends to implement multiplies less efficiently than the sequence below):
```asm
shiftVarU8_ideal:
pand xmm1, xmmword ptr [rip + .LCPI2_0]
movq xmm2, qword ptr [rip + .LCPI2_1]
pshufb xmm2, xmm1
movdqa xmm1, xmm2
pmullw xmm1, xmm0
pand xmm1, xmmword ptr [rip + .LCPI2_2]
pand xmm2, xmmword ptr [rip + .LCPI2_3]
psrlw xmm0, 8
pmullw xmm0, xmm2
por xmm0, xmm1
ret
```
The first `pand` may be skipped if amounts greater than 7 are considered undefined. This method appears to be optimal for targets below AVX512BW.
https://godbolt.org/z/GabW4h6TG
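
As a sanity check on the multiply-based sequence, one 16-bit lane (two adjacent bytes) can be modelled in scalar C. This is a sketch under assumptions, since the listing does not show the constant pool: the table loaded by `movq` is assumed to hold `1 << i` for `i` in 0..7, `.LCPI2_0` is assumed to be `0x07` per byte, and `.LCPI2_2` / `.LCPI2_3` are assumed to be `0x00FF` / `0xFF00` per word. The names `POW2_LUT`, `shl_u8x2_mul` and `check_shl_u8x2_mul` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed contents of the 8-byte table loaded by movq. */
static const uint8_t POW2_LUT[8] = {1, 2, 4, 8, 16, 32, 64, 128};

/* Scalar model of one 16-bit lane holding two bytes to shift. */
static uint16_t shl_u8x2_mul(uint16_t x, uint16_t amt)
{
    /* pand + pshufb: look up 1 << (amount & 7) for each byte. */
    uint16_t lo = POW2_LUT[amt & 7];
    uint16_t hi = POW2_LUT[(amt >> 8) & 7];
    uint16_t pow2 = (uint16_t)(lo | (hi << 8));

    /* pmullw + pand 0x00FF: the low byte of the 16-bit product
       depends only on the low bytes of both operands. */
    uint16_t even = (uint16_t)(((uint32_t)pow2 * x) & 0x00FF);

    /* pand 0xFF00, psrlw 8, pmullw: the high byte of x times
       (power of two << 8), truncated to 16 bits. */
    uint16_t odd = (uint16_t)((uint32_t)(x >> 8) * (uint32_t)(pow2 & 0xFF00));

    /* por */
    return (uint16_t)(even | odd);
}

/* Exhaustive comparison against independent byte shifts, amounts 0..7. */
static void check_shl_u8x2_mul(void)
{
    for (uint32_t x = 0; x <= 0xFFFF; ++x) {
        for (unsigned a = 0; a < 64; ++a) {
            unsigned a_lo = a & 7, a_hi = a >> 3;
            uint8_t want_lo = (uint8_t)((x & 0xFF) << a_lo);
            uint8_t want_hi = (uint8_t)((x >> 8) << a_hi);
            uint16_t want = (uint16_t)(want_lo | (want_hi << 8));
            uint16_t amt = (uint16_t)(a_lo | (a_hi << 8));
            assert(shl_u8x2_mul((uint16_t)x, amt) == want);
        }
    }
}
```

The odd byte needs no final mask in this model because multiplying by a value whose low byte is zero already leaves the low byte of the product zero, which matches the absence of a fifth `pand` in the listing.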