| Issue |
176879
|
| Summary |
[X86] Match PCLMULQDQ codegen with llvm.clmul intrinsic implementation
|
| Labels |
backend:X86,
missed-optimization
|
| Assignees |
|
| Reporter |
RKSimon
|
https://rust.godbolt.org/z/Gznh7aMrY
Attempting to recreate the PCLMULQDQ instruction using the generic llvm.clmul intrinsic results in less than ideal codegen:
```ll
define <2 x i64> @pclmul(<2 x i64> %v0, <2 x i64> %v1) {
%i0 = zext i1 0 to i64 ; constant time lo/hi select
%i1 = zext i1 1 to i64 ; constant time lo/hi select
%a0 = extractelement <2 x i64> %v0, i64 %i0
%a1 = extractelement <2 x i64> %v1, i64 %i1
%x0 = zext i64 %a0 to i128
%x1 = zext i64 %a1 to i128
%cl = call i128 @llvm.clmul.i128(i128 %x0, i128 %x1)
%r = bitcast i128 %cl to <2 x i64>
ret <2 x i64> %r
}
```
```asm
pclmul: # @pclmul
vpshufd $238, %xmm1, %xmm1 # xmm1 = xmm1[2,3,2,3]
xorl %eax, %eax
vmovq %rax, %xmm2
vpclmulqdq $0, %xmm2, %xmm1, %xmm3
vmovq %xmm3, %rax
vpclmulqdq $0, %xmm2, %xmm0, %xmm2
vmovq %xmm2, %rcx
xorq %rax, %rcx
vpclmulqdq $0, %xmm1, %xmm0, %xmm0
vpextrq $1, %xmm0, %rax
xorq %rcx, %rax
vmovq %rax, %xmm1
vpunpcklqdq %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[0]
retq
```
- Failing to fold clmul(x,0) -> 0
- Failing to fold shuffles into pclmulqdq masks - pclmulqdq(shuffle(x),y,c0) -> pclmulqdq(x,y,c1)
- Avoiding FPU <-> GPR traffic
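With those folds in place, the IR above should lower to a single instruction: the zext'd index selects v0's low qword and v1's high qword, which corresponds to a PCLMULQDQ immediate of 0x10 (bit 0 selects the qword of the first source, bit 4 the qword of the second). A rough sketch of the expected codegen (AT&T syntax, exact register allocation may differ):

```asm
pclmul: # @pclmul
	vpclmulqdq $16, %xmm1, %xmm0, %xmm0 # xmm0 = clmul(xmm0[0], xmm1[1])
	retq
```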
This ticket isn't about removing the PCLMULQDQ intrinsics - just ensuring that llvm.clmul lowering is reasonably efficient so we can safely use it for other bit twiddling tricks.
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs