Issue |
135486
|
Summary |
[LLVM] APInt::tcAdd has quiet poor codegen.
|
Labels |
new issue
|
Assignees |
|
Reporter |
Ralender
|
Here is an example of 1 iteration of the loop
Compiled with clang trunk -O3
```asm
; start of iteration
89: 4c 8b 14 cf movq (%rdi,%rcx,8), %r10 ; load dst[i]
8d: 4c 8b 0c ce movq (%rsi,%rcx,8), %r9 ; load rhs[i]
91: 48 85 d2 testq %rdx, %rdx ; test if we have a carry
94: 74 1a je 0xb0 <_ZN4llvm5APInt5tcAddEPmPKmmj+0xb0>
96: 4f 8d 4c 0a 01 leaq 0x1(%r10,%r9), %r9 ; add with a carry
9b: 4d 39 d1 cmpq %r10, %r9
9e: 41 0f 96 c2 setbe %r10b ; put carry of next iteration in r10b
a2: eb 13 jmp 0xb7 <_ZN4llvm5APInt5tcAddEPmPKmmj+0xb7>
b0: 4d 01 d1 addq %r10, %r9 ; add without carry
b3: 41 0f 92 c2 setb %r10b ; put carry of next iteration in r10b
b7: 4c 89 0c cf movq %r9, (%rdi,%rcx,8) ; write result back to memory
; next iteration unrolled
bb: 4c 8b 4c cf 08 movq 0x8(%rdi,%rcx,8), %r9
c0: 48 8b 54 ce 08 movq 0x8(%rsi,%rcx,8), %rdx
c5: 45 84 d2 testb %r10b, %r10b
; ...
```
Just as a reference here is What I think optimal x86-64 code looks like
```
; start of iteration
; carry is in CF,
12a1: 4a 8b 4c ce f0 movq -0x10(%rsi,%r9,8), %rcx ; load rhs[i]
12a6: 4a 11 4c cf f0 adcq %rcx, -0x10(%rdi,%r9,8) ; lhs[i] = lhs[i] + rhs[i] + carry
; next iteration unrolled
12ab: 4a 8b 4c ce e8 movq -0x18(%rsi,%r9,8), %rcx
12b0: 4a 11 4c cf e8 adcq %rcx, -0x18(%rdi,%r9,8)
;...
```
Stuff that could be done better:
- one side of the branch detects overflow with a cmp instead of CF
- The branch doesn't get eliminated.
- adc is not used.
I made a benchmark to get an idea of how big the impact is.
APInt::tcAdd : 1728.4 ns for 4096 iterations in average
optimized code : 1045.6 ns for 4096 iterations in average
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs