| | |
|---|---|
| Issue | 161446 |
| Summary | [AArch64] Suboptimal use of addressing modes |
| Labels | backend:AArch64, missed-optimization |
| Assignees | |
| Reporter | Kmeakin |
https://godbolt.org/z/j1P3axEbd
For loads and stores of the form `p[k * x]` with a constant, non-power-of-two `k`, LLVM generates longer and/or slower code than GCC in many cases.
# Example 1
For `p[3 * x]`:
## LLVM
```asm
ldrw_3x(unsigned int*, unsigned long):
mov w8, #12
mul x8, x1, x8
ldr w0, [x0, x8]
ret
```
## GCC
```asm
ldrw_3x(unsigned int*, unsigned long):
add x1, x1, x1, lsl 1
ldr w0, [x0, x1, lsl 2]
ret
```
# Example 2
For `p[6 * x]`, GCC generates the same number of instructions, but according to llvm-mca they are faster:
## LLVM
```asm
ldrw_6x(unsigned int*, unsigned long):
mov w8, #24
mul x8, x1, x8
ldr w0, [x0, x8]
ret
Iterations: 100
Instructions: 400
Total Cycles: 702
Total uOps: 400
Dispatch Width: 3
uOps Per Cycle: 0.57
IPC: 0.57
Block RThroughput: 2.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 1 0.33 mov w8, #24
1 5 2.00 mul x8, x1, x8
1 2 0.50 * ldr w0, [x0, x8]
1 1 1.00 U ret
```
## GCC
```asm
ldrw_6x(unsigned int*, unsigned long):
add x1, x1, x1, lsl 1
lsl x1, x1, 3
ldr w0, [x0, x1]
ret
Iterations: 100
Instructions: 400
Total Cycles: 502
Total uOps: 400
Dispatch Width: 3
uOps Per Cycle: 0.80
IPC: 0.80
Block RThroughput: 1.3
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 2 0.33 add x1, x1, x1, lsl #1
1 2 0.33 lsl x1, x1, #3
1 2 0.50 * ldr w0, [x0, x1]
1 1 1.00 U ret
```
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs