| Issue | 71515 |
| Summary | [AArch64] Missed vectorisation opportunity (tsvc, s122) |
| Labels | backend:AArch64, vectorization |
| Assignees | |
| Reporter | sjoerdmeijer |

Clang does not vectorise kernel s122 from TSVC, whereas GCC does. As a result, the Clang-compiled code is about 2x slower. To reproduce, compile this input with `-O3 -ffast-math -mcpu=neoverse-v2`:
```
__attribute__((aligned(64))) float x[32000];
__attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000],
                                   aa[256][256],bb[256][256],cc[256][256],tt[256][256];

int dummy(float[32000], float[32000], float[32000], float[32000], float[32000],
          float[256][256], float[256][256], float[256][256], float);

float s122(int xa, int xb)
{
    int n1 = xa;
    int n3 = xb;
    int j, k;
    for (int nl = 0; nl < 100000; nl++) {
        j = 1;
        k = 0;
        for (int i = n1-1; i < 32000; i += n3) {
            k += j;
            a[i] += b[32000 - k];
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }
}
```
Clang's codegen:
```
.LBB0_3:                                // Parent Loop BB0_2 Depth=1
        ldr     s0, [x8], #-4
        ldr     s1, [x19, x9, lsl #2]
        fadd    s0, s1, s0
        str     s0, [x19, x9, lsl #2]
        add     x9, x9, x21
        cmp     x9, x22
        b.lt    .LBB0_3
```
GCC's codegen:
```
.L6:
        ldr     q30, [x1], -16
        ldr     q29, [x0]
        mov     v31.16b, v30.16b
        tbl     v30.16b, {v30.16b - v31.16b}, v28.16b
        fadd    v31.4s, v29.4s, v30.4s
        str     q31, [x0], 16
        cmp     x22, x0
        bne     .L6
```
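For reference, here is one possible reading of the GCC loop, written back as C. In the vector loop, `a` is accessed with a fixed 16-byte step and `b` is walked backwards by 16 bytes, which suggests that GCC versions the loop for `n3 == 1` and uses `tbl` to reverse the lanes loaded from `b`. This is only an interpretation of the assembly, not something taken from the GCC sources. The function names `s122_inner_scalar` and `s122_inner_versioned`, the `LEN` macro, and passing the arrays as pointer parameters are hypothetical, introduced for illustration only.

```
#include <assert.h>
#include <stddef.h>

#define LEN 32000

/* Scalar reference: the inner loop of s122, with the arrays passed in
   as parameters (an assumption, to make the sketch testable). */
void s122_inner_scalar(float *a, const float *b, int n1, int n3)
{
    assert(n1 >= 1 && n3 >= 1);
    int k = 0;
    for (int i = n1 - 1; i < LEN; i += n3) {
        k += 1;                     /* j is always 1 in the kernel */
        a[i] += b[LEN - k];
    }
}

/* Versioned variant: a fast path for n3 == 1 that handles four
   elements per iteration and reads b in reversed order, which is
   what the GCC vector loop appears to do. */
void s122_inner_versioned(float *a, const float *b, int n1, int n3)
{
    assert(n1 >= 1 && n3 >= 1);
    if (n3 != 1) {                  /* non-unit stride: scalar fallback */
        s122_inner_scalar(a, b, n1, n3);
        return;
    }
    int i = n1 - 1;
    int k = 0;                      /* elements processed so far */
    for (; i + 4 <= LEN; i += 4, k += 4) {
        /* Contiguous block of b, consumed in reversed lane order. */
        const float *blk = &b[LEN - k - 4];
        a[i + 0] += blk[3];
        a[i + 1] += blk[2];
        a[i + 2] += blk[1];
        a[i + 3] += blk[0];
    }
    for (; i < LEN; i++, k++)       /* scalar tail */
        a[i] += b[LEN - k - 1];
}
```

Each `a[i]` still receives exactly one addition with the same operands, so the two functions should give identical results for any valid `n1` and `n3`. Whether LLVM's loop vectoriser can be taught to emit a similar stride check is left to the root cause analysis.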
See also:
https://godbolt.org/z/7zzrPazM7
TODO: root cause analysis.