| Field | Value |
| --- | --- |
| Issue | 178703 |
| Summary | Cost and register usage in some partial reductions doesn't match the generated instructions |
| Labels | backend:AArch64, vectorizers |
| Assignees | |
| Reporter | john-brawn-arm |

If we consider these two functions:
```
int test1(int n, char *a, char *b) {
int accum = 0;
for (int i = 0; i < n; i++) {
accum += a[i] * b[i];
}
return accum;
}
int test2(int n, char *a, char *b) {
int accum = 0;
for (int i = 0; i < n; i++) {
accum -= a[i] * b[i];
}
return accum;
}
```
Compiling with ``clang --target=aarch64-none-elf -mcpu=neoverse-v1 -O3 -mllvm -force-vector-interleave=1`` the vector loops generated for these are
```
test1:
.LBB0_7: // %vector.body
// =>This Inner Loop Header: Depth=1
ldr q1, [x12], #16
ldr q2, [x13], #16
subs x14, x14, #16
udot v0.4s, v2.16b, v1.16b
b.ne .LBB0_7
test2:
.LBB1_7: // %vector.body
// =>This Inner Loop Header: Depth=1
ldr q1, [x12], #16
ldr q2, [x13], #16
subs x14, x14, #16
umull2 v3.8h, v2.16b, v1.16b
umull v1.8h, v2.8b, v1.8b
usubw v0.4s, v0.4s, v1.4h
usubw2 v0.4s, v0.4s, v1.8h
usubw v0.4s, v0.4s, v3.4h
usubw2 v0.4s, v0.4s, v3.8h
b.ne .LBB1_7
```
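To make the difference between the two loop bodies concrete, here is a rough scalar model in C of what each instruction sequence does to the four 32-bit accumulator lanes. The function names (`model_udot`, `model_mul_sub`, `sum_lanes`) are made up for illustration, and the model assumes unsigned bytes (as the `udot`/`umull` selection implies for `char` on this target) and wrap-around 32-bit arithmetic. It is a sketch of the instruction semantics, not of anything in LLVM.

```c
#include <stdint.h>

/* Scalar model of: udot v0.4s, v2.16b, v1.16b
 * Each 32-bit lane accumulates the sum of four adjacent byte products. */
static void model_udot(uint32_t acc[4], const uint8_t a[16], const uint8_t b[16]) {
    for (int lane = 0; lane < 4; lane++)
        for (int j = 0; j < 4; j++)
            acc[lane] += (uint32_t)a[4 * lane + j] * b[4 * lane + j];
}

/* Scalar model of the six-instruction test2 body:
 * umull2/umull produce sixteen 16-bit products in two registers,
 * then four usubw/usubw2 widen them and subtract from the accumulator. */
static void model_mul_sub(uint32_t acc[4], const uint8_t a[16], const uint8_t b[16]) {
    uint16_t lo[8], hi[8];
    for (int i = 0; i < 8; i++) {
        lo[i] = (uint16_t)(a[i] * b[i]);         /* umull  v1.8h, v2.8b,  v1.8b  */
        hi[i] = (uint16_t)(a[8 + i] * b[8 + i]); /* umull2 v3.8h, v2.16b, v1.16b */
    }
    for (int i = 0; i < 4; i++) {
        acc[i] -= lo[i];     /* usubw  v0.4s, v0.4s, v1.4h */
        acc[i] -= lo[4 + i]; /* usubw2 v0.4s, v0.4s, v1.8h */
        acc[i] -= hi[i];     /* usubw  v0.4s, v0.4s, v3.4h */
        acc[i] -= hi[4 + i]; /* usubw2 v0.4s, v0.4s, v3.8h */
    }
}

/* Final horizontal reduction performed after the vector loop. */
static uint32_t sum_lanes(const uint32_t acc[4]) {
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

The two models assign products to different lanes, but for the same inputs `sum_lanes` of the `model_mul_sub` accumulator is the 32-bit negation of `sum_lanes` of the `model_udot` accumulator. The second form needs the extra temporary (`hi`, i.e. `v3`) and six arithmetic instructions where the first needs one.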
If you look at what the vectorizer is doing with ``-mllvm -debug``, it says
```
LV: Checking a loop in 'test1' from tmp.c
Cost of 1 for VF 16: EXPRESSION vp<%9> = ir<%accum.09> + partial.reduce.add (mul nuw nsw (ir<%1> zext to i32), (ir<%0> zext to i32))
LV(REG): Calculating max register usage:
LV(REG): Scaled down VF from 8 to 2 for EXPRESSION vp<%9> = ir<%accum.09> + partial.reduce.add (mul nuw nsw (ir<%1> zext to i32), (ir<%0> zext to i32))
LV(REG): Scaled down VF from 16 to 4 for EXPRESSION vp<%9> = ir<%accum.09> + partial.reduce.add (mul nuw nsw (ir<%1> zext to i32), (ir<%0> zext to i32))
LV: Checking a loop in 'test2' from tmp.c
Cost of 1 for VF 16: EXPRESSION vp<%9> = ir<%accum.09> + partial.reduce.add (sub (0, mul nuw nsw (ir<%1> zext to i32), (ir<%0> zext to i32)))
LV(REG): Calculating max register usage:
LV(REG): Scaled down VF from 8 to 2 for EXPRESSION vp<%9> = ir<%accum.09> + partial.reduce.add (sub (0, mul nuw nsw (ir<%1> zext to i32), (ir<%0> zext to i32)))
LV(REG): Scaled down VF from 16 to 4 for EXPRESSION vp<%9> = ir<%accum.09> + partial.reduce.add (sub (0, mul nuw nsw (ir<%1> zext to i32), (ir<%0> zext to i32)))
```
It thinks that the cost of both reductions is the same, and that the relevant type for calculating register usage in both is v4i32 (the result type), so the register usage is 1. In test1 the reduction becomes a single udot instruction, so this looks correct. In test2, however, we get a sequence of 6 instructions and use one extra register, because the intermediate v16i16 gets split into two v8i16. So for test2 both the cost and the register usage are wrong.
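The register arithmetic behind that can be sketched in a few lines of C. This is only a back-of-the-envelope model assuming 128-bit NEON vector registers; `neon_regs_for` is a hypothetical helper, not the function LLVM uses to compute register usage.

```c
/* Hypothetical helper: how many 128-bit NEON registers are needed to hold
 * a fixed-width vector of num_elts elements, each elt_bits wide.
 * Assumes 128-bit vector registers and rounds up to whole registers. */
static unsigned neon_regs_for(unsigned num_elts, unsigned elt_bits) {
    const unsigned reg_bits = 128;
    unsigned total_bits = num_elts * elt_bits;
    return (total_bits + reg_bits - 1) / reg_bits;
}
```

Under this model `neon_regs_for(4, 32)` (v4i32, the result type) is 1, while `neon_regs_for(16, 16)` (the intermediate v16i16) is 2, which matches the two `umull` results held in `v1` and `v3` in the test2 loop.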