| Issue | 176218 |
| Summary | [RISCV] Costing leads to unprofitable vectorization |
| Labels | backend:RISC-V |
| Assignees | bababuck |
| Reporter | bababuck |

The code below is extracted from x264. The SLP vectorizer vectorizes it and predicts a large win (`SLP: Found cost = -32 for VF=16`), but we've measured the vectorized code running ~50% slower on our target architecture.
I suspect the main culprits are the costs assigned to the loads, stores, and arithmetic operations:
```
MainOp: store i16 %conv40, ptr %d, align 2, !tbaa !10
...
SLP: VectorCost = 1
SLP: ScalarCost = 16
...
...
MainOp: %add39 = add nsw i32 %add17, %add6
AltOp: %sub50 = sub nsw i32 %add6, %add17
...
SLP: ReuseShuffleCost = 0
SLP: VectorCost = 5
SLP: ScalarCost = 16
...
...
State: Vectorize
MainOp: %add17 = add nsw i32 %conv16, %conv11
AltOp: %sub38 = sub nsw i32 %conv11, %conv16
...
SLP: VectorCost = 9
SLP: ScalarCost = 16
...
...
State: Vectorize
MainOp: %0 = load i16, ptr %s, align 2, !tbaa !10
...
SLP: VectorCost = 1
SLP: ScalarCost = 16
```
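As a rough sanity check on the excerpt above, the per-node deltas can be tallied by hand. This is only a sketch: it assumes each tree node contributes `VectorCost - ScalarCost` to the reported total, and it covers only the four nodes quoted here (the elided ones are not reconstructed). The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one node quoted in the debug excerpt. */
struct node_cost {
    const char *name;
    int vector_cost;
    int scalar_cost;
};

/* Only the four nodes shown in the report; elided nodes are omitted. */
static const struct node_cost shown_nodes[] = {
    {"store i16",               1, 16},
    {"add/sub (%add39/%sub50)", 5, 16},
    {"add/sub (%add17/%sub38)", 9, 16},
    {"load i16",                1, 16},
};

/* Sum of (VectorCost - ScalarCost) over the quoted nodes.
 * Assumption: this difference is each node's contribution to the total. */
int shown_delta(void)
{
    int total = 0;
    for (size_t i = 0; i < sizeof shown_nodes / sizeof shown_nodes[0]; i++)
        total += shown_nodes[i].vector_cost - shown_nodes[i].scalar_cost;
    return total;
}

/* What the nodes not quoted in the excerpt would have to add up to,
 * given a reported total such as -32. */
int implied_remaining_delta(int reported_total)
{
    return reported_total - shown_delta();
}

/* Aborts if the hand-computed tally does not hold. */
void check_tally(void)
{
    assert(shown_delta() == -48);
}
```

Under that assumption the quoted nodes alone account for -48, so the nodes not shown would have to net out to +16 to reach the reported -32. If the loads, stores, and alternating add/sub nodes are under-costed as suspected, even a modest correction per node could flip the sign of the total.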
To reproduce:
`./bin/clang -march=rv64gv_zvl512b -O3 t2.c -mllvm -debug-only=SLP -S`
```
#include <stdint.h>
void foo(int16_t * restrict s, int16_t * restrict d)
{
    for(int i = 0; i < 4; i++) {
        int s03 = s[i*4+0] + s[i*4+3];
        int s12 = s[i*4+1] + s[i*4+2];
        int d03 = s[i*4+0] - s[i*4+3];
        int d12 = s[i*4+1] - s[i*4+2];
        d[0*4+i] = s03 + s12;
        d[1*4+i] = 2*d03 + d12;
        d[2*4+i] = s03 - s12;
        d[3*4+i] = d03 - 2*d12;
    }
}
```
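When comparing a vectorized build against a scalar one, it may help to confirm that both produce the same output before timing them. The sketch below repeats `foo` from the reproducer and adds a small check against values worked out by hand for the input `s[k] = k`. The helpers `foo_mismatches` and `check_foo` are hypothetical names, not part of x264 or the original report.

```c
#include <assert.h>
#include <stdint.h>

/* Kernel copied from the reproducer. */
void foo(int16_t *restrict s, int16_t *restrict d)
{
    for (int i = 0; i < 4; i++) {
        int s03 = s[i*4+0] + s[i*4+3];
        int s12 = s[i*4+1] + s[i*4+2];
        int d03 = s[i*4+0] - s[i*4+3];
        int d12 = s[i*4+1] - s[i*4+2];
        d[0*4+i] = s03 + s12;
        d[1*4+i] = 2*d03 + d12;
        d[2*4+i] = s03 - s12;
        d[3*4+i] = d03 - 2*d12;
    }
}

/* Hypothetical helper: runs foo on s[k] = k and returns how many of the
 * 16 outputs differ from the hand-computed expected values. */
int foo_mismatches(void)
{
    static const int16_t expected[16] = {
         6, 22, 38, 54,   /* s03 + s12    */
        -7, -7, -7, -7,   /* 2*d03 + d12  */
         0,  0,  0,  0,   /* s03 - s12    */
        -1, -1, -1, -1,   /* d03 - 2*d12  */
    };
    int16_t s[16], d[16];
    int bad = 0;

    for (int k = 0; k < 16; k++)
        s[k] = (int16_t)k;
    foo(s, d);
    for (int k = 0; k < 16; k++)
        bad += (d[k] != expected[k]);
    return bad;
}

/* Aborts if any output element is wrong. */
void check_foo(void)
{
    assert(foo_mismatches() == 0);
}
```

Calling `check_foo()` from a small `main` in two builds, one as in the command above and one with `-fno-slp-vectorize` added, gives a like-for-like pair for timing. The compiler may inline and constant-fold `foo` inside this helper, so for measurements it is probably safer to keep `foo` in its own translation unit.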
_______________________________________________
llvm-bugs mailing list
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs