Issue 176218
Summary [RISCV] Costing leads to unprofitable vectorization
Labels backend:RISC-V
Assignees bababuck
Reporter bababuck
    The code below is extracted from x264. The SLP vectorizer vectorizes it and predicts a large win (`SLP: Found cost = -32 for VF=16`), but the vectorized code runs ~50% slower on our target architecture.

I suspect the main culprits are the costs assigned to the loads, stores, and arithmetic operations:
```
MainOp:   store i16 %conv40, ptr %d, align 2, !tbaa !10
...
SLP:     VectorCost = 1
SLP: ScalarCost = 16
...

...
MainOp:   %add39 = add nsw i32 %add17, %add6
AltOp:   %sub50 = sub nsw i32 %add6, %add17
...
SLP: ReuseShuffleCost = 0
SLP:     VectorCost = 5
SLP:     ScalarCost = 16
...

...
State: Vectorize
MainOp:   %add17 = add nsw i32 %conv16, %conv11
AltOp:   %sub38 = sub nsw i32 %conv11, %conv16
...
SLP: VectorCost = 9
SLP:     ScalarCost = 16
...

...
State: Vectorize
MainOp:   %0 = load i16, ptr %s, align 2, !tbaa !10
...
SLP: VectorCost = 1
SLP:     ScalarCost = 16
```
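
To see how much of the predicted gain comes from the nodes quoted above, the sketch below tallies them, assuming each tree node contributes `VectorCost - ScalarCost` to the total. The struct and function names are hypothetical, the node labels are shorthand, and the elided nodes are not included.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one SLP tree node; the costs are copied from
 * the debug output quoted in this report. */
struct node_cost {
    const char *what;
    int vector_cost;
    int scalar_cost;
};

static const struct node_cost quoted_nodes[] = {
    { "store i16",        1, 16 },
    { "add/sub (%add39)", 5, 16 },
    { "add/sub (%add17)", 9, 16 },
    { "load i16",         1, 16 },
};

/* Assumed per-node contribution: vector cost minus scalar cost. */
static int node_delta(const struct node_cost *n)
{
    assert(n != NULL);
    return n->vector_cost - n->scalar_cost;
}

/* Sum of the per-node contributions over `count` nodes. */
static int total_delta(const struct node_cost *nodes, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++)
        total += node_delta(&nodes[i]);
    return total;
}
```

Under that assumption the four quoted nodes sum to -48. Since the reported total is -32, the elided nodes and any extra overhead would net out to +16, so the load, store, and add/sub nodes account for the whole predicted gain.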

`./bin/clang -march=rv64gv_zvl512b -O3 t2.c -mllvm -debug-only=SLP -S`
```
#include <stdint.h>

void foo(int16_t * restrict s, int16_t * restrict d)
{
    for (int i = 0; i < 4; i++) {
        int s03 = s[i*4+0] + s[i*4+3];
        int s12 = s[i*4+1] + s[i*4+2];
        int d03 = s[i*4+0] - s[i*4+3];
        int d12 = s[i*4+1] - s[i*4+2];

        d[0*4+i] =   s03 +   s12;
        d[1*4+i] = 2*d03 +   d12;
        d[2*4+i] =   s03 -   s12;
        d[3*4+i] =   d03 - 2*d12;
    }
}
```
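
To compare a scalar and an SLP-vectorized build on the target, a small driver along these lines may help. This is only a sketch: `kernel_fn`, `foo_ref`, and `run_kernel` are hypothetical names, `foo_ref` is a copy of the reproducer's body so that the file is self-contained, and `clock()` timing is coarse. For real measurements the kernel should probably live in its own file (such as `t2.c`) so that it is not inlined and constant-folded into the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef void (*kernel_fn)(int16_t *restrict s, int16_t *restrict d);

/* Copy of foo() from the reproducer, under a hypothetical name. */
void foo_ref(int16_t *restrict s, int16_t *restrict d)
{
    for (int i = 0; i < 4; i++) {
        int s03 = s[i*4+0] + s[i*4+3];
        int s12 = s[i*4+1] + s[i*4+2];
        int d03 = s[i*4+0] - s[i*4+3];
        int d12 = s[i*4+1] - s[i*4+2];

        d[0*4+i] =   s03 +   s12;
        d[1*4+i] = 2*d03 +   d12;
        d[2*4+i] =   s03 -   s12;
        d[3*4+i] =   d03 - 2*d12;
    }
}

/* Runs fn `iters` times on the input s[k] = k and returns the sum of all
 * outputs over all iterations, so the caller can check the result.
 * If `seconds` is non-NULL, it receives the elapsed CPU time. */
long run_kernel(kernel_fn fn, long iters, double *seconds)
{
    int16_t s[16], d[16];
    long checksum = 0;

    assert(fn != NULL && iters > 0);
    for (int k = 0; k < 16; k++)
        s[k] = (int16_t)k;

    clock_t start = clock();
    for (long n = 0; n < iters; n++) {
        fn(s, d);
        for (int k = 0; k < 16; k++)
            checksum += d[k];
    }
    if (seconds != NULL)
        *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return checksum;
}
```

Building the driver once with the command line above and once with `-fno-slp-vectorize` added should give the two variants to time; matching checksums confirm both builds compute the same result.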
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
