https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113517

            Bug ID: 113517
           Summary: vector SLP cost model should be improved
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Take:
```
void f(short *a, signed int *b)
{
        int sum = 0;
        sum += a[0];
        sum += a[1];
        sum += a[2];
        sum += a[3];
        *b = sum;
}
```

Right now by default this produces:
```
        ldrsh   w3, [x0]
        ldrsh   w4, [x0, 2]
        ldrsh   w2, [x0, 4]
        add     w3, w3, w4
        ldrsh   w0, [x0, 6]
        add     w2, w2, w3
        add     w0, w0, w2
        str     w0, [x1]
```

But with the cost model disabled, we get:
```
        ldr     d31, [x0]
        saddlv  s31, v31.4h
        str     s31, [x1]
        ret
```

(Note that this is better code generation than what LLVM produces, since LLVM
uses sshll/addv.)

For most cores, one load into a float (vector) register, one vector
instruction, and one vector store are better than 4 scalar loads, 3 scalar
adds, and one scalar store.  This is true even on ThunderX 1 and Cortex-A57.
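
For reference, here is a rough sketch of what the vectorized form computes,
written by hand with ACLE intrinsics. The function names sum4_scalar and
sum4_vector are made up for illustration and are not part of the testcase; the
non-aarch64 fallback is only there so the file builds elsewhere. I would expect
the intrinsic version to map to the ldr/saddlv pair above, but that is an
assumption, not something checked here.
```
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar form: same arithmetic as f() in the testcase, returning the sum. */
int32_t sum4_scalar(const int16_t *a)
{
        int32_t sum = 0;
        sum += a[0];
        sum += a[1];
        sum += a[2];
        sum += a[3];
        return sum;
}

/* Hand-vectorized form (hypothetical helper, for illustration only). */
int32_t sum4_vector(const int16_t *a)
{
#if defined(__aarch64__)
        /* One 64-bit load of four 16-bit lanes. */
        int16x4_t v = vld1_s16(a);
        /* Widening add across lanes: sign-extends each lane and sums to 32 bits. */
        return vaddlv_s16(v);
#else
        /* Fallback so the sketch compiles on other targets. */
        return sum4_scalar(a);
#endif
}
```

Both functions should return the same value for any input, since the widening
reduction cannot overflow with four 16-bit lanes.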
