[Bug target/113827] MrBayes benchmark redundant load

2024-02-08 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827

--- Comment #3 from Andrew Pinski ---
(In reply to Robin Dapp from comment #0)
> A hot block in the MrBayes benchmark (as used in the Phoronix testsuite) has
> a redundant scalar load when vectorized.
> 
> Minimal example, compiled with -march=rv64gcv -O3
> 
> int foo (float **a, float f, int n)
> {
>   for (int i = 0; i < n; i++)
>     {
>       a[i][0] /= f;
>       a[i][1] /= f;
>       a[i][2] /= f;
>       a[i][3] /= f;
>       a[i] += 4;
>     }
> }

GCC for aarch64 with the above testcase:
```
.L3:
        ldr     x2, [x0]
        mov     x1, x2
        ldr     q31, [x2]
        fdiv    v31.4s, v31.4s, v0.4s
        str     q31, [x1], 16
        str     x1, [x0], 8     // HERE
        cmp     x3, x0
        bne     .L3
```

There is a store of x1 there.
I really think you messed up reducing the testcase.

[Bug target/113827] MrBayes benchmark redundant load

2024-02-08 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2024-02-09

--- Comment #2 from Andrew Pinski ---
> a redundant scalar load

I don't see any redundant load in that loop.


```
.L3:
        movq    (%rdi), %rax        ;; load a[i] from (%rdi)
        vmovups (%rax), %xmm1       ;; load rax[0-3] into vector
        vdivps  %xmm0, %xmm1, %xmm1 ;; divide
        vmovups %xmm1, (%rax)       ;; store result back into rax[0-3]
        addq    $16, %rax           ;; add 4*4 to rax
        movq    %rax, (%rdi)        ;; store rax back into a[i] at (%rdi)
        addq    $8, %rdi            ;; add 8 to rdi
        cmpq    %rdi, %rdx
        jne     .L3                 ;; compare and loop back
```

That is, a[i] is different in each iteration.

Maybe you reduced this code too much?
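
To make the point concrete, here is a rough C rendering of what the x86 loop above does per iteration, next to the reduced testcase. It is only a sketch: foo_reduced and foo_as_compiled are hypothetical names, not from the report, and the testcase is changed to return void because the original is declared int but returns nothing. It also assumes the compiler may rely on C's type-based aliasing rules, so that the float stores cannot modify a[i] itself.

```c
#include <stddef.h>

/* The reduced testcase from comment #0 (return type changed to void). */
void foo_reduced(float **a, float f, int n)
{
    for (int i = 0; i < n; i++) {
        a[i][0] /= f;
        a[i][1] /= f;
        a[i][2] /= f;
        a[i][3] /= f;
        a[i] += 4;
    }
}

/* Hypothetical hand-written equivalent of the vectorized loop:
   one load of a[i], four divides, one store of the advanced pointer. */
void foo_as_compiled(float **a, float f, int n)
{
    for (int i = 0; i < n; i++) {
        float *row = a[i];          /* movq (%rdi), %rax: the only load of a[i] */
        for (int k = 0; k < 4; k++) /* vmovups / vdivps / vmovups */
            row[k] /= f;
        a[i] = row + 4;             /* addq $16, %rax; movq %rax, (%rdi) */
    }
}
```

Each a[i] is read once and written once per iteration, and every iteration touches a different a[i], so nothing in this form is loaded twice.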

[Bug target/113827] MrBayes benchmark redundant load on riscv

2024-02-08 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827

--- Comment #1 from Robin Dapp ---
x86 (-march=native -O3 on an i7 12th gen) looks pretty similar:

.L3:
        movq    (%rdi), %rax
        vmovups (%rax), %xmm1
        vdivps  %xmm0, %xmm1, %xmm1
        vmovups %xmm1, (%rax)
        addq    $16, %rax
        movq    %rax, (%rdi)
        addq    $8, %rdi
        cmpq    %rdi, %rdx
        jne     .L3

So probably not target specific.  Costing?