[Bug middle-end/108376] TSVC s1279 runs 40% faster with aocc than gcc at zen4

2024-02-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108376

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |WONTFIX
             Status|NEW                         |RESOLVED

--- Comment #4 from Richard Biener  ---
So I'd say INVALID or WONTFIX.

[Bug middle-end/108376] TSVC s1279 runs 40% faster with aocc than gcc at zen4

2023-01-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108376

--- Comment #3 from Jan Hubicka  ---
If I make the arrays random then GCC code is indeed faster:
#include 
#include 

typedef float real_t;
#define iterations 100
#define LEN_1D 32000
#define LEN_2D 256
real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D];
real_t aa[LEN_2D][LEN_2D];
real_t bb[LEN_2D][LEN_2D];
real_t cc[LEN_2D][LEN_2D];
real_t qq;
int
main(void)
{
//reductions
//if to max reduction

real_t x;
for (int i = 0; i < LEN_1D; i++)
{
   a[i]=(rand() %5) - 3;
   b[i]=(rand() %6) - 3;
}
for (int nl = 0; nl < iterations; nl++) {
for (int i = 0; i < LEN_1D; i++) {
if (a[i] < (real_t)0.) {
if (b[i] > a[i]) {
c[i] += d[i] * e[i];
}
}
}
//dummy(a, b, c, d, e, aa, bb, cc, 0.);
}

return x;
}

jh@alberti:~/tsvc/bin> ~/aocc-compiler-4.0.0/bin/clang -Ofast s1279.c -march=native
s1279.c:23:14: warning: implicit declaration of function 'rand' is invalid in C99 [-Wimplicit-function-declaration]
   a[i]=(rand() %5) - 3;
         ^
1 warning generated.
jh@alberti:~/tsvc/bin> time ./a.out

real    0m5.638s
user    0m5.636s
sys     0m0.000s
jh@alberti:~/tsvc/bin> ~/trunk-install/bin/gcc -Ofast s1279.c -march=native
s1279.c: In function 'main':
s1279.c:23:14: warning: implicit declaration of function 'rand' [-Wimplicit-function-declaration]
   23 |    a[i]=(rand() %5) - 3;
      |          ^~~~
jh@alberti:~/tsvc/bin> time ./a.out

real    0m2.791s
user    0m2.790s
sys     0m0.000s


Sorry for the wrong code; just for reference, the loop compiles as:
.L4:
        xorl    %eax, %eax
        .p2align 4
        .p2align 3
.L3:
        vmovaps a(%rax), %ymm2
        vmovaps b(%rax), %ymm3
        vmovaps c(%rax), %ymm6
        addq    $32, %rax
        vmovaps c-32(%rax), %ymm0
        vmovaps e-32(%rax), %ymm4
        vcmpps  $1, %ymm1, %ymm2, %k1
        vcmpps  $14, %ymm2, %ymm3, %k1{%k1}
        vfmadd231ps d-32(%rax), %ymm4, %ymm0{%k1}
        vfmadd231ps d-32(%rax), %ymm4, %ymm0
        vblendmps   %ymm0, %ymm6, %ymm0{%k1}
        vmovaps %ymm0, c-32(%rax)
        cmpq    $128000, %rax
        jne     .L3
        decl    %edx
        jne     .L4

[Bug middle-end/108376] TSVC s1279 runs 40% faster with aocc than gcc at zen4

2023-01-12 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108376

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-01-12
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #2 from Richard Biener  ---
As far as I can see a[] is all zeros.  AOCC basically preserves the
loop control flow, skipping the guarded code when if (a[i] < 0.) is
false for all elements processed in the iteration, likewise for
if (b[i] > a[i]), but GCC if-converts this all down to combined masking
of the guarded code.
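
To make the two strategies concrete, here is a scalar C sketch of the two loop shapes described above. It is only an illustration and not either compiler's actual output: the function names, the VF constant (8 floats, assuming 256-bit vectors) and the chunking scheme are assumptions made for this example.

```c
#include <stddef.h>

#define VF 8  /* assumed vector width: 8 floats per 256-bit register */

/* If-converted shape: no branch in the loop body.  The product is
   always computed and a per-element mask selects old or new c[i]. */
void s1279_if_converted(size_t n, const float *a, const float *b,
                        float *c, const float *d, const float *e)
{
    for (size_t i = 0; i < n; i++) {
        int mask = (a[i] < 0.0f) & (b[i] > a[i]);
        float updated = c[i] + d[i] * e[i];
        c[i] = mask ? updated : c[i];
    }
}

/* Control-flow-preserving shape: evaluate the conditions for a whole
   chunk first and skip the guarded work when no element needs it. */
void s1279_branchy(size_t n, const float *a, const float *b,
                   float *c, const float *d, const float *e)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF) {
        int any = 0;
        for (size_t j = 0; j < VF; j++)
            any |= (a[i + j] < 0.0f) & (b[i + j] > a[i + j]);
        if (!any)
            continue;  /* whole chunk is a no-op: c, d, e untouched */
        for (size_t j = 0; j < VF; j++)
            if (a[i + j] < 0.0f && b[i + j] > a[i + j])
                c[i + j] += d[i + j] * e[i + j];
    }
    /* Scalar tail for the elements that do not fill a chunk. */
    for (; i < n; i++)
        if (a[i] < 0.0f && b[i] > a[i])
            c[i] += d[i] * e[i];
}
```

With a[] all zeros every chunk takes the early continue in the second shape, so it never touches c, d or e; with mixed inputs the extra test and branch become overhead on top of the same arithmetic.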

I think the testcase as-is is too artificial to be relevant.  GCC
has code to do such a thing for masked stores, but in this case
we are not using masked stores or masked loads:

.L3:
        vmovaps a(%rax), %ymm3
        vmovaps b(%rax), %ymm4
        vmovaps c(%rax), %ymm7
        addq    $32, %rax
        vmovaps c-32(%rax), %ymm0
        vmovaps e-32(%rax), %ymm5
        vcmpps  $1, %ymm1, %ymm3, %k1
        vcmpps  $14, %ymm3, %ymm4, %k1{%k1}
        vfmadd231ps d-32(%rax), %ymm5, %ymm0{%k1}
        vfmadd231ps d-32(%rax), %ymm5, %ymm0
        vblendmps   %ymm0, %ymm7, %ymm0{%k1}
        vmovaps %ymm0, c-32(%rax)
        cmpq    $128000, %rax
        jne     .L3

I suspect if you do a less optimal initialization of a/b then the AOCC
code will be slower.

Note GCC applies unroll-and-jam to the loop (the outer iteration is
visibly redundant, so we end up doing half the work AOCC does ;))

Confirmed for us not vectorizing control flow but if-converting.

[Bug middle-end/108376] TSVC s1279 runs 40% faster with aocc than gcc at zen4

2023-01-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108376

Alexander Monakov  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
I think your GCC dumps are for the wrong loop.