[Bug middle-end/108376] TSVC s1279 runs 40% faster with aocc than gcc at zen4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108376

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |WONTFIX
             Status|NEW                         |RESOLVED

--- Comment #4 from Richard Biener ---
So I'd say INVALID or WONTFIX.
[Bug middle-end/108376] TSVC s1279 runs 40% faster with aocc than gcc at zen4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108376

--- Comment #3 from Jan Hubicka ---
If I make the arrays random then GCC code is indeed faster:

#include
#include
typedef float real_t;

#define iterations 100
#define LEN_1D 32000
#define LEN_2D 256

real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D];
real_t aa[LEN_2D][LEN_2D];
real_t bb[LEN_2D][LEN_2D];
real_t cc[LEN_2D][LEN_2D];
real_t qq;

int main(void)
{
//    reductions
//    if to max reduction

    real_t x;
    for (int i = 0; i < LEN_1D; i++)
    {
        a[i]=(rand() %5) - 3;
        b[i]=(rand() %6) - 3;
    }
    for (int nl = 0; nl < iterations; nl++) {
        for (int i = 0; i < LEN_1D; i++) {
            if (a[i] < (real_t)0.) {
                if (b[i] > a[i]) {
                    c[i] += d[i] * e[i];
                }
            }
        }
        //dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }

    return x;
}

jh@alberti:~/tsvc/bin> ~/aocc-compiler-4.0.0/bin/clang -Ofast s1279.c -march=native
s1279.c:23:14: warning: implicit declaration of function 'rand' is invalid in C99 [-Wimplicit-function-declaration]
        a[i]=(rand() %5) - 3;
             ^
1 warning generated.
jh@alberti:~/tsvc/bin> time ./a.out

real    0m5.638s
user    0m5.636s
sys     0m0.000s
jh@alberti:~/tsvc/bin> ~/trunk-install/bin/gcc -Ofast s1279.c -march=native
s1279.c: In function 'main':
s1279.c:23:14: warning: implicit declaration of function 'rand' [-Wimplicit-function-declaration]
   23 |         a[i]=(rand() %5) - 3;
      |              ^~~~
jh@alberti:~/tsvc/bin> time ./a.out

real    0m2.791s
user    0m2.790s
sys     0m0.000s

sorry for wrong code, just for reference the loop compiles as:

.L4:
        xorl    %eax, %eax
        .p2align 4
        .p2align 3
.L3:
        vmovaps a(%rax), %ymm2
        vmovaps b(%rax), %ymm3
        vmovaps c(%rax), %ymm6
        addq    $32, %rax
        vmovaps c-32(%rax), %ymm0
        vmovaps e-32(%rax), %ymm4
        vcmpps  $1, %ymm1, %ymm2, %k1
        vcmpps  $14, %ymm2, %ymm3, %k1{%k1}
        vfmadd231ps     d-32(%rax), %ymm4, %ymm0{%k1}
        vfmadd231ps     d-32(%rax), %ymm4, %ymm0
        vblendmps       %ymm0, %ymm6, %ymm0{%k1}
        vmovaps %ymm0, c-32(%rax)
        cmpq    $128000, %rax
        jne     .L3
        decl    %edx
        jne     .L4
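A possible way to see why the random initialization changes the picture is to count how many 8-float blocks contain at least one element that passes both conditions. This is only a sketch: the helper names fill_random and count_blocks_with_work and the LANES constant are made up for illustration and are not part of TSVC or of either compiler. Assuming rand() is roughly uniform, about 40% of the elements pass both tests, so a block of 8 has no work at all only about 1.7% of the time (0.6^8), which would leave little for a skip-the-block strategy to save.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LANES 8 /* floats per 256-bit vector; an assumption matching the ymm code above */

/* Same initialization as the testcase above: a in -3..1, b in -3..2. */
void fill_random(float *a, float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = (float)((rand() % 5) - 3);
        b[i] = (float)((rand() % 6) - 3);
    }
}

/* Count the LANES-wide blocks in which at least one element satisfies
   both a[i] < 0 and b[i] > a[i], i.e. blocks that cannot be skipped. */
size_t count_blocks_with_work(const float *a, const float *b, size_t n)
{
    assert(n % LANES == 0);
    size_t busy = 0;
    for (size_t i = 0; i < n; i += LANES) {
        int any = 0;
        for (size_t l = 0; l < LANES; l++)
            any |= (a[i + l] < 0.0f) && (b[i + l] > a[i + l]);
        busy += (size_t)any;
    }
    return busy;
}
```

With zero-filled arrays, as in the original benchmark, the count is 0, so every block could be skipped. With the random fill nearly every block has work.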
[Bug middle-end/108376] TSVC s1279 runs 40% faster with aocc than gcc at zen4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108376

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-01-12
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #2 from Richard Biener ---
As far as I can see a[] is all zeros. AOCC basically preserves the loop
control flow when if (a[i] < 0.) for all elements processed in the iteration,
likewise for if (b[i] > a[i]) but GCC if-converts this all down to combined
masking of the guarded code.

I think the testcase as-is is too artificial to be relevant. GCC has code
to do such thing to convert masked stores, but in this case we are not
using masked stores or masked loads:

.L3:
        vmovaps a(%rax), %ymm3
        vmovaps b(%rax), %ymm4
        vmovaps c(%rax), %ymm7
        addq    $32, %rax
        vmovaps c-32(%rax), %ymm0
        vmovaps e-32(%rax), %ymm5
        vcmpps  $1, %ymm1, %ymm3, %k1
        vcmpps  $14, %ymm3, %ymm4, %k1{%k1}
        vfmadd231ps     d-32(%rax), %ymm5, %ymm0{%k1}
        vfmadd231ps     d-32(%rax), %ymm5, %ymm0
        vblendmps       %ymm0, %ymm7, %ymm0{%k1}
        vmovaps %ymm0, c-32(%rax)
        cmpq    $128000, %rax
        jne     .L3

I suspect if you do a less optimal initialization of a/b then the AOCC code
will be slower. Note GCC applies unroll-and-jam to the loop (the outer
iteration is visibly redundant, so we are eventually doing half of the work
as AOCC ;))

Confirmed for us not vectorizing control flow but if-converting.
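The difference between the two strategies described in this comment can be sketched in scalar C, one 8-float block standing in for one ymm vector. This is an illustration under assumptions, not what either compiler actually emits: the function names, the LANES constant and the returned block counter are all hypothetical. The first function always computes the multiply-add and selects per lane, as in the masked code above. The second tests the whole block first and skips it when no lane is active, which is one reading of "preserves the loop control flow".

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8 /* floats per 256-bit vector; an assumption matching the ymm code above */

/* If-converted form: every block does the multiply-add, and a per-lane
   mask picks the old or the updated value (the role of vblendmps).
   Returns the number of blocks that did arithmetic. */
size_t update_if_converted(const float *a, const float *b, float *c,
                           const float *d, const float *e, size_t n)
{
    assert(n % LANES == 0);
    size_t blocks_computed = 0;
    for (size_t i = 0; i < n; i += LANES) {
        for (size_t l = 0; l < LANES; l++) {
            int mask = (a[i + l] < 0.0f) & (b[i + l] > a[i + l]);
            float updated = c[i + l] + d[i + l] * e[i + l];
            c[i + l] = mask ? updated : c[i + l];
        }
        blocks_computed++;
    }
    return blocks_computed;
}

/* Control-flow-preserving form: check the block first and skip it
   entirely when no lane satisfies both conditions.
   Returns the number of blocks that did arithmetic. */
size_t update_branchy(const float *a, const float *b, float *c,
                      const float *d, const float *e, size_t n)
{
    assert(n % LANES == 0);
    size_t blocks_computed = 0;
    for (size_t i = 0; i < n; i += LANES) {
        int any = 0;
        for (size_t l = 0; l < LANES; l++)
            any |= (a[i + l] < 0.0f) & (b[i + l] > a[i + l]);
        if (!any)
            continue; /* nothing to update in this block */
        for (size_t l = 0; l < LANES; l++)
            if (a[i + l] < 0.0f && b[i + l] > a[i + l])
                c[i + l] += d[i + l] * e[i + l];
        blocks_computed++;
    }
    return blocks_computed;
}
```

With a[] all zeros the branchy form computes no blocks while the if-converted form computes all of them, which is consistent with the benchmark favouring AOCC only for that input. Both forms leave c[] with the same contents.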
[Bug middle-end/108376] TSVC s1279 runs 40% faster with aocc than gcc at zen4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108376

Alexander Monakov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov ---
I think your GCC dumps are for the wrong loop.