[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-11-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

liuhongt at gcc dot gnu.org changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #21 from liuhongt at gcc dot gnu.org ---
The main gap comes from OpenMP on hybrid machines.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-11-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #20 from Jan Hubicka  ---
On zen4 hardware I now get

GCC13 with -O3 -flto -march=native -fopenmp
2163
2161
2153

Average: 2159 Iterations Per Minute

clang 17 with -O3 -flto -march=native -fopenmp
2004
1988
1991

Average: 1994 Iterations Per Minute

trunk -O3 -flto -march=native -fopenmp
Operation: Resizing:
2126
2135
2123

Average: 2128 Iterations Per Minute

So no big changes here...
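
To put the three averages side by side, the relative differences can be computed directly from the runs quoted above. A small C sketch follows; the helper names mean3 and percent_faster are made up for illustration and are not part of any benchmark tooling.

```c
#include <stdio.h>

/* Mean of three benchmark runs (iterations per minute). */
double mean3(double a, double b, double c)
{
    return (a + b + c) / 3.0;
}

/* How much faster `x` is than `base`, in percent (negative if slower). */
double percent_faster(double x, double base)
{
    return (x / base - 1.0) * 100.0;
}

/* Prints the comparison for the zen4 runs quoted in comment #20. */
void print_zen4_summary(void)
{
    double gcc13   = mean3(2163, 2161, 2153);
    double clang17 = mean3(2004, 1988, 1991);
    double trunk   = mean3(2126, 2135, 2123);

    printf("GCC 13 vs clang 17: %+.1f%%\n", percent_faster(gcc13, clang17));
    printf("trunk  vs GCC 13:   %+.1f%%\n", percent_faster(trunk, gcc13));
}
```

On these numbers GCC 13 is roughly 8% ahead of clang 17, and trunk is within about 1.5% of GCC 13, which matches the "no big changes" reading.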

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-10-11 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #19 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:e1e127de18dbee47b88fa0ce74a1c7f4d658dc68

commit r14-4571-ge1e127de18dbee47b88fa0ce74a1c7f4d658dc68
Author: Zhang, Jun 
Date:   Fri Sep 22 23:56:37 2023 +0800

x86: set spincount 1 for x86 hybrid platform

By test, we find in hybrid platform spincount 1 is better.

Use '-march=native -Ofast -funroll-loops -flto',
results as follows:

spec2017 speed    RPL       ADL
657.xz_s          0.00%     0.50%
603.bwaves_s      10.90%    26.20%
607.cactuBSSN_s   5.50%     72.50%
619.lbm_s         2.40%     2.50%
621.wrf_s         -7.70%    2.40%
627.cam4_s        0.50%     0.70%
628.pop2_s        48.20%    153.00%
638.imagick_s     -0.10%    0.20%
644.nab_s         2.30%     1.40%
649.fotonik3d_s   8.00%     13.80%
654.roms_s        1.20%     1.10%
Geomean-int       0.00%     0.50%
Geomean-fp        6.30%     21.10%
Geomean-all       5.70%     19.10%

omp2012           RPL       ADL
350.md            -1.81%    -1.75%
351.bwaves        7.72%     12.50%
352.nab           14.63%    19.71%
357.bt331         -0.20%    1.77%
358.botsalgn      0.00%     0.00%
359.botsspar      0.00%     0.65%
360.ilbdc         0.00%     0.25%
362.fma3d         2.66%     -0.51%
363.swim          10.44%    0.00%
367.imagick       0.00%     0.12%
370.mgrid331      2.49%     25.56%
371.applu331      1.06%     4.22%
372.smithwa       0.74%     3.34%
376.kdtree        10.67%    16.03%
GEOMEAN           3.34%     5.53%
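
(Editorial aside, not part of the commit message.) The GEOMEAN row appears to be the geometric mean of the per-benchmark ratios, that is, of 1 + gain/100, converted back to a percentage. The C sketch below shows that calculation under this assumption; the function names are hypothetical, and the root is computed with Newton's method only so the example does not depend on libm.

```c
#include <stddef.h>

/* y raised to a non-negative integer power. */
static double pow_int(double y, unsigned n)
{
    double p = 1.0;
    while (n-- > 0)
        p *= y;
    return p;
}

/* n-th root of x > 0 (n >= 1) by Newton's method. */
static double nth_root(double x, unsigned n)
{
    double y = x > 1.0 ? x : 1.0;   /* start at or above the root */
    for (int it = 0; it < 200; it++) {
        double next = ((n - 1) * y + x / pow_int(y, n - 1)) / n;
        if (next == y)
            break;
        y = next;
    }
    return y;
}

/*
 * Geometric mean of percentage gains: each gain g% becomes the ratio
 * 1 + g/100, the ratios are multiplied, and the n-th root of the
 * product is converted back to a percentage. Requires n >= 1.
 */
double geomean_gain_percent(const double *gains, size_t n)
{
    double product = 1.0;
    for (size_t i = 0; i < n; i++)
        product *= 1.0 + gains[i] / 100.0;
    return (nth_root(product, (unsigned)n) - 1.0) * 100.0;
}
```

Feeding it the fourteen omp2012 RPL gains from the table should land close to the 3.34% reported in the GEOMEAN row.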

include/ChangeLog:

PR target/109812
* spincount.h: New file.

libgomp/ChangeLog:

* env.c (initialize_env): Use do_adjust_default_spincount.
* config/linux/x86/spincount.h: New file.
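
(Editorial aside.) For readers unfamiliar with the knob: a spin count bounds how often a waiting thread busy-polls before it stops spinning and blocks. The sketch below illustrates the general spin-then-block idea in portable C11. It is not libgomp's implementation, and spin_wait and its parameters are hypothetical names. In libgomp the limit can also be set at run time through the GOMP_SPINCOUNT environment variable.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Poll `flag` at most `spincount` times.
 * Returns true if the flag was seen set while spinning. Returns false
 * when the budget is exhausted, in which case the caller would fall
 * back to a blocking wait (for example a futex or condition variable).
 */
bool spin_wait(atomic_int *flag, unsigned long spincount)
{
    for (unsigned long i = 0; i < spincount; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire) != 0)
            return true;
    }
    return false;
}
```

With a budget of 1, the waiter checks the flag once and then goes straight to the blocking path.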

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-06-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #18 from Uroš Bizjak  ---
One interesting observation:

clang is able to do this:

  0.09 │ │  vmovddup     -0x8(%rdx,%rsi,1),%xmm3
  ...
  0.11 │ │  vfmadd231sd  %xmm2,%xmm3,%xmm1
  ...
  0.74 │ │  vfmadd231pd  %xmm2,%xmm3,%xmm0

It figures out that the duplicated V2DFmode value in %xmm3 can also be accessed
in the same register as a DFmode value.

OTOH, current gcc does:

    vmovsd      (%rsi,%rax,8), %xmm1
    ...
    vmovddup    %xmm1, %xmm4
    ...
    vfmadd231pd %xmm4, %xmm0, %xmm2
    ...
    vfmadd231sd %xmm1, %xmm0, %xmm3

The above code needs two registers.
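
The same idea can be written at the source level with GNU C vector extensions, which both GCC and Clang accept. This is only a sketch: the type and function names are invented for illustration, and whether a given compiler then keeps the weight in a single register is up to its register allocator.

```c
/* Two-lane double vector (GNU C vector extension). */
typedef double v2df __attribute__((vector_size(16)));

struct acc {
    v2df rg;    /* red and green accumulators, one per lane */
    double b;   /* blue accumulator */
};

/*
 * Accumulate one pixel scaled by weight w.
 * The weight is broadcast once into `ww`; lane 0 of that same vector
 * is then reused as the scalar operand for the blue channel, so the
 * source keeps no separate scalar copy of w.
 */
void accumulate(struct acc *a, unsigned char r, unsigned char g,
                unsigned char b, double w)
{
    v2df ww = { w, w };
    v2df px = { (double)r, (double)g };

    a->rg += px * ww;            /* packed multiply-add for r and g */
    a->b  += (double)b * ww[0];  /* scalar multiply-add for b */
}
```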

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-06-01 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #17 from Jan Hubicka  ---
I was also thinking of DCE. It looks like a plausible idea.  It may lead to a
surprise where you store the same undefined variable to two places and later
compare them for equality, but that is undefined anyway.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-06-01 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #16 from Jakub Jelinek  ---
Shouldn't we DCE something = x_N(D); stores when x is a VAR_DECL, at least
provided something can't trap?  I mean, the previous content is one of the
possible uninitialized values.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-06-01 Thread jamborm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #15 from Martin Jambor  ---
Oh, because I missed the -DOPACITY in the second command line.  The reason for
SRA creating the replacement is total scalarization :-/

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-31 Thread jamborm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #14 from Martin Jambor  ---
(In reply to Jan Hubicka from comment #13)
> The only difference between slp vectorization is:
> 
> -  # _68 = PHI <_5(3)>
> -  # _67 = PHI <_11(3)>
> -  # _66 = PHI <_16(3)>
> -  .r = _68;
> -  .g = _67;
> -  .b = _66;
> +  # _70 = PHI <_5(3)>
> +  # _69 = PHI <_11(3)>
> +  # _68 = PHI <_16(3)>
> +  .r = _70;
> +  .g = _69;
> +  .b = _68;
> +  .o = r$o_33(D);
> 
> so SRA invents r$o_33(D) even if that variable is undefined.

Is this the testcase from comment #10 ?  I don't see r$o in my dumps.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-31 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka  changed:

   What|Removed |Added

 CC||rguenther at suse dot de
   See Also||https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #13 from Jan Hubicka  ---
The only difference going into SLP vectorization is:

-  # _68 = PHI <_5(3)>
-  # _67 = PHI <_11(3)>
-  # _66 = PHI <_16(3)>
-  .r = _68;
-  .g = _67;
-  .b = _66;
+  # _70 = PHI <_5(3)>
+  # _69 = PHI <_11(3)>
+  # _68 = PHI <_16(3)>
+  .r = _70;
+  .g = _69;
+  .b = _68;
+  .o = r$o_33(D);

so SRA invents r$o_33(D) even if that variable is undefined.
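
Roughly speaking, total scalarization rewrites the aggregate r into one scalar per field and reassembles the struct at the return. The hand-written C below approximates that shape for the -DOPACITY layout. It is an illustration, not compiler output. The locals are zero-initialized here so the example is well defined, whereas in the testcase they correspond to uninitialized SSA names such as r$o_33(D). The function name sum_scalarized is made up; the types and N mirror the testcase from comment #10.

```c
#define N 1000

struct rgb { unsigned char r, g, b; } rgbs[N];
struct drgb { double r, g, b, o; };

struct drgb sum_scalarized(double w)
{
    /* One scalar per field, standing in for r$r, r$g, r$b and r$o. */
    double r_r = 0.0, r_g = 0.0, r_b = 0.0, r_o = 0.0;

    for (int i = 0; i < N; i++) {
        r_r += rgbs[i].r * w;
        r_g += rgbs[i].g * w;
        r_b += rgbs[i].b * w;
        /* r_o is never updated inside the loop. */
    }

    /* Four adjacent field stores: the group the SLP vectorizer sees. */
    struct drgb out;
    out.r = r_r;
    out.g = r_g;
    out.b = r_b;
    out.o = r_o;
    return out;
}
```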

SLP vectorizer then sees it as interleaving stores:

-t.c:19:16: note:   _1 = rgbs[i_35].r;
-t.c:19:16: note:   _7 = rgbs[i_35].g;
-t.c:19:16: note:   _12 = rgbs[i_35].b;
-t.c:19:16: note:   Detected interleaving store of size 3
-t.c:19:16: note:   .r = _68;
-t.c:19:16: note:   .g = _67;
-t.c:19:16: note:   .b = _66;
+t.c:19:16: note:   _1 = rgbs[i_37].r;
+t.c:19:16: note:   _7 = rgbs[i_37].g;
+t.c:19:16: note:   _12 = rgbs[i_37].b;
+t.c:19:16: note:   Detected interleaving store of size 4
+t.c:19:16: note:   .r = _70;
+t.c:19:16: note:   .g = _69;
+t.c:19:16: note:   .b = _68;
+t.c:19:16: note:   .o = r$o_33(D);

In the first case it first tries to vectorize a group of 3 doubles and fails:

-t.c:19:16: note: .r = _68;
-t.c:19:16: note: .g = _67;
-t.c:19:16: note: .b = _66;
-t.c:19:16: note:   starting SLP discovery for node 0x2cb4fe8
-t.c:19:16: note:   Build SLP for .r = _68;
-t.c:19:16: note:   get vectype for scalar type (group size 3): double
-t.c:19:16: note:   vectype: vector(2) double
-t.c:19:16: note:   nunits = 2
-t.c:19:16: missed:   Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note:   Build SLP for .g = _67;
-t.c:19:16: note:   get vectype for scalar type (group size 3): double
-t.c:19:16: note:   vectype: vector(2) double
-t.c:19:16: note:   nunits = 2
-t.c:19:16: missed:   Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note:   Build SLP for .b = _66;
-t.c:19:16: note:   get vectype for scalar type (group size 3): double
-t.c:19:16: note:   vectype: vector(2) double
-t.c:19:16: note:   nunits = 2
-t.c:19:16: missed:   Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note:   SLP discovery for node 0x2cb4fe8 failed

And later it tries to vectorize first 2 items:

-t.c:19:16: note:   Splitting SLP group at stmt 2
-t.c:19:16: note:   Split group into 2 and 1
-t.c:19:16: note:   Starting SLP discovery for
-t.c:19:16: note: .r = _68;
-t.c:19:16: note: .g = _67;
-t.c:19:16

... and after a lot more output it succeeds.

If the opacity field is present we start with a group of size 4:
+t.c:19:16: note: .r = _70;
+t.c:19:16: note: .g = _69;
+t.c:19:16: note: .b = _68;
+t.c:19:16: note: .o = r$o_33(D);


+t.c:19:16: note:   vect_is_simple_use: operand _70 = PHI <_5(3)>, type of def: internal
+t.c:19:16: note:   vect_is_simple_use: operand _69 = PHI <_11(3)>, type of def: internal
+t.c:19:16: note:   vect_is_simple_use: operand _68 = PHI <_16(3)>, type of def: internal
+t.c:19:16: note:   vect_is_simple_use: operand r$o_33(D), type of def: external
+t.c:19:16: missed:   treating operand as external
+t.c:19:16: note:   SLP discovery for node 0x2e80058 succeeded
+t.c:19:16: note:   SLP size 1 vs. limit 23.
+t.c:19:16: note:   Final SLP tree for instance 0x2def840:
+t.c:19:16: note:   node 0x2e80058 (max_nunits=4, refcnt=2) vector(4) double
+t.c:19:16: note:   op template: .r = _70;
+t.c:19:16: note:   stmt 0 .r = _70;
+t.c:19:16: note:   stmt 1 .g = _69;
+t.c:19:16: note:   stmt 2 .b = _68;
+t.c:19:16: note:   stmt 3 .o = r$o_33(D);
+t.c:19:16: note:   children 0x2e800d8
+t.c:19:16: note:   node (external) 0x2e800d8 (max_nunits=1, refcnt=1)
+t.c:19:16: note:   { _70, _69, _68, r$o_33(D) }

So it seems to succeed in vectorizing with 4 entries, but it does so only for
the single return statement:

   [local count: 1063004409]:
  # i_37 = PHI 
  # r$r_40 = PHI <_5(5), r$r_25(D)(2)>
  # r$g_42 = PHI <_11(5), r$g_26(D)(2)>
  # r$b_44 = PHI <_16(5), r$b_27(D)(2)>
  # ivtmp_67 = PHI 
  _1 = rgbs[i_37].r;
  _2 = (int) _1;
  _3 = (double) _2;
  _4 = _3 * w_21(D);
  _5 = _4 + r$r_40;
  _7 = rgbs[i_37].g;
  _8 = (int) _7;
  _9 = (double) _8;
  _10 = _9 * w_21(D);
  _11 = _10 + r$g_42;
  _12 = rgbs[i_37].b;
  _13 = (int) _12;
  _14 = (double) _13;
  _15 = _14 * w_21(D);
  _16 = _15 + r$b_44;
  i_22 = i_37 + 1;
  ivtmp_66 = ivtmp_67 - 1;
  if (ivtmp_66 != 0)
goto ; [99.00%]
  else
goto ; [1.00%]

   [local count: 1052374367]:
  goto ; [100.00%]

   [local count: 10737416]:
  # _70 = PHI <_5(3)>
  # _69 = PHI <_11(3)>
  # _68 = PHI <_16(3)>
  _65 = 

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-31 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #12 from Jan Hubicka  ---
> /home/sdp/jun/btl0/install/bin/ld: /tmp/ccnX75zI.ltrans0.ltrans.o: in
> function `main':
> :(.text.startup+0x1): undefined reference to `GMCommand'

I wonder if your linker plugin is configured correctly.  Can you try to build
with -flto -fuse-linker-plugin?

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-29 Thread zhangjungcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #11 from jun zhang  ---
Hello, Hubicka and Artem,
I am trying to reproduce this issue on Raptor Lake.
With -fopenmp -O3 -flto I hit the following link error,
but with -fopenmp -O3 and no -flto the build succeeds.
Could you help me?

libtool: link: /home/sdp/jun/gcc0/install/bin/gcc -fopenmp -O3 -flto
-march=native -Wall -o utilities/gm utilities/gm.o
-L/home/sdp/jun/omp/Ofast/pts_g_gomp/install/.phoronix-test-suite/installed-tests/pts/graphics-magick-2.1.0/gm_/lib
magick/.libs/libGraphicsMagick.a -lfreetype -ljbig -ltiff -ljpeg
-lXext -lSM -lICE -lX11 -llzma -lbz2 -lz -lzstd -lm -lpthread -fopenmp
/home/sdp/jun/btl0/install/bin/ld: /tmp/ccnX75zI.ltrans0.ltrans.o: in
function `main':
:(.text.startup+0x1): undefined reference to `GMCommand'
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:6411: utilities/gm] Error 1
make[1]: Leaving directory



[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-28 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #10 from Jan Hubicka  ---
This is a benchmarkable version of the simplified testcase:

jan@localhost:/tmp> cat t.c
#define N 1000
struct rgb {unsigned char r,g,b;} rgbs[N];
int *addr;
struct drgb {double r,g,b;
#ifdef OPACITY
 double o;
#endif
};

struct drgb sum(double w)
{
struct drgb r;
for (int i = 0; i < N; i++)
{
  r.r += rgbs[i].r * w;
  r.g += rgbs[i].g * w;
  r.b += rgbs[i].b * w;
}
return r;
}
jan@localhost:/tmp> cat q.c
struct drgb {double r,g,b;
#ifdef OPACITY
 double o;
#endif
};
struct drgb sum(double w);
int
main()
{
for (int i = 0; i < 1000; i++)
sum(i);
}


jan@localhost:/tmp> gcc t.c q.c -march=native -O3 -g ; objdump -d a.out | grep vfmadd231pd  ; perf stat ./a.out
  40119d:   c4 e2 d9 b8 d1  vfmadd231pd %xmm1,%xmm4,%xmm2

 Performance counter stats for './a.out':

         12,148.04 msec task-clock:u               #    1.000 CPUs utilized
                 0      context-switches:u         #    0.000 /sec
                 0      cpu-migrations:u           #    0.000 /sec
               736      page-faults:u              #   60.586 /sec
    50,018,421,148      cycles:u                   #    4.117 GHz
           220,502      stalled-cycles-frontend:u  #    0.00% frontend cycles idle
    39,950,154,369      stalled-cycles-backend:u   #   79.87% backend cycles idle
   120,000,191,713      instructions:u             #    2.40  insn per cycle
                                                   #    0.33  stalled cycles per insn
    10,000,048,918      branches:u                 #  823.182 M/sec
             7,959      branch-misses:u            #    0.00% of all branches

      12.149466078 seconds time elapsed

      12.149084000 seconds user
       0.0 seconds sys


jan@localhost:/tmp> gcc t.c q.c -march=native -O3 -g -DOPACITY ; objdump -d a.out | grep vfmadd231pd  ; perf stat ./a.out

 Performance counter stats for './a.out':

         12,141.11 msec task-clock:u               #    1.000 CPUs utilized
                 0      context-switches:u         #    0.000 /sec
                 0      cpu-migrations:u           #    0.000 /sec
               735      page-faults:u              #   60.538 /sec
    50,018,839,129      cycles:u                   #    4.120 GHz
           185,034      stalled-cycles-frontend:u  #    0.00% frontend cycles idle
    29,963,999,798      stalled-cycles-backend:u   #   59.91% backend cycles idle
   120,000,191,729      instructions:u             #    2.40  insn per cycle
                                                   #    0.25  stalled cycles per insn
    10,000,048,913      branches:u                 #  823.652 M/sec
             7,311      branch-misses:u            #    0.00% of all branches

      12.142252354 seconds time elapsed

      12.138237000 seconds user
       0.00400 seconds sys


So on zen2 hardware I get the same performance with both.  It may be
interesting to test it on Raptor Lake.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-28 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #9 from Jan Hubicka  ---
Oddly enough, a simplified version of the loop SLP-vectorizes for me:
struct rgb {unsigned char r,g,b;} *rgbs;
int *addr;
double *weights;
struct drgb {double r,g,b;};

struct drgb sum()
{
struct drgb r;
for (int i = 0; i < 10; i++)
{
  int j = addr[i];
  double w = weights[i];
  r.r += rgbs[j].r * w;
  r.g += rgbs[j].g * w;
  r.b += rgbs[j].b * w;
}
return r;
}
I get:
.L2:
    movslq      (%r9,%rdx,4), %rax
    vmovsd      (%r8,%rdx,8), %xmm1
    incq        %rdx
    leaq        (%rax,%rax,2), %rax
    addq        %rsi, %rax
    movzbl      (%rax), %ecx
    vmovddup    %xmm1, %xmm4
    vmovd       %ecx, %xmm0
    movzbl      1(%rax), %ecx
    movzbl      2(%rax), %eax
    vpinsrd     $1, %ecx, %xmm0, %xmm0
    vcvtdq2pd   %xmm0, %xmm0
    vfmadd231pd %xmm4, %xmm0, %xmm2
    vcvtsi2sdl  %eax, %xmm5, %xmm0
    vfmadd231sd %xmm1, %xmm0, %xmm3
    cmpq        $10, %rdx
    jne         .L2


I think the actual loop is:
  [local count: 44202554]:
  _106 = _262->pixel;
  _109 = *source_231(D).columns;

   [local count: 401841405]:
  # pixel$green_332 = PHI <_124(89), pixel$green_265(53)>
  # i_357 = PHI 
  # pixel$red_371 = PHI <_119(89), pixel$red_263(53)>
  # pixel$blue_377 = PHI <_129(89), pixel$blue_267(53)>
  i.51_102 = (long unsigned int) i_357;
  _103 = i.51_102 * 16;
  _104 = _262 + _103;
  _105 = _104->pixel;
  _107 = _105 - _106;
  _108 = (long unsigned int) _107;
  _110 = _108 * _109;
  _112 = _110 + _621;
  weight_297 = _104->weight;
  _113 = _112 * 4;
  _114 = _276 + _113;
  _115 = _114->red;
  _116 = (int) _115;
  _117 = (double) _116;
  _118 = _117 * weight_297;
  _119 = _118 + pixel$red_371;
  _120 = _114->green;
 _121 = (int) _120;
  _122 = (double) _121;
  _123 = _122 * weight_297;
  _124 = _123 + pixel$green_332;
  _125 = _114->blue;
  _126 = (int) _125;
  _127 = (double) _126;
  _128 = _127 * weight_297;
  _129 = _128 + pixel$blue_377;
  i_298 = i_357 + 1;
  if (n_195 > i_298)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 44202554]:
  # _607 = PHI <_124(54)>
  # _606 = PHI <_119(54)>
  # _605 = PHI <_129(54)>
  goto ; [100.00%]

   [local count: 357638851]:
  goto ; [100.00%]


and SLP vectorizer seems to claim:
../magick/resize.c:1284:52: note:   _125 = _114->blue;
../magick/resize.c:1284:52: note:   _120 = _114->green;
../magick/resize.c:1284:52: note:   _115 = _114->red;
../magick/resize.c:1284:52: missed:   not consecutive access weight_297 = _104->weight;
../magick/resize.c:1284:52: missed:   not consecutive access _105 = _104->pixel;
../magick/resize.c:1284:52: missed:   not consecutive access _134->red = iftmp.57_207;
../magick/resize.c:1284:52: missed:   not consecutive access _134->green = iftmp.60_208;
../magick/resize.c:1284:52: missed:   not consecutive access _134->blue = iftmp.63_209;
../magick/resize.c:1284:52: missed:   not consecutive access _134->opacity = 0;
../magick/resize.c:1284:52: missed:   not consecutive access _63 = *source_231(D).columns;
../magick/resize.c:1284:52: missed:   not consecutive access _60 = _262->pixel;

Not sure if that is related to the real testcase:


struct rgb {unsigned char r,g,b;} *rgbs;
int *addr;
double *weights;
struct drgb {double r,g,b,o;};

struct drgb sum()
{
struct drgb r;
for (int i = 0; i < 10; i++)
{
  int j = addr[i];
  double w = weights[i];
  r.r += rgbs[j].r * w;
  r.g += rgbs[j].g * w;
  r.b += rgbs[j].b * w;
}
return r;
}

This variant makes us miss the vectorization even though nothing uses drgb->o:

sum:
.LFB0:
    .cfi_startproc
    movq        %rdi, %r8
    movq        weights(%rip), %rsi
    movq        addr(%rip), %rdi
    vxorps      %xmm2, %xmm2, %xmm2
    movq        rgbs(%rip), %rcx
    xorl        %edx, %edx
    .p2align 4
    .p2align 3
.L2:
    movslq      (%rdi,%rdx,4), %rax
    vmovsd      (%rsi,%rdx,8), %xmm0
    incq        %rdx
    leaq        (%rax,%rax,2), %rax
    addq        %rcx, %rax
    movzbl      (%rax), %r9d
    vcvtsi2sdl  %r9d, %xmm2, %xmm1
    movzbl      1(%rax), %r9d
    movzbl      2(%rax), %eax
    vfmadd231sd %xmm0, %xmm1, %xmm3
    vcvtsi2sdl  %r9d, %xmm2, %xmm1
    vfmadd231sd %xmm0, %xmm1, %xmm5
    vcvtsi2sdl  %eax, %xmm2, %xmm1
    vfmadd231sd %xmm0, %xmm1, %xmm4
    cmpq        $10, %rdx
    jne         .L2
    vmovq       %xmm4, %xmm4
    vunpcklpd   %xmm5, %xmm3, %xmm0
    movq        %r8, %rax
    vinsertf128 $0x1, %xmm4, %ymm0, %ymm0
    vmovupd     %ymm0, (%r8)
    vzeroupper
    ret

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-28 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #8 from Jan Hubicka  ---
Created attachment 55178
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55178&action=edit
Preprocessed source of VerticalFilter and HorizontalFilter

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-28 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka  changed:

   What|Removed |Added

Summary|GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16|GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

--- Comment #7 from Jan Hubicka  ---
On zen3 hardware I get the following with GCC:

GraphicsMagick 1.3.38:
pts/graphics-magick-2.1.0 [Operation: Resizing]
Test 1 of 1
Estimated Trial Run Count:3 
Estimated Time To Completion: 4 Minutes [17:00 UTC] 
Started Run 1 @ 16:57:17
Started Run 2 @ 16:58:22
Started Run 3 @ 16:59:26

Operation: Resizing:
1390
1386
1383

Average: 1386 Iterations Per Minute
Deviation: 0.25%

clang16:

GraphicsMagick 1.3.38:
pts/graphics-magick-2.1.0 [Operation: Resizing]
Test 1 of 1
Estimated Trial Run Count:3
Estimated Time To Completion: 4 Minutes [16:54 UTC]
Started Run 1 @ 16:51:48
Started Run 2 @ 16:52:52
Started Run 3 @ 16:53:56

Operation: Resizing:
180
180
180

Average: 180 Iterations Per Minute
Deviation: 0.00%


GCC profile:
  52.07%  VerticalFilter._omp_fn.0  
  24.59%  HorizontalFilter._omp_fn.0
  11.78%  ReadCachePixels.isra.0

The Clang build does not seem to use OpenMP, so to get comparable runs I set
OMP_THREAD_LIMIT=1

With this I get:
GraphicsMagick 1.3.38:
pts/graphics-magick-2.1.0 [Operation: Resizing]
Test 1 of 1
Estimated Trial Run Count:3
Estimated Time To Completion: 4 Minutes [17:17 UTC]
Started Run 1 @ 17:14:14
Started Run 2 @ 17:15:18
Started Run 3 @ 17:16:22

Operation: Resizing:
184
186
186

Average: 185 Iterations Per Minute
Deviation: 0.62%

so the GCC build is still a bit faster. The inner loop of VerticalFilter is:
  0.00 │4a0:┌─→mov          0x8(%rdx),%rax
  1.33 │    │  vmovsd       (%rdx),%xmm1
  1.58 │    │  add          $0x10,%rdx
  0.00 │    │  sub          %r13,%rax
  4.77 │    │  imul         %r11,%rax
  1.01 │    │  add          %rcx,%rax
  0.04 │    │  movzbl       0x2(%r15,%rax,4),%r10d
  8.38 │    │  vcvtsi2sd    %r10d,%xmm2,%xmm0
  2.44 │    │  movzbl       0x1(%r15,%rax,4),%r10d
  1.55 │    │  movzbl       (%r15,%rax,4),%eax
  0.00 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm4
 13.91 │    │  vcvtsi2sd    %r10d,%xmm2,%xmm0
  1.86 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm5
 13.00 │    │  vcvtsi2sd    %eax,%xmm2,%xmm0
  2.02 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm3
 12.54 │    ├──cmp          %rdx,%rdi
  0.00 │    └──jne          4a0

HorizontalFilter:
  0.01 │520:┌─→mov          0x8(%r8),%rdx
  0.96 │    │  vmovsd       (%r8),%xmm1
  1.93 │    │  add          $0x10,%r8
  0.50 │    │  sub          %r15,%rdx
  4.02 │    │  add          %r11,%rdx
  2.26 │    │  movzbl       0x2(%r14,%rdx,4),%ebx
  0.09 │    │  vcvtsi2sd    %ebx,%xmm2,%xmm0
 10.10 │    │  movzbl       0x1(%r14,%rdx,4),%ebx
  0.92 │    │  movzbl       (%r14,%rdx,4),%edx
  1.84 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm4
  6.82 │    │  vcvtsi2sd    %ebx,%xmm2,%xmm0
 11.15 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm3
 13.81 │    │  vcvtsi2sd    %edx,%xmm2,%xmm0
  6.16 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm5
  8.61 │    ├──cmp          %rsi,%r8
  1.56 │    └──jne          520

ReadCachePixels:
       │2e0:┌─→mov          (%rbx,%rax,4),%edx
 83.03 │    │  mov          %edx,(%r12,%rax,4)
 12.34 │    │  inc          %rax
  0.02 │    ├──cmp          %rsi,%rax

With Clang I get:
  

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16

2023-05-28 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #6 from Jan Hubicka  ---
I installed the Phoronix Test Suite and uploaded the sample data it uses to
http://www.ucw.cz/~hubicka/sample-photo-6000x4000-1.zip

I doubt the exact images make much difference, especially for resizing.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16

2023-05-16 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Sam James  changed:

   What|Removed |Added

 CC||sjames at gcc dot gnu.org

--- Comment #5 from Sam James  ---
All of the benchmarks in that report are from
https://github.com/phoronix-test-suite/phoronix-test-suite.

For GraphicsMagick, the relevant benchmark seems to be:
https://github.com/phoronix-test-suite/phoronix-test-suite/blob/dea5e68ba7bc0eaa3646713a8e07100ffab929b5/ob-cache/test-profiles/pts/graphics-magick-1.6.1/test-definition.xml
(it might be a different version of the test, but note that '1.6.1' does NOT
equal the graphicsmagick version)

with a script at
https://github.com/phoronix-test-suite/phoronix-test-suite/blob/dea5e68ba7bc0eaa3646713a8e07100ffab929b5/ob-cache/test-profiles/pts/graphics-magick-1.6.1/install.sh#L25.

I think it runs individual commands like this (OMP_NUM_THREADS="$NUM_CPU_CORES"
./gm benchmark -duration 60 convert DSC_6782.png $@ null), so:
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert
DSC_6782.png -colorspace HWB null
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert
DSC_6782.png -blur 0x1.0 null
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert
DSC_6782.png -lat 10x10-5% null
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert
DSC_6782.png -resize 50% HWB null
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert
DSC_6782.png -sharpen 0x1.0 HWB null

with GraphicsMagick (gm) built with -fopenmp -O3 -march=native -flto -ltiff
-lfreetype -ljpeg -lXext -lSM -lICE -lX11 -lbz2 -lz -lzstd -lpthread. But I
can't actually find the test image DSC_6782.png, so...

I think we really need more information here before it's actionable. Perhaps
the reporter could reach out to Michael Larabel and ask him to comment here.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16

2023-05-16 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

JuzheZhong  changed:

   What|Removed |Added

 CC||juzhe.zhong at rivai dot ai

--- Comment #4 from JuzheZhong  ---
Thanks for reporting this. Unfortunately, a single report like this cannot
help us. Would you mind filing a bug with a simple piece of code that lets us
reproduce the issue and that matters for the benchmark?

Besides, I have read this report. I think this may be an x86 backend issue.
We (downstream RISC-V GCC) have tested various workloads, and it turns out GCC
is better than Clang in traditional CPU benchmarks, while Clang is much better
than GCC in AI benchmarks (for example MLPerf).

Starting with the benchmark you mentioned (GraphicsMagick), could you post the
most important piece of code belonging to this benchmark?


Thanks.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16

2023-05-12 Thread aros at gmx dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Artem S. Tashkinov  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Status|RESOLVED|NEW
 Resolution|INVALID |---

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16

2023-05-11 Thread aros at gmx dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Artem S. Tashkinov  changed:

   What|Removed |Added

 Resolution|--- |INVALID
 Status|WAITING |RESOLVED

--- Comment #3 from Artem S. Tashkinov  ---
According to the latest Phoronix test, which can easily be downloaded, run and
reproduced, GCC 13.1 loses to Clang by a wide margin; in certain workloads it's
~30% (!) slower, and I just wanted to alert its developers to a widening gap in
performance vs. Clang. I'm not a developer either, I'm simply no one.

My previous bug reports for performance regressions and deficiencies weren't
met with such ... words, so, I'm sorry, I'm not in the mood to prove anything.
I'll just go ahead and close it as useless, annoying and maybe even outright
invalid.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16

2023-05-11 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Andrew Pinski  changed:

   What|Removed |Added

   Last reconfirmed||2023-05-11
 Status|UNCONFIRMED |WAITING
 Target||x86_64-linux-gnu
 Ever confirmed|0   |1
  Component|tree-optimization   |target

--- Comment #2 from Andrew Pinski  ---
This bug report and the other ones are useless really. Please read
https://gcc.gnu.org/bugs/ and report a decent bug report.