[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-11-26 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #12 from Richard Biener  ---
(In reply to Jan Hubicka from comment #11)
> trunk -O3 -flto -march=native -fopenmp
> Operation: Sharpen:
> 257
> 256
> 256
> 
> Average: 256 Iterations Per Minute
> GCC13 -O3 -flto -march=native -fopenmp
> 257
> 256
> 256
> 
> Average: 256 Iterations Per Minute
> clang17 O3 -flto -march=native -fopenmp
>Operation: Sharpen:
> 257
> 256
> 256
> Average: 256 Iterations Per Minute
> 
> So I guess I will need to try on zen3 to see if there is any difference.
> 
> the internal loop is:
>   0.00 │460:┌─→movzbl      0x2(%rdx,%rax,4),%esi
>   0.02 │    │  vmovss      (%r8,%rax,4),%xmm2
>   0.95 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
>  20.22 │    │  movzbl      0x1(%rdx,%rax,4),%esi
>   0.01 │    │  vfmadd231ss %xmm1,%xmm2,%xmm3
>  11.97 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
>  18.76 │    │  movzbl      (%rdx,%rax,4),%esi
>   0.00 │    │  inc         %rax
>   0.72 │    │  vfmadd231ss %xmm1,%xmm2,%xmm4
>  12.55 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
>  14.95 │    │  vfmadd231ss %xmm1,%xmm2,%xmm5
>  15.93 │    ├──cmp         %rax,%r13
>   0.35 │    └──jne         460
> 
> 
> so it still does not get

As said, the VF is going to be prohibitively large; most likely the vector
code is never entered and the above is the scalar "epilogue".
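
To make the "never entered" point concrete, here is a rough sketch of how one
inner-loop invocation splits between the vector loop, the vectorized epilogue
and the scalar epilogue. It is a simplified model, not vectorizer code: it
ignores peeling for gaps and the runtime profitability checks, and the helper
name and struct are made up for illustration. The VF values 64 and 32 used in
the note below are the ones reported in comment #8.

```c
#include <assert.h>

/* How the n iterations of one inner-loop invocation are distributed.  */
struct split {
    int main_vector;     /* iterations executed by the VF-wide main loop */
    int epilogue_vector; /* iterations executed by the vectorized epilogue */
    int scalar;          /* iterations left for the scalar epilogue */
};

/* Simplified model (hypothetical helper): a vector loop can only run
   whole groups of VF iterations; whatever does not fill a group falls
   through to the next, narrower stage.  */
struct split split_iterations(int n, int main_vf, int epilogue_vf)
{
    assert(n >= 0 && main_vf > 0 && epilogue_vf > 0);

    struct split s;
    s.main_vector = (n / main_vf) * main_vf;
    int rest = n - s.main_vector;
    s.epilogue_vector = (rest / epilogue_vf) * epilogue_vf;
    s.scalar = rest - s.epilogue_vector;
    return s;
}
```

With main_vf = 64 and epilogue_vf = 32, any kernel width below 32 ends up with
all of its iterations in the scalar part, which would match the profile above.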

[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-11-25 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #11 from Jan Hubicka  ---
trunk -O3 -flto -march=native -fopenmp
Operation: Sharpen:
257
256
256

Average: 256 Iterations Per Minute
GCC13 -O3 -flto -march=native -fopenmp
257
256
256

Average: 256 Iterations Per Minute
clang17 O3 -flto -march=native -fopenmp
   Operation: Sharpen:
257
256
256
Average: 256 Iterations Per Minute

So I guess I will need to try on zen3 to see if there is any difference.

the internal loop is:
  0.00 │460:┌─→movzbl      0x2(%rdx,%rax,4),%esi
  0.02 │    │  vmovss      (%r8,%rax,4),%xmm2
  0.95 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 20.22 │    │  movzbl      0x1(%rdx,%rax,4),%esi
  0.01 │    │  vfmadd231ss %xmm1,%xmm2,%xmm3
 11.97 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 18.76 │    │  movzbl      (%rdx,%rax,4),%esi
  0.00 │    │  inc         %rax
  0.72 │    │  vfmadd231ss %xmm1,%xmm2,%xmm4
 12.55 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 14.95 │    │  vfmadd231ss %xmm1,%xmm2,%xmm5
 15.93 │    ├──cmp         %rax,%r13
  0.35 │    └──jne         460

so it still does not get

[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-07-31 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #10 from Richard Biener  ---
We now also apply SLP when vectorizing the loop, but as said the high VF is
probably prohibitive and causes quite some spilling:

.L7:
        vmovdqu (%r14), %ymm2
        vmovdqu 32(%r14), %ymm1
        subq    $-128, %r14
        subq    $-128, %rdx
        vmovups -128(%rdx), %ymm10
        vmovdqu -64(%r14), %ymm0
        vpshufb .LC7(%rip), %ymm2, %ymm4
        vmovups -96(%rdx), %ymm9
        vmovups -64(%rdx), %ymm8
        vpshufb .LC8(%rip), %ymm1, %ymm3
        vpermq  $78, %ymm4, %ymm4
        vpermq  $78, %ymm3, %ymm3
        ...
        vmulps  %ymm7, %ymm0, %ymm0
        vaddps  136(%rsp), %ymm0, %ymm7
        vaddps  %ymm3, %ymm15, %ymm15
        vmovaps %ymm4, 168(%rsp)
        vmovaps %ymm7, 136(%rsp)
        cmpq    %r13, %r14
        jne     .L7

Maybe we should more aggressively reject vectorization when the VF is
equal to the smallest element number of vector lanes.  When we then
also detect SLP this usually means BB-level SLP can do something.
Note we fail to support V2SF -> V2QI now, not sure what changed here.
vectorizable_conversion doesn't support float->int->short->char but
only either float->char, float->int->char or float->short->char, but
at least for 2-element vectors we don't support these (the vectorizer
could support extra intermediate steps as well).
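
For reference, a scalar sketch of the two conversion shapes being compared:
the direct float -> char conversion the testcase stores perform, and the
float -> int -> short -> char chain with two intermediate steps. The function
names are made up for illustration, and the equivalence only holds for inputs
in [0, 256), since an out-of-range float to integer conversion is undefined
in C.

```c
#include <assert.h>

/* Direct narrowing, as in q->red = pixel.red in the testcase.  */
unsigned char narrow_direct(float f)
{
    assert(f >= 0.0f && f < 256.0f); /* outside this range the cast is undefined */
    return (unsigned char) f;
}

/* The same narrowing spelled out with two intermediate types.  */
unsigned char narrow_stepwise(float f)
{
    assert(f >= 0.0f && f < 256.0f);
    int i = (int) f;          /* float -> int, truncates toward zero */
    short s = (short) i;      /* int -> short, value still fits */
    return (unsigned char) s; /* short -> char, value still fits */
}
```

Both functions return the same value for every in-range input, so a vectorizer
is free to pick whichever chain of intermediate vector types the target
supports.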

[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-06-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #9 from Richard Biener  ---
Note SLPing k[u] won't help to reduce the VF, only selecting a smaller vector
size would.  The alternative is to have a power-of-two group size by using
masking for the 'opacity' field.
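
A scalar sketch of the masking alternative, assuming struct ipixel is four
consecutive bytes with no padding (checked with a static assertion below).
The first function is the three-channel group from the testcase; the second
treats each pixel as a four-lane group and masks off the 'opacity' lane. All
function names are made up for illustration, and this only models the idea,
not what the vectorizer would emit.

```c
#include <assert.h>

struct ipixel { unsigned char red, green, blue, opacity; };

_Static_assert(sizeof(struct ipixel) == 4, "sketch assumes a packed 4-byte pixel");

/* Group size 3, as in the testcase inner loop.  */
void accumulate_rgb(const float *k, const struct ipixel *r, int width, float acc[3])
{
    assert(width >= 0);
    for (int u = 0; u < width; u++) {
        acc[0] += k[u] * r[u].red;
        acc[1] += k[u] * r[u].green;
        acc[2] += k[u] * r[u].blue;
    }
}

/* Group size 4: all four bytes form one group, lane 3 is masked to zero.  */
void accumulate_rgba_masked(const float *k, const struct ipixel *r, int width, float acc[4])
{
    static const int lane_active[4] = { 1, 1, 1, 0 };

    assert(width >= 0);
    for (int u = 0; u < width; u++) {
        const unsigned char *lanes = (const unsigned char *) &r[u];
        for (int c = 0; c < 4; c++) {
            float v = lane_active[c] ? (float) lanes[c] : 0.0f;
            acc[c] += k[u] * v;
        }
    }
}

/* Returns 1 when both variants agree on red, green and blue.  */
int masked_matches_rgb(void)
{
    const float k[2] = { 0.5f, 2.0f };
    const struct ipixel r[2] = { { 10, 20, 30, 40 }, { 2, 4, 6, 8 } };
    float rgb[3] = { 0 }, rgba[4] = { 0 };

    accumulate_rgb(k, r, 2, rgb);
    accumulate_rgba_masked(k, r, 2, rgba);
    return rgb[0] == rgba[0] && rgb[1] == rgba[1] && rgb[2] == rgba[2]
           && rgba[3] == 0.0f;
}
```

The inputs in masked_matches_rgb are chosen so every product and sum is exact
in float, which keeps the comparison independent of FMA contraction.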

[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-06-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #8 from Richard Biener  ---
Since r14-2007-g6f19cf7526168f we now vectorize the loop but without SLP
which means we get interleaving and a vectorization factor of 64.  Turning
off loop vectorization yields the following which is now comparable to
what clang does.  Of course the loop vectorized interleaving is inefficient
in the end ...

        .p2align 4
        .p2align 3
.L3:
        movq    %rax, %rdx
        movq    %rdi, %rax
        .p2align 4
        .p2align 3
.L4:
        vpinsrw $0, (%rax), %xmm0, %xmm0
        vmovss  (%rdx), %xmm1
        movzbl  2(%rax), %ecx
        addq    $4, %rdx
        addq    $4, %rax
        vpmovzxbd       %xmm0, %xmm0
        vmovsldup       %xmm1, %xmm4
        vcvtdq2ps       %xmm0, %xmm0
        vfmadd231ps     %xmm4, %xmm0, %xmm2
        vcvtsi2ssl      %ecx, %xmm5, %xmm0
        vfmadd231ss     %xmm0, %xmm1, %xmm3
        cmpq    %rsi, %rdx
        jne     .L4
        incl    %r9d
        movq    %r11, %rax
        addq    %rbx, %rdi
        addq    %rbp, %rsi
        cmpl    %r9d, %r10d
        je      .L2
        addq    %rbp, %r11
        jmp     .L3
        .p2align 4
        .p2align 3
.L2:
        vcvttps2dq      %xmm2, %xmm2
        vpmovdb %xmm2, %xmm2
        popq    %rbx
        .cfi_def_cfa_offset 16
        vcvttss2sil     %xmm3, %eax
        popq    %rbp
        .cfi_def_cfa_offset 8
        vpextrw $0, %xmm2, (%r8)
        movb    %al, 2(%r8)
        movb    $-1, 3(%r8)
        ret

The loop cost modeling looks like

t.c:9:23: note:  Cost model analysis:
  Vector inside of loop cost: 1156
  Vector prologue cost: 24
  Vector epilogue cost: 5488
  Scalar iteration cost: 168
  Scalar outside cost: 32
  Vector outside cost: 5512
  prologue iterations: 0
  epilogue iterations: 32
  Calculated minimum iters for profitability: 33
t.c:9:23: note:  Runtime profitability threshold = 64
t.c:9:23: note:  Static estimate profitability threshold = 64

and we get a VF == 32 vectorized epilog as well:

t.c:9:23: note:  Cost model analysis: 
  Vector inside of loop cost: 620
  Vector prologue cost: 12
  Vector epilogue cost: 2752
  Scalar iteration cost: 168
  Scalar outside cost: 32 
  Vector outside cost: 2764
  prologue iterations: 0
  epilogue iterations: 16
  Calculated minimum iters for profitability: 17
t.c:9:23: note:  Runtime profitability threshold = 32
t.c:9:23: note:  Static estimate profitability threshold = 32

so at least we'll enter the BB SLP optimized scalar epilog in the likely case.
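
The "Calculated minimum iters" lines can be reproduced from the other numbers
in each dump with a simple break-even search. This is a simplified model and
not the vectorizer's own code; the function name is made up. It assumes the
scalar loop costs scalar_outside + scalar_iter * n, the vector loop costs
vec_outside + vec_inside * (n - peeled) / VF, and that VF is 64 for the first
dump and 32 for the second, as stated above. Both sides are multiplied by VF
to stay in integers.

```c
#include <assert.h>

/* Smallest iteration count n for which the vector variant is strictly
   cheaper than the scalar one under the simplified model above.  */
int min_profitable_iters(int vec_inside, int vec_outside,
                         int scalar_iter, int scalar_outside,
                         int vf, int prologue_iters, int epilogue_iters)
{
    /* The search only terminates if the vector body is cheaper per
       iteration than the scalar body.  */
    assert(vf > 0 && (long long) scalar_iter * vf > vec_inside);

    int peeled = prologue_iters + epilogue_iters;
    for (int n = peeled; ; n++) {
        long long scalar_cost =
            ((long long) scalar_iter * n + scalar_outside) * vf;
        long long vector_cost =
            (long long) vec_inside * (n - peeled) + (long long) vec_outside * vf;
        if (scalar_cost > vector_cost)
            return n;
    }
}
```

With the figures from the first dump, min_profitable_iters(1156, 5512, 168,
32, 64, 0, 32) gives 33, and with those from the second,
min_profitable_iters(620, 2764, 168, 32, 32, 0, 16) gives 17. The threshold
lines of 64 and 32 that follow in the dumps are not modeled here.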

[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-06-18 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #7 from Hongtao.liu  ---

> pixel$red_60(D)(10)>, type of def: internal
> t.c:18:27: missed:   no optab.
> t.c:18:27: missed:   not vectorized: relevant stmt not supported: _29 =
> (unsigned char) pixel$red_78;
> t.c:18:27: note:   Building vector operands of 0x4215e90 from scalars instead
> 
> that's float -> unsigned char
> 
A patch is posted to support vectorization between float and unsigned char
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620402.html

[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-06-16 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #6 from Richard Biener  ---
Btw, we would also be able to vectorize just the red and green channel:

t.c:18:27: note: * Analysis succeeded with vector mode V4SF
t.c:18:27: note: SLPing BB part
t.c:18:27: note: Costing subgraph:
t.c:18:27: note: node 0x420b6c8 (max_nunits=2, refcnt=1) vector(2) unsigned char
t.c:18:27: note: op template: q_45(D)->red = _29;
t.c:18:27: note:   stmt 0 q_45(D)->red = _29;
t.c:18:27: note:   stmt 1 q_45(D)->green = _31;
t.c:18:27: note:   children 0x420b750
t.c:18:27: note: node (external) 0x420b750 (max_nunits=2, refcnt=1) vector(2) unsigned char
t.c:18:27: note:   stmt 0 _29 = (unsigned char) pixel$red_78;
t.c:18:27: note:   stmt 1 _31 = (unsigned char) pixel$green_84;
t.c:18:27: note:   children 0x420b7d8
t.c:18:27: note: node 0x420b7d8 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>
t.c:18:27: note:   stmt 0 pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>
t.c:18:27: note:   stmt 1 pixel$green_84 = PHI <_144(11), pixel$green_61(D)(10)>
t.c:18:27: note:   children 0x420b860 0x420be38
t.c:18:27: note: node 0x420b860 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _142 = PHI <_143(4)>
t.c:18:27: note:   stmt 0 _142 = PHI <_143(4)>
t.c:18:27: note:   stmt 1 _144 = PHI <_145(4)>
t.c:18:27: note:   children 0x420b8e8
t.c:18:27: note: node 0x420b8e8 (max_nunits=2, refcnt=2) vector(2) float
t.c:18:27: note: op template: _143 = PHI <_12(3)>
t.c:18:27: note:   stmt 0 _143 = PHI <_12(3)>
t.c:18:27: note:   stmt 1 _145 = PHI <_17(3)>
t.c:18:27: note:   children 0x420b970
t.c:18:27: note: node 0x420b970 (max_nunits=2, refcnt=2) vector(2) float
t.c:18:27: note: op template: _12 = _11 + pixel$red_80;
t.c:18:27: note:   stmt 0 _12 = _11 + pixel$red_80;
t.c:18:27: note:   stmt 1 _17 = _16 + pixel$green_82;
t.c:18:27: note:   children 0x420b9f8 0x420bca0
t.c:18:27: note: node 0x420b9f8 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _11 = _4 * _10;
t.c:18:27: note:   stmt 0 _11 = _4 * _10;
t.c:18:27: note:   stmt 1 _16 = _4 * _15;
t.c:18:27: note:   children 0x420ba80 0x420bb08
t.c:18:27: note: node (external) 0x420ba80 (max_nunits=1, refcnt=1) vector(2) float
t.c:18:27: note:   { _4, _4 }
t.c:18:27: note: node 0x420bb08 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _10 = (float) _9;
t.c:18:27: note:   stmt 0 _10 = (float) _9;
t.c:18:27: note:   stmt 1 _15 = (float) _14;
t.c:18:27: note:   children 0x420bb90
t.c:18:27: note: node (external) 0x420bb90 (max_nunits=2, refcnt=1) vector(2) int
t.c:18:27: note:   stmt 0 _9 = (int) _8;
t.c:18:27: note:   stmt 1 _14 = (int) _13;
t.c:18:27: note:   children 0x420bc18
t.c:18:27: note: node 0x420bc18 (max_nunits=2, refcnt=1) vector(2) unsigned char
t.c:18:27: note: op template: _8 = _7->red;
t.c:18:27: note:   stmt 0 _8 = _7->red;
t.c:18:27: note:   stmt 1 _13 = _7->green;
t.c:18:27: note: node 0x420bca0 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_80 = PHI <_12(9), pixel$red_79(5)>
t.c:18:27: note:   stmt 0 pixel$red_80 = PHI <_12(9), pixel$red_79(5)>
t.c:18:27: note:   stmt 1 pixel$green_82 = PHI <_17(9), pixel$green_85(5)>
t.c:18:27: note:   children 0x420b970 0x420bd28
t.c:18:27: note: node 0x420bd28 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_79 = PHI <_143(8), pixel$red_60(D)(7)>
t.c:18:27: note:   stmt 0 pixel$red_79 = PHI <_143(8), pixel$red_60(D)(7)>
t.c:18:27: note:   stmt 1 pixel$green_85 = PHI <_145(8), pixel$green_61(D)(7)>
t.c:18:27: note:   children 0x420b8e8 0x420bdb0
t.c:18:27: note: node (external) 0x420bdb0 (max_nunits=1, refcnt=1) vector(2) float
t.c:18:27: note:   { pixel$red_60(D), pixel$green_61(D) }
t.c:18:27: note: node (external) 0x420be38 (max_nunits=1, refcnt=1) vector(2) float
t.c:18:27: note:   { pixel$red_60(D), pixel$green_61(D) }

But the '(external)' show that we're missing support for some operations:

t.c:18:27: note:   ==> examining statement: _29 = (unsigned char) pixel$red_78;
t.c:18:27: note:   vect_is_simple_use: operand pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>, type of def: internal
t.c:18:27: missed:   conversion not supported by target.
t.c:18:27: note:   vect_is_simple_use: operand pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>, type of def: internal
t.c:18:27: missed:   no optab.
t.c:18:27: missed:   not vectorized: relevant stmt not supported: _29 = (unsigned char) pixel$red_78;
t.c:18:27: note:   Building vector operands of 0x4215e90 from scalars instead

that's float -> unsigned char

for the stores:

q->red=pixel.red;
q->green=pixel.green;

we then cut the SLP off from that node, 

[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-06-07 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #5 from Jan Hubicka  ---
In sharpening, the number of iterations depends on the sharpen radius. I am not
sure what it is for the benchmark, but in normal situations the number of
iterations is indeed not very large.

However, clang simply SLP vectorizes the red channels into a vector of size 2.

[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-06-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #4 from Richard Biener  ---
So we fail to vectorize the outer loop (with double reduction) because of

t.c:7:25: note:   === vect_analyze_data_ref_accesses ===
t.c:7:25: note:   Detected interleaving load _7->red and _7->green
t.c:7:25: note:   Detected interleaving load _7->red and _7->blue
t.c:7:25: note:   grouped access in outer loop.
t.c:7:25: missed:   not vectorized: complicated access pattern.

For vectorizing the inner loop, SLP discovery fails because of a non-grouped
load: r[u].{red,green,blue} is handled but k[u] is not. I think this is a
well-known limitation (that ought to be fixed).  We then vectorize the loop
with interleaving and peeling for gaps, but profitability says 'width' needs
to be 16.  We also vectorize the epilog.

I suppose the vectorized body isn't entered?

Note outer loop vectorization likely isn't profitable even if implemented,
so the SLP failure is the thing to fix (which should be easy).  Need to
find the duplicate bug for this.
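
A hand-written sketch of the shape SLP would need for the inner loop: the
three channel loads of r[u] form one group, while k[u] is a single load that
has to be broadcast to every lane instead of being part of a group. This is
plain C with small arrays standing in for vector lanes; the function names
are made up for illustration, and nothing here is meant to describe how the
vectorizer represents it internally.

```c
#include <assert.h>

struct ipixel { unsigned char red, green, blue, opacity; };

/* Inner loop with explicit lanes: chan[] is the grouped load of the
   three channels, splat[] is the single k[u] load copied to each lane.  */
void inner_loop_splat(const float *k, const struct ipixel *r, int width, float acc[3])
{
    assert(width >= 0);
    for (int u = 0; u < width; u++) {
        const float splat[3] = { k[u], k[u], k[u] };
        const float chan[3] = { (float) r[u].red, (float) r[u].green,
                                (float) r[u].blue };
        for (int lane = 0; lane < 3; lane++)
            acc[lane] += splat[lane] * chan[lane];
    }
}

/* Returns 1 when the lane form gives the expected sums for a small input
   whose products and sums are all exact in float.  */
int splat_selfcheck(void)
{
    const float k[2] = { 0.5f, 2.0f };
    const struct ipixel r[2] = { { 10, 20, 30, 40 }, { 2, 4, 6, 8 } };
    float acc[3] = { 0 };

    inner_loop_splat(k, r, 2, acc);
    return acc[0] == 9.0f && acc[1] == 18.0f && acc[2] == 27.0f;
}
```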

[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-06-06 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

Jan Hubicka  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW

--- Comment #3 from Jan Hubicka  ---
#include <stddef.h> /* for size_t */
struct pixel {float red, green, blue, opacity;};
struct ipixel {unsigned char red, green, blue, opacity;};
void test(float *k, struct ipixel *r, int width, int columns, struct ipixel *q)
{
  struct pixel pixel; /* left uninitialized, as in the report */
  for (int v=0; v < width; v++)
    {
      for (int u=0; u < width; u++)
        {
          pixel.red+=k[u]*r[u].red;
          pixel.green+=k[u]*r[u].green;
          pixel.blue+=k[u]*r[u].blue;
        }
      k+=width;
      r+=(size_t) columns+width;
    }
  q->red=pixel.red;
  q->green=pixel.green;
  q->blue=pixel.blue;
  q->opacity=255;
}

[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-06-02 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
             Status|UNCONFIRMED                 |WAITING
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2023-06-02

--- Comment #2 from Richard Biener  ---
Can you produce a testcase for the loop?

[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-06-01 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #1 from Hongtao.liu  ---
One of the vectorizer issues is related to PR110018.