[Bug tree-optimization/110062] missed vectorization in graphicsmagick
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #12 from Richard Biener ---
(In reply to Jan Hubicka from comment #11)
> trunk -O3 -flto -march=native -fopenmp
> Operation: Sharpen:
> 257
> 256
> 256
>
> Average: 256 Iterations Per Minute
> GCC13 -O3 -flto -march=native -fopenmp
> 257
> 256
> 256
>
> Average: 256 Iterations Per Minute
> clang17 O3 -flto -march=native -fopenmp
> Operation: Sharpen:
> 257
> 256
> 256
> Average: 256 Iterations Per Minute
>
> So I guess I will need to try on zen3 to see if there is any difference.
>
> the internal loop is:
>   0.00 │460:┌─→movzbl      0x2(%rdx,%rax,4),%esi
>   0.02 │    │  vmovss      (%r8,%rax,4),%xmm2
>   0.95 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
>  20.22 │    │  movzbl      0x1(%rdx,%rax,4),%esi
>   0.01 │    │  vfmadd231ss %xmm1,%xmm2,%xmm3
>  11.97 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
>  18.76 │    │  movzbl      (%rdx,%rax,4),%esi
>   0.00 │    │  inc         %rax
>   0.72 │    │  vfmadd231ss %xmm1,%xmm2,%xmm4
>  12.55 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
>  14.95 │    │  vfmadd231ss %xmm1,%xmm2,%xmm5
>  15.93 │    ├──cmp         %rax,%r13
>   0.35 │    └──jne         460
>
> so it still does not get

As said the VF is going to be prohibitively large, likely the vector code
is never entered and the above is the scalar "epilogue".
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #11 from Jan Hubicka ---
trunk -O3 -flto -march=native -fopenmp
Operation: Sharpen:
257
256
256

Average: 256 Iterations Per Minute
GCC13 -O3 -flto -march=native -fopenmp
257
256
256

Average: 256 Iterations Per Minute
clang17 O3 -flto -march=native -fopenmp
Operation: Sharpen:
257
256
256
Average: 256 Iterations Per Minute

So I guess I will need to try on zen3 to see if there is any difference.

the internal loop is:
  0.00 │460:┌─→movzbl      0x2(%rdx,%rax,4),%esi
  0.02 │    │  vmovss      (%r8,%rax,4),%xmm2
  0.95 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 20.22 │    │  movzbl      0x1(%rdx,%rax,4),%esi
  0.01 │    │  vfmadd231ss %xmm1,%xmm2,%xmm3
 11.97 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 18.76 │    │  movzbl      (%rdx,%rax,4),%esi
  0.00 │    │  inc         %rax
  0.72 │    │  vfmadd231ss %xmm1,%xmm2,%xmm4
 12.55 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 14.95 │    │  vfmadd231ss %xmm1,%xmm2,%xmm5
 15.93 │    ├──cmp         %rax,%r13
  0.35 │    └──jne         460

so it still does not get
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #10 from Richard Biener ---
We now also apply SLP vectorizing the loop, but as said the high VF is
probably prohibitive and causes quite some spilling:

.L7:
        vmovdqu (%r14), %ymm2
        vmovdqu 32(%r14), %ymm1
        subq    $-128, %r14
        subq    $-128, %rdx
        vmovups -128(%rdx), %ymm10
        vmovdqu -64(%r14), %ymm0
        vpshufb .LC7(%rip), %ymm2, %ymm4
        vmovups -96(%rdx), %ymm9
        vmovups -64(%rdx), %ymm8
        vpshufb .LC8(%rip), %ymm1, %ymm3
        vpermq  $78, %ymm4, %ymm4
        vpermq  $78, %ymm3, %ymm3
...
        vmulps  %ymm7, %ymm0, %ymm0
        vaddps  136(%rsp), %ymm0, %ymm7
        vaddps  %ymm3, %ymm15, %ymm15
        vmovaps %ymm4, 168(%rsp)
        vmovaps %ymm7, 136(%rsp)
        cmpq    %r13, %r14
        jne     .L7

Maybe we should more aggressively reject vectorization when the VF is
equal to the smallest element number of vector lanes. When we then also
detect SLP this usually means BB-level SLP can do something.

Note we fail to support V2SF -> V2QI now, not sure what changed here.
vectorizable_conversion doesn't support float->int->short->char but only
either float->char, float->int->char or float->short->char, but at least
for 2-element vectors we don't support these (the vectorizer could support
extra intermediate steps as well).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062 --- Comment #9 from Richard Biener --- Note SLPing k[u] won't help to reduce the VF, only selecting a smaller vector size would. The alternative is to have a power-of-two group size by using masking for the 'opacity' field.
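
For illustration only, the power-of-two group idea can be approximated at the source level by accumulating all four channels and giving the 'opacity' lane a zero weight. This is a rough sketch, not what the vectorizer's masking would emit: `accumulate_row` and the `acc` array are hypothetical names that do not appear in the bug report, and multiplying by zero only stands in for a masked lane.

```c
struct ipixel { unsigned char red, green, blue, opacity; };

/* Hypothetical helper (not from the report): one row of the inner loop,
   written so that every iteration touches all four bytes of r[u].
   The opacity lane is "masked" by a zero weight, so acc[3] keeps its
   initial value while the access group has four elements.  */
void accumulate_row(const float *k, const struct ipixel *r, int width,
                    float acc[4])
{
  for (int u = 0; u < width; u++)
    {
      acc[0] += k[u] * r[u].red;
      acc[1] += k[u] * r[u].green;
      acc[2] += k[u] * r[u].blue;
      acc[3] += 0.0f * r[u].opacity;   /* masked-out lane */
    }
}
```

Whether GCC actually treats this as a four-element SLP group depends on the compiler version and target; the sketch only shows the shape of the access pattern being discussed.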
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #8 from Richard Biener ---
Since r14-2007-g6f19cf7526168f we now vectorize the loop but without SLP
which means we get interleaving and a vectorization factor of 64.

Turning off loop vectorization yields the following which is now comparable
to what clang does. Of course the loop vectorized interleaving is
inefficient in the end ...

        .p2align 4
        .p2align 3
.L3:
        movq    %rax, %rdx
        movq    %rdi, %rax
        .p2align 4
        .p2align 3
.L4:
        vpinsrw $0, (%rax), %xmm0, %xmm0
        vmovss  (%rdx), %xmm1
        movzbl  2(%rax), %ecx
        addq    $4, %rdx
        addq    $4, %rax
        vpmovzxbd       %xmm0, %xmm0
        vmovsldup       %xmm1, %xmm4
        vcvtdq2ps       %xmm0, %xmm0
        vfmadd231ps     %xmm4, %xmm0, %xmm2
        vcvtsi2ssl      %ecx, %xmm5, %xmm0
        vfmadd231ss     %xmm0, %xmm1, %xmm3
        cmpq    %rsi, %rdx
        jne     .L4
        incl    %r9d
        movq    %r11, %rax
        addq    %rbx, %rdi
        addq    %rbp, %rsi
        cmpl    %r9d, %r10d
        je      .L2
        addq    %rbp, %r11
        jmp     .L3
        .p2align 4
        .p2align 3
.L2:
        vcvttps2dq      %xmm2, %xmm2
        vpmovdb %xmm2, %xmm2
        popq    %rbx
        .cfi_def_cfa_offset 16
        vcvttss2sil     %xmm3, %eax
        popq    %rbp
        .cfi_def_cfa_offset 8
        vpextrw $0, %xmm2, (%r8)
        movb    %al, 2(%r8)
        movb    $-1, 3(%r8)
        ret

The loop cost modeling looks like

t.c:9:23: note:  Cost model analysis:
  Vector inside of loop cost: 1156
  Vector prologue cost: 24
  Vector epilogue cost: 5488
  Scalar iteration cost: 168
  Scalar outside cost: 32
  Vector outside cost: 5512
  prologue iterations: 0
  epilogue iterations: 32
  Calculated minimum iters for profitability: 33
t.c:9:23: note:  Runtime profitability threshold = 64
t.c:9:23: note:  Static estimate profitability threshold = 64

and we get a VF == 32 vectorized epilog as well:

t.c:9:23: note:  Cost model analysis:
  Vector inside of loop cost: 620
  Vector prologue cost: 12
  Vector epilogue cost: 2752
  Scalar iteration cost: 168
  Scalar outside cost: 32
  Vector outside cost: 2764
  prologue iterations: 0
  epilogue iterations: 16
  Calculated minimum iters for profitability: 17
t.c:9:23: note:  Runtime profitability threshold = 32
t.c:9:23: note:  Static estimate profitability threshold = 32

so at least we'll enter the BB SLP optimized scalar epilog in the likely
case.
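
The "minimum iters" figures in the two dumps can be reproduced with a small break-even calculation. The following is a sketch under an assumption: the formula is written from memory of how the vectorizer compares scalar and vector cost and is not quoted from this thread, and the function name and parameter names are made up for the example. The VF values (64 and 32) are the ones stated in the comment.

```c
/* Assumed break-even formula (not taken from the report): smallest
   iteration count at which the vector version, including its outside
   costs and peeled iterations, is cheaper than the scalar loop.
   Returns -1 when one vector iteration is not cheaper than VF scalar
   iterations, i.e. vectorization never pays off.  */
int min_profitable_iters(int vec_inside, int vec_outside,
                         int scalar_iter, int scalar_outside,
                         int vf, int prologue_iters, int epilogue_iters)
{
  int outside_diff = vec_outside - scalar_outside;
  int num = outside_diff * vf
            - vec_inside * prologue_iters
            - vec_inside * epilogue_iters;
  int den = scalar_iter * vf - vec_inside;
  if (den <= 0)
    return -1;
  int iters = num / den;
  /* Round up when the scalar loop is still no more expensive.  */
  if (scalar_iter * vf * iters <= vec_inside * iters + outside_diff * vf)
    iters++;
  return iters;
}
```

Under this assumed formula, `min_profitable_iters(1156, 5512, 168, 32, 64, 0, 32)` gives 33 and `min_profitable_iters(620, 2764, 168, 32, 32, 0, 16)` gives 17, matching the dumps. Both values are below the respective VF, which is consistent with the reported thresholds of 64 and 32 being the VF itself.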
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #7 from Hongtao.liu ---
> pixel$red_60(D)(10)>, type of def: internal
> t.c:18:27: missed: no optab.
> t.c:18:27: missed: not vectorized: relevant stmt not supported: _29 =
> (unsigned char) pixel$red_78;
> t.c:18:27: note: Building vector operands of 0x4215e90 from scalars instead
>
> that's float -> unsigned char
>

A patch is posted to support vectorization between float and unsigned char
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620402.html
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #6 from Richard Biener ---
Btw, we would also be able to vectorize just the red and green channel:

t.c:18:27: note: * Analysis succeeded with vector mode V4SF
t.c:18:27: note: SLPing BB part
t.c:18:27: note: Costing subgraph:
t.c:18:27: note: node 0x420b6c8 (max_nunits=2, refcnt=1) vector(2) unsigned char
t.c:18:27: note: op template: q_45(D)->red = _29;
t.c:18:27: note:     stmt 0 q_45(D)->red = _29;
t.c:18:27: note:     stmt 1 q_45(D)->green = _31;
t.c:18:27: note:     children 0x420b750
t.c:18:27: note: node (external) 0x420b750 (max_nunits=2, refcnt=1) vector(2) unsigned char
t.c:18:27: note:     stmt 0 _29 = (unsigned char) pixel$red_78;
t.c:18:27: note:     stmt 1 _31 = (unsigned char) pixel$green_84;
t.c:18:27: note:     children 0x420b7d8
t.c:18:27: note: node 0x420b7d8 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>
t.c:18:27: note:     stmt 0 pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>
t.c:18:27: note:     stmt 1 pixel$green_84 = PHI <_144(11), pixel$green_61(D)(10)>
t.c:18:27: note:     children 0x420b860 0x420be38
t.c:18:27: note: node 0x420b860 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _142 = PHI <_143(4)>
t.c:18:27: note:     stmt 0 _142 = PHI <_143(4)>
t.c:18:27: note:     stmt 1 _144 = PHI <_145(4)>
t.c:18:27: note:     children 0x420b8e8
t.c:18:27: note: node 0x420b8e8 (max_nunits=2, refcnt=2) vector(2) float
t.c:18:27: note: op template: _143 = PHI <_12(3)>
t.c:18:27: note:     stmt 0 _143 = PHI <_12(3)>
t.c:18:27: note:     stmt 1 _145 = PHI <_17(3)>
t.c:18:27: note:     children 0x420b970
t.c:18:27: note: node 0x420b970 (max_nunits=2, refcnt=2) vector(2) float
t.c:18:27: note: op template: _12 = _11 + pixel$red_80;
t.c:18:27: note:     stmt 0 _12 = _11 + pixel$red_80;
t.c:18:27: note:     stmt 1 _17 = _16 + pixel$green_82;
t.c:18:27: note:     children 0x420b9f8 0x420bca0
t.c:18:27: note: node 0x420b9f8 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _11 = _4 * _10;
t.c:18:27: note:     stmt 0 _11 = _4 * _10;
t.c:18:27: note:     stmt 1 _16 = _4 * _15;
t.c:18:27: note:     children 0x420ba80 0x420bb08
t.c:18:27: note: node (external) 0x420ba80 (max_nunits=1, refcnt=1) vector(2) float
t.c:18:27: note:     { _4, _4 }
t.c:18:27: note: node 0x420bb08 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _10 = (float) _9;
t.c:18:27: note:     stmt 0 _10 = (float) _9;
t.c:18:27: note:     stmt 1 _15 = (float) _14;
t.c:18:27: note:     children 0x420bb90
t.c:18:27: note: node (external) 0x420bb90 (max_nunits=2, refcnt=1) vector(2) int
t.c:18:27: note:     stmt 0 _9 = (int) _8;
t.c:18:27: note:     stmt 1 _14 = (int) _13;
t.c:18:27: note:     children 0x420bc18
t.c:18:27: note: node 0x420bc18 (max_nunits=2, refcnt=1) vector(2) unsigned char
t.c:18:27: note: op template: _8 = _7->red;
t.c:18:27: note:     stmt 0 _8 = _7->red;
t.c:18:27: note:     stmt 1 _13 = _7->green;
t.c:18:27: note: node 0x420bca0 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_80 = PHI <_12(9), pixel$red_79(5)>
t.c:18:27: note:     stmt 0 pixel$red_80 = PHI <_12(9), pixel$red_79(5)>
t.c:18:27: note:     stmt 1 pixel$green_82 = PHI <_17(9), pixel$green_85(5)>
t.c:18:27: note:     children 0x420b970 0x420bd28
t.c:18:27: note: node 0x420bd28 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_79 = PHI <_143(8), pixel$red_60(D)(7)>
t.c:18:27: note:     stmt 0 pixel$red_79 = PHI <_143(8), pixel$red_60(D)(7)>
t.c:18:27: note:     stmt 1 pixel$green_85 = PHI <_145(8), pixel$green_61(D)(7)>
t.c:18:27: note:     children 0x420b8e8 0x420bdb0
t.c:18:27: note: node (external) 0x420bdb0 (max_nunits=1, refcnt=1) vector(2) float
t.c:18:27: note:     { pixel$red_60(D), pixel$green_61(D) }
t.c:18:27: note: node (external) 0x420be38 (max_nunits=1, refcnt=1) vector(2) float
t.c:18:27: note:     { pixel$red_60(D), pixel$green_61(D) }

But the '(external)' nodes show that we're missing support for some
operations:

t.c:18:27: note: ==> examining statement: _29 = (unsigned char) pixel$red_78;
t.c:18:27: note: vect_is_simple_use: operand pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>, type of def: internal
t.c:18:27: missed: conversion not supported by target.
t.c:18:27: note: vect_is_simple_use: operand pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>, type of def: internal
t.c:18:27: missed: no optab.
t.c:18:27: missed: not vectorized: relevant stmt not supported: _29 = (unsigned char) pixel$red_78;
t.c:18:27: note: Building vector operands of 0x4215e90 from scalars instead

that's float -> unsigned char for the stores:

  q->red=pixel.red;
  q->green=pixel.green;

we then cut the SLP off from that node,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062 --- Comment #5 from Jan Hubicka --- In sharpening the number of iterations depends on sharpen radius. Not sure what it is for the benchmark, but in normal situations the number of iterations is indeed not very large. However clang simply slp vectorizes the red channels into vector of size 2.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

Richard Biener changed:

    What     |Removed                        |Added
    Assignee |unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
    Status   |NEW                            |ASSIGNED

--- Comment #4 from Richard Biener ---
So we fail to vectorize the outer loop (with double reduction) because of

t.c:7:25: note: === vect_analyze_data_ref_accesses ===
t.c:7:25: note: Detected interleaving load _7->red and _7->green
t.c:7:25: note: Detected interleaving load _7->red and _7->blue
t.c:7:25: note: grouped access in outer loop.
t.c:7:25: missed: not vectorized: complicated access pattern.

for vectorizing the inner loop SLP discovery fails because of a not grouped
load - r[u].{red,green,blue} is handled but k[u] not - I think this is a
well-known limitation (that ought to be fixed). We then vectorize the loop
with interleaving and peeling for gaps, but profitability says 'width' needs
to be 16. We also vectorize the epilog. I suppose the vectorized body
isn't entered?

Note outer loop vectorization likely isn't profitable even if implemented,
so the SLP failure is the thing to fix (which should be easy). Need to find
the duplicate bug for this.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

Jan Hubicka changed:

    What     |Removed  |Added
    Status   |WAITING  |NEW

--- Comment #3 from Jan Hubicka ---
#include <stddef.h>
struct pixel {float red, green, blue, opacity;};
struct ipixel {unsigned char red, green, blue, opacity;};
void
test(float *k, struct ipixel *r, int width, int columns, struct ipixel *q)
{
  struct pixel pixel;
  for (int v=0; v < width; v++)
    {
      for (int u=0; u < width; u++)
        {
          pixel.red+=k[u]*r[u].red;
          pixel.green+=k[u]*r[u].green;
          pixel.blue+=k[u]*r[u].blue;
        }
      k+=width;
      r+=(size_t) columns+width;
    }
  q->red=pixel.red;
  q->green=pixel.green;
  q->blue=pixel.blue;
  q->opacity=255;
}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

Richard Biener changed:

    What             |Removed      |Added
    CC               |             |rguenth at gcc dot gnu.org
    Status           |UNCONFIRMED  |WAITING
    Ever confirmed   |0            |1
    Last reconfirmed |             |2023-06-02

--- Comment #2 from Richard Biener ---
Can you produce a testcase for the loop?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062 --- Comment #1 from Hongtao.liu --- One of the vectorizer issues is related to PR110018.