[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 Andrew Pinski changed: What|Removed |Added Target Milestone|--- |12.0
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 Richard Biener changed: What|Removed |Added Status|NEW |RESOLVED Version|unknown |12.0 Resolution|--- |FIXED --- Comment #20 from Richard Biener --- The division is now vectorized, your short testcase produces t: .LFB0: .cfi_startproc movss f(%rip), %xmm4 movss test+8(%rip), %xmm3 movqtest(%rip), %xmm0 mulss %xmm4, %xmm3 movaps %xmm4, %xmm1 shufps $0xe0, %xmm1, %xmm1 mulps %xmm1, %xmm0 movhps .LC0(%rip), %xmm1 rcpps %xmm1, %xmm2 sqrtss %xmm3, %xmm3 mulps %xmm2, %xmm1 sqrtps %xmm0, %xmm0 divss %xmm4, %xmm3 mulps %xmm2, %xmm1 addps %xmm2, %xmm2 subps %xmm1, %xmm2 mulps %xmm2, %xmm0 movlps %xmm0, test(%rip) movss %xmm3, test+8(%rip) ret
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 Jan Hubicka changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzill ||a/show_bug.cgi?id=88602 --- Comment #19 from Jan Hubicka --- It turns out that there is yeat another #ifdef __clang__ I missed in the gfx library, so the vectorised code produced by clang is hand written in the extensions discussed in PR88602. Sorry for confusion. However I think the simplified testcase is still perfectly vectorisable and we should add the div patterns as discussed.
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #18 from CVS Commits --- The master branch has been updated by Uros Bizjak : https://gcc.gnu.org/g:8f921393e339090566c1589d81009caa954de90d commit r12-6113-g8f921393e339090566c1589d81009caa954de90d Author: Uros Bizjak Date: Fri Dec 24 17:09:36 2021 +0100 i386: Add V2SFmode DIV insn pattern [PR95046, PR103797] Use V4SFmode "DIVPS X,Y" with [y0, y1, 1.0f, 1.0f] as a divisor to avoid division by zero. 2021-12-24 Uroš Bizjak gcc/ChangeLog: PR target/95046 PR target/103797 * config/i386/mmx.md (divv2sf3): New instruction pattern. gcc/testsuite/ChangeLog: PR target/95046 PR target/103797 * gcc.target/i386/pr95046-1.c (test_div): Add. (dg-options): Add -mno-recip.
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #17 from Uroš Bizjak --- (In reply to hubicka from comment #16) > > > > > > It could be done, but I was under impression that the sequence to load > > > 1.0f > > > into topmost elements nullifies the benefit of operation to divide two > > > > Sure, so perhaps we should somewhat increase the vectorization cost of > > V2SFmode > > division so that we would use it only if it is part of longer sequences? > > I wonder how the hardware implements it. If divps is of similar latency > as divss then I guess it is essentially always win to load 1.0 to the > upper part, since it is slow operation. On the other hand if divps is > about 4 times divss, then this may be harmful. > > Agner Fog seems to be listing divss and divps with same latencies. > For zen it is 10 cycles which should be enough to do the setup. OK, I'll prepare and test a formal patch.
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #16 from hubicka at kam dot mff.cuni.cz --- > > > > It could be done, but I was under impression that the sequence to load 1.0f > > into topmost elements nullifies the benefit of operation to divide two > > Sure, so perhaps we should somewhat increase the vectorization cost of > V2SFmode > division so that we would use it only if it is part of longer sequences? I wonder how the hardware implements it. If divps is of similar latency as divss then I guess it is essentially always win to load 1.0 to the upper part, since it is slow operation. On the other hand if divps is about 4 times divss, then this may be harmful. Agner Fog seems to be listing divss and divps with same latencies. For zen it is 10 cycles which should be enough to do the setup.
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #15 from Jakub Jelinek --- (In reply to Uroš Bizjak from comment #12) > (In reply to Jakub Jelinek from comment #10) > > At least on your short testcase clang doesn't use divps either. > > We do support mulv2sf3, addv2sf3 etc. but not divv2sf3 I bet because with > > TARGET_MMX_WITH_SSE it would divide by zero in the 3rd and 4th elts, > > but perhaps we could insert 1.0f, 1.0f into those elements of the divisor > > before using divps? > > It could be done, but I was under impression that the sequence to load 1.0f > into topmost elements nullifies the benefit of operation to divide two Sure, so perhaps we should somewhat increase the vectorization cost of V2SFmode division so that we would use it only if it is part of longer sequences?
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #14 from Uroš Bizjak --- (In reply to Uroš Bizjak from comment #13) > Created attachment 52051 [details] > Patch that implements v2sf division This patch also enables vectorization of the testcase from Comment #7. Using -ffast-math, it also generates vectorized reciprocal: movss f(%rip), %xmm4 movss test+8(%rip), %xmm3 movqtest(%rip), %xmm2 mulss %xmm4, %xmm3 movaps %xmm4, %xmm0 shufps $0xe0, %xmm0, %xmm0 mulps %xmm0, %xmm2 movhps .LC0(%rip), %xmm0 --> rcpps %xmm0, %xmm1 sqrtss %xmm3, %xmm3 mulps %xmm1, %xmm0 sqrtps %xmm2, %xmm2 divss %xmm4, %xmm3 movaps %xmm2, %xmm5 mulps %xmm1, %xmm0 addps %xmm1, %xmm1 subps %xmm0, %xmm1 mulps %xmm1, %xmm5 movlps %xmm5, test(%rip) movss %xmm3, test+8(%rip) ret
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #13 from Uroš Bizjak --- Created attachment 52051 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52051=edit Patch that implements v2sf division Please try the attached patch, for the following testcase: --cut here-- float a[2], b[2], r[2]; void bar (void) { int i; for (i = 0; i < 2; i++) r[i] = a[i] / b[i]; } --cut here-- the compiler generates: movqb(%rip), %xmm1 movqa(%rip), %xmm0 movhps .LC0(%rip), %xmm1 divps %xmm1, %xmm0 movlps %xmm0, r(%rip) ret
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #12 from Uroš Bizjak --- (In reply to Jakub Jelinek from comment #10) > At least on your short testcase clang doesn't use divps either. > We do support mulv2sf3, addv2sf3 etc. but not divv2sf3 I bet because with > TARGET_MMX_WITH_SSE it would divide by zero in the 3rd and 4th elts, > but perhaps we could insert 1.0f, 1.0f into those elements of the divisor > before using divps? It could be done, but I was under impression that the sequence to load 1.0f into topmost elements nullifies the benefit of operation to divide two elements. However, if the missing pattern prevents longer vectorized chains, this is not entirely true. The division can be implemented in the same way as sse_cvtps2pi, but using CONST1_RTX vector.
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #11 from Jan Hubicka --- Aha, I did not noticed that we need special patterns (I extecpted this is problem to solve in machine independent code). So I guess we have 1) SLP should vectorize the 3 accesses with -ffast-math to only one vector operation (as opposed to one vector+one scalar it does now) 2) we could adddivv2sf3 pattern which initializes the elt 4 of the operand2 to 1.0f to avoid funny results 3) we need to figure out why SLP vectorization is not even considered in the original testcase (which I do not seem to be able to dig out with reasonable effort in a way that it preserves original properties - to be vectorized by clang and not vectorized by gcc)
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 Jakub Jelinek changed: What|Removed |Added CC||jakub at gcc dot gnu.org, ||uros at gcc dot gnu.org --- Comment #10 from Jakub Jelinek --- At least on your short testcase clang doesn't use divps either. We do support mulv2sf3, addv2sf3 etc. but not divv2sf3 I bet because with TARGET_MMX_WITH_SSE it would divide by zero in the 3rd and 4th elts, but perhaps we could insert 1.0f, 1.0f into those elements of the divisor before using divps? Another question is if we could teach SLP to vectorize even factors not power of 2, say loads/stores could be done (and with e.g. AVX512 almost everything) could be done with masked loads/stores, most arithmetics could be done normally and we'd just need to watch what values we'll get in the extra elts and make sure it doesn't generate exceptions etc.
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #9 from hubicka at kam dot mff.cuni.cz --- > recip pass happens after vectorization > I don't know/understand why though. Yep, I suppose we want to either special case this in vectorizer or make it earlier... I also wonder why the code is vectorized for pairs of values and third one is computed separately and why we don't use vectors of length 4...
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #8 from Andrew Pinski --- (In reply to Jan Hubicka from comment #7) > Having this however I do not see slp analyzing the divide in the original > code at all. recip pass happens after vectorization I don't know/understand why though.
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 Jan Hubicka changed: What|Removed |Added Status|WAITING |NEW --- Comment #7 from Jan Hubicka --- OK, here is completely fake testcase that does similar operaitons: #include struct test {float x; float y; float z;} test; float f; void t() { float x = test.x; float y = test.y; float z = test.z; x = x * f; y = y * f; z = z * f; x = sqrt (x); y = sqrt (y); z = sqrt (z); x = x / f; y = y / f; z = z / f; test.x=x; test.y=y; test.z=z; } We seem to fail to vectorize it with: t.c:20:9: missed: op not supported by target. t.c:17:5: missed: not vectorized: relevant stmt not supported: x_15 = x_24 / f.0_1; clang seems to use divps happilly, so I am not sure why it is not supported. Even more funny is that with -Ofast it is compiled into multiplication by reciprocal: t: .LFB0: .cfi_startproc movss f(%rip), %xmm4 movss .LC0(%rip), %xmm2 movss test(%rip), %xmm0 movss test+4(%rip), %xmm3 divss %xmm4, %xmm2 movss test+8(%rip), %xmm1 mulss %xmm4, %xmm0 mulss %xmm4, %xmm3 mulss %xmm4, %xmm1 sqrtss %xmm0, %xmm0 sqrtss %xmm3, %xmm3 sqrtss %xmm1, %xmm1 mulss %xmm2, %xmm0 mulss %xmm2, %xmm3 mulss %xmm2, %xmm1 unpcklps%xmm3, %xmm0 movlps %xmm0, test(%rip) movss %xmm1, test+8(%rip) ret and rewriting it that way by hand: #include struct test {float x; float y; float z;} test; float f; void t() { float x = test.x; float y = test.y; float z = test.z; float m = 1/f; x = x * f; y = y * f; z = z * f; x = sqrt (x); y = sqrt (y); z = sqrt (z); x = x * m; y = y * m; z = z * m; test.x=x; test.y=y; test.z=z; } gets the expected result: t: .LFB0: .cfi_startproc movss f(%rip), %xmm0 movqtest(%rip), %xmm1 movaps %xmm0, %xmm2 shufps $0xe0, %xmm2, %xmm2 mulps %xmm1, %xmm2 movss .LC0(%rip), %xmm1 divss %xmm0, %xmm1 mulss test+8(%rip), %xmm0 sqrtps %xmm2, %xmm2 sqrtss %xmm0, %xmm0 movaps %xmm1, %xmm3 shufps $0xe0, %xmm3, %xmm3 mulss %xmm0, %xmm1 mulps %xmm3, %xmm2 movss %xmm1, test+8(%rip) movlps %xmm2, test(%rip) ret .cfi_endproc Having this however I do not see slp analyzing the divide in the original code at all.
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #6 from Martin Liška --- You may try exporting GIMPLE IL that can be consumed with -fgimple.
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #5 from hubicka at kam dot mff.cuni.cz --- Created attachment 52042 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52042=edit b.slp1
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #4 from hubicka at kam dot mff.cuni.cz --- > -E and remove not needed code. > > > The > > declaratoins are quite convoluted, but the function is well isolated and > > easy to inspect from full one... > > Do we speak about: > https://github.com/mozilla/gecko-dev/blob/bd25b1ca76dd5d323ffc69557f6cf759ba76ba23/gfx/2d/FilterNodeSoftware.cpp#L3670-L3691 > ? Yes. > > It should be possible creating a synthetical test that does the same (and > lives > in a loop, right?). Well, I tried that for a while and got bit lost (either code got vectorized by both gcc and clang or by neither). There are more issues where we have over 50% regression wrt clang build at gfx code, so I think I will first try to reproduce those locally and perf them to see if there is more pattern here. The releavant code is: uint32_t mozilla::gfx::{anonymous}::SpecularLightingSoftware::LightPixel (struct SpecularLightingSoftware * const this, const struct Point3D & aNormal, const struct Point3D & aVectorToLight, uint32_t aColor) { [local count: 118111600]: _48 = MEM[(const struct BasePoint3D *)aVectorToLight_25(D)].D.75826.D.75829.z; _49 = _48 + 1.0e+0; _50 = MEM[(const struct BasePoint3D *)aVectorToLight_25(D)].D.75826.D.75829.y; _51 = _50 + 0.0; _52 = MEM[(const struct BasePoint3D *)aVectorToLight_25(D)].D.75826.D.75829.x; _53 = _52 + 0.0; _80 = _53 * _53; _82 = _51 * _51; _83 = _80 + _82; _85 = _49 * _49; _86 = _83 + _85; if (_86 u>= 0.0) goto ; [99.95%] else goto ; [0.05%] [local count: 118052545]: _87 = .SQRT (_86); goto ; [100.00%] [local count: 59055]: _29 = __builtin_sqrtf (_86); [local count: 118111600]: # _30 = PHI <_29(4), _87(3)> _88 = _53 / _30; _89 = _51 / _30; _90 = _49 / _30; _41 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.x; _39 = _41 * _88; _37 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.y; _33 = _37 * _89; _27 = _33 + _39; _45 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.z; _46 = _45 * _90; _47 = _27 + _46; if (_47 >= 0.0) goto ; [59.00%] else goto ; [41.00%] With -Ofast it gets bit more streamlined: [local count: 118111600]: _48 = MEM[(const struct BasePoint3D *)aVectorToLight_25(D)].D.75826.D.75829.z; _49 = _48 + 1.0e+0; _50 = MEM[(const struct BasePoint3D *)aVectorToLight_25(D)].D.75826.D.75829.y; _51 = MEM[(const struct BasePoint3D *)aVectorToLight_25(D)].D.75826.D.75829.x; powmult_78 = _51 * _51; powmult_80 = _50 * _50; _81 = powmult_78 + powmult_80; powmult_83 = _49 * _49; _84 = _81 + powmult_83; _85 = __builtin_sqrtf (_84); _86 = _51 / _85; _87 = _50 / _85; _88 = _49 / _85; _41 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.x; _39 = _41 * _86; _37 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.y; _33 = _37 * _87; _27 = _33 + _39; _45 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.z; _46 = _45 * _88; _47 = _27 + _46; if (_47 >= 0.0) goto ; [59.00%] else goto ; [41.00%] But I do not quite see in the slp dump why this is not considered for vectorization. I attach the dump. Honza
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #3 from Martin Liška --- (In reply to hubicka from comment #2) > > Can you please attach a reduced test-case? > Do you know how to produce one with a reasonable effort? -E and remove not needed code. > The > declaratoins are quite convoluted, but the function is well isolated and > easy to inspect from full one... Do we speak about: https://github.com/mozilla/gecko-dev/blob/bd25b1ca76dd5d323ffc69557f6cf759ba76ba23/gfx/2d/FilterNodeSoftware.cpp#L3670-L3691 ? It should be possible creating a synthetical test that does the same (and lives in a loop, right?).
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 --- Comment #2 from hubicka at kam dot mff.cuni.cz --- > Can you please attach a reduced test-case? Do you know how to produce one with a reasonable effort? The declaratoins are quite convoluted, but the function is well isolated and easy to inspect from full one...
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 Martin Liška changed: What|Removed |Added Status|UNCONFIRMED |WAITING Ever confirmed|0 |1 CC||marxin at gcc dot gnu.org Last reconfirmed||2021-12-22 --- Comment #1 from Martin Liška --- Can you please attach a reduced test-case?
[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797 Andrew Pinski changed: What|Removed |Added Keywords||missed-optimization Severity|normal |enhancement