[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2022-01-06 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Andrew Pinski  changed:

   What|Removed |Added

   Target Milestone|--- |12.0

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2022-01-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
Version|unknown |12.0
 Resolution|--- |FIXED

--- Comment #20 from Richard Biener  ---
The division is now vectorized, your short testcase produces

t:
.LFB0:
.cfi_startproc
movss   f(%rip), %xmm4
movss   test+8(%rip), %xmm3
movqtest(%rip), %xmm0
mulss   %xmm4, %xmm3
movaps  %xmm4, %xmm1
shufps  $0xe0, %xmm1, %xmm1
mulps   %xmm1, %xmm0
movhps  .LC0(%rip), %xmm1
rcpps   %xmm1, %xmm2
sqrtss  %xmm3, %xmm3
mulps   %xmm2, %xmm1
sqrtps  %xmm0, %xmm0
divss   %xmm4, %xmm3
mulps   %xmm2, %xmm1
addps   %xmm2, %xmm2
subps   %xmm1, %xmm2
mulps   %xmm2, %xmm0
movlps  %xmm0, test(%rip)
movss   %xmm3, test+8(%rip)
ret

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2022-01-03 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Jan Hubicka  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=88602

--- Comment #19 from Jan Hubicka  ---
It turns out that there is yeat another #ifdef __clang__ I missed in the gfx
library, so the vectorised code produced by clang is hand written in the
extensions discussed in PR88602.

Sorry for confusion. However I think the simplified testcase is still perfectly
vectorisable and we should add the div patterns as discussed.

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-24 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #18 from CVS Commits  ---
The master branch has been updated by Uros Bizjak :

https://gcc.gnu.org/g:8f921393e339090566c1589d81009caa954de90d

commit r12-6113-g8f921393e339090566c1589d81009caa954de90d
Author: Uros Bizjak 
Date:   Fri Dec 24 17:09:36 2021 +0100

i386: Add V2SFmode DIV insn pattern [PR95046, PR103797]

Use V4SFmode "DIVPS X,Y" with [y0, y1, 1.0f, 1.0f] as a divisor
to avoid division by zero.

2021-12-24  Uroš Bizjak  

gcc/ChangeLog:

PR target/95046
PR target/103797
* config/i386/mmx.md (divv2sf3): New instruction pattern.

gcc/testsuite/ChangeLog:

PR target/95046
PR target/103797
* gcc.target/i386/pr95046-1.c (test_div): Add.
(dg-options): Add -mno-recip.

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-23 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #17 from Uroš Bizjak  ---
(In reply to hubicka from comment #16)
> > > 
> > > It could be done, but I was under impression that the sequence to load 
> > > 1.0f
> > > into topmost elements nullifies the benefit of operation to divide two
> > 
> > Sure, so perhaps we should somewhat increase the vectorization cost of 
> > V2SFmode
> > division so that we would use it only if it is part of longer sequences?
> 
> I wonder how the hardware implements it.  If divps is of similar latency
> as divss then I guess it is essentially always win to load 1.0 to the
> upper part, since it is slow operation.  On the other hand if divps is
> about 4 times divss, then this may be harmful.
> 
> Agner Fog seems to be listing divss and divps with same latencies.
> For zen it is 10 cycles which should be enough to do the setup.

OK, I'll prepare and test a formal patch.

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-23 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #16 from hubicka at kam dot mff.cuni.cz ---
> > 
> > It could be done, but I was under impression that the sequence to load 1.0f
> > into topmost elements nullifies the benefit of operation to divide two
> 
> Sure, so perhaps we should somewhat increase the vectorization cost of 
> V2SFmode
> division so that we would use it only if it is part of longer sequences?

I wonder how the hardware implements it.  If divps is of similar latency
as divss then I guess it is essentially always win to load 1.0 to the
upper part, since it is slow operation.  On the other hand if divps is
about 4 times divss, then this may be harmful.

Agner Fog seems to be listing divss and divps with same latencies.
For zen it is 10 cycles which should be enough to do the setup.

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-23 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #15 from Jakub Jelinek  ---
(In reply to Uroš Bizjak from comment #12)
> (In reply to Jakub Jelinek from comment #10)
> > At least on your short testcase clang doesn't use divps either.
> > We do support mulv2sf3, addv2sf3 etc. but not divv2sf3 I bet because with
> > TARGET_MMX_WITH_SSE it would divide by zero in the 3rd and 4th elts,
> > but perhaps we could insert 1.0f, 1.0f into those elements of the divisor
> > before using divps?
> 
> It could be done, but I was under impression that the sequence to load 1.0f
> into topmost elements nullifies the benefit of operation to divide two

Sure, so perhaps we should somewhat increase the vectorization cost of V2SFmode
division so that we would use it only if it is part of longer sequences?

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-23 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #14 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #13)
> Created attachment 52051 [details]
> Patch that implements v2sf division

This patch also enables vectorization of the testcase from Comment #7. Using
-ffast-math, it also generates vectorized reciprocal:

movss   f(%rip), %xmm4
movss   test+8(%rip), %xmm3
movqtest(%rip), %xmm2
mulss   %xmm4, %xmm3
movaps  %xmm4, %xmm0
shufps  $0xe0, %xmm0, %xmm0
mulps   %xmm0, %xmm2
movhps  .LC0(%rip), %xmm0
--> rcpps   %xmm0, %xmm1
sqrtss  %xmm3, %xmm3
mulps   %xmm1, %xmm0
sqrtps  %xmm2, %xmm2
divss   %xmm4, %xmm3
movaps  %xmm2, %xmm5
mulps   %xmm1, %xmm0
addps   %xmm1, %xmm1
subps   %xmm0, %xmm1
mulps   %xmm1, %xmm5
movlps  %xmm5, test(%rip)
movss   %xmm3, test+8(%rip)
ret

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-23 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #13 from Uroš Bizjak  ---
Created attachment 52051
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52051=edit
Patch that implements v2sf division

Please try the attached patch, for the following testcase:

--cut here--
float a[2], b[2], r[2];

void bar (void)
{
  int i;

  for (i = 0; i < 2; i++)
r[i] = a[i] / b[i];
}
--cut here--

the compiler generates:

movqb(%rip), %xmm1
movqa(%rip), %xmm0
movhps  .LC0(%rip), %xmm1
divps   %xmm1, %xmm0
movlps  %xmm0, r(%rip)
ret

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-23 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #12 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #10)
> At least on your short testcase clang doesn't use divps either.
> We do support mulv2sf3, addv2sf3 etc. but not divv2sf3 I bet because with
> TARGET_MMX_WITH_SSE it would divide by zero in the 3rd and 4th elts,
> but perhaps we could insert 1.0f, 1.0f into those elements of the divisor
> before using divps?

It could be done, but I was under impression that the sequence to load 1.0f
into topmost elements nullifies the benefit of operation to divide two
elements.  However, if the missing pattern prevents longer vectorized chains,
this is not entirely true.

The division can be implemented in the same way as sse_cvtps2pi, but using
CONST1_RTX vector.

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #11 from Jan Hubicka  ---
Aha, I did not noticed that we need special patterns (I extecpted this is
problem to solve in machine independent code).  So I guess we have
 1) SLP should vectorize the 3 accesses with -ffast-math to only one vector
operation (as opposed to one vector+one scalar it does now)
 2) we could adddivv2sf3 pattern which initializes the elt 4 of the operand2 to
1.0f to avoid funny results
 3) we need to figure out why SLP vectorization is not even considered in the
original testcase (which I do not seem to be able to dig out with reasonable
effort in a way that it preserves original properties - to be vectorized by
clang and not vectorized by gcc)

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org,
   ||uros at gcc dot gnu.org

--- Comment #10 from Jakub Jelinek  ---
At least on your short testcase clang doesn't use divps either.
We do support mulv2sf3, addv2sf3 etc. but not divv2sf3 I bet because with
TARGET_MMX_WITH_SSE it would divide by zero in the 3rd and 4th elts,
but perhaps we could insert 1.0f, 1.0f into those elements of the divisor
before using divps?

Another question is if we could teach SLP to vectorize even factors not power
of 2, say loads/stores could be done (and with e.g. AVX512 almost everything)
could be done with masked loads/stores, most arithmetics could be done normally
and we'd just need to watch what values we'll get in the extra elts and make
sure it doesn't generate exceptions etc.

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #9 from hubicka at kam dot mff.cuni.cz ---
> recip pass happens after vectorization 
> I don't know/understand why though.
Yep, I suppose we want to either special case this in vectorizer or make
it earlier...  I also wonder why the code is vectorized for pairs of
values and third one is computed separately and why we don't use vectors
of length 4...

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #8 from Andrew Pinski  ---
(In reply to Jan Hubicka from comment #7)
> Having this however I do not see slp analyzing the divide in the original
> code at all.

recip pass happens after vectorization 
I don't know/understand why though.

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Jan Hubicka  changed:

   What|Removed |Added

 Status|WAITING |NEW

--- Comment #7 from Jan Hubicka  ---
OK, here is completely fake testcase that does similar operaitons:

#include 
struct test {float x; float y; float z;} test;
float f;
void
t()
{
  float x = test.x;
  float y = test.y;
  float z = test.z;

  x = x * f;
  y = y * f;
  z = z * f;
  x = sqrt (x);
  y = sqrt (y);
  z = sqrt (z);
  x = x / f;
  y = y / f;
  z = z / f;
  test.x=x;
  test.y=y;
  test.z=z;
}

We seem to fail to vectorize it with:

t.c:20:9: missed:   op not supported by target. 
t.c:17:5: missed:   not vectorized: relevant stmt not supported: x_15 = x_24 /
f.0_1;

clang seems to use divps happilly, so I am not sure why it is not supported.
Even more funny is that with -Ofast it is compiled into multiplication by
reciprocal:

t:
.LFB0:
.cfi_startproc
movss   f(%rip), %xmm4
movss   .LC0(%rip), %xmm2
movss   test(%rip), %xmm0
movss   test+4(%rip), %xmm3
divss   %xmm4, %xmm2
movss   test+8(%rip), %xmm1
mulss   %xmm4, %xmm0
mulss   %xmm4, %xmm3
mulss   %xmm4, %xmm1
sqrtss  %xmm0, %xmm0
sqrtss  %xmm3, %xmm3
sqrtss  %xmm1, %xmm1
mulss   %xmm2, %xmm0
mulss   %xmm2, %xmm3
mulss   %xmm2, %xmm1
unpcklps%xmm3, %xmm0
movlps  %xmm0, test(%rip)
movss   %xmm1, test+8(%rip)
ret


and rewriting it that way by hand:

#include 
struct test {float x; float y; float z;} test;
float f;
void
t()
{
  float x = test.x;
  float y = test.y;
  float z = test.z;
  float m = 1/f;

  x = x * f;
  y = y * f;
  z = z * f;
  x = sqrt (x);
  y = sqrt (y);
  z = sqrt (z);
  x = x * m;
  y = y * m;
  z = z * m;
  test.x=x;
  test.y=y;
  test.z=z;
}

gets the expected result:
t:
.LFB0:
.cfi_startproc
movss   f(%rip), %xmm0
movqtest(%rip), %xmm1
movaps  %xmm0, %xmm2
shufps  $0xe0, %xmm2, %xmm2
mulps   %xmm1, %xmm2
movss   .LC0(%rip), %xmm1
divss   %xmm0, %xmm1
mulss   test+8(%rip), %xmm0
sqrtps  %xmm2, %xmm2
sqrtss  %xmm0, %xmm0
movaps  %xmm1, %xmm3
shufps  $0xe0, %xmm3, %xmm3
mulss   %xmm0, %xmm1
mulps   %xmm3, %xmm2
movss   %xmm1, test+8(%rip)
movlps  %xmm2, test(%rip)
ret
.cfi_endproc

Having this however I do not see slp analyzing the divide in the original code
at all.

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread marxin at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #6 from Martin Liška  ---
You may try exporting GIMPLE IL that can be consumed with -fgimple.

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #5 from hubicka at kam dot mff.cuni.cz ---
Created attachment 52042
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52042=edit
b.slp1

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #4 from hubicka at kam dot mff.cuni.cz ---
> -E and remove not needed code.
> 
> > The
> > declaratoins are quite convoluted, but the function is well isolated and
> > easy to inspect from full one...
> 
> Do we speak about:
> https://github.com/mozilla/gecko-dev/blob/bd25b1ca76dd5d323ffc69557f6cf759ba76ba23/gfx/2d/FilterNodeSoftware.cpp#L3670-L3691
> ?
Yes.
> 
> It should be possible creating a synthetical test that does the same (and 
> lives
> in a loop, right?).

Well, I tried that for a while and got bit lost (either code got
vectorized by both gcc and clang or by neither).  There are more issues
where we have over 50% regression wrt clang build at gfx code, so I
think I will first try to reproduce those locally and perf them to see
if there is more pattern here.

The releavant code is:

uint32_t mozilla::gfx::{anonymous}::SpecularLightingSoftware::LightPixel
(struct SpecularLightingSoftware * const this, const struct Point3D & aNormal,
const struct Point3D & aVectorToLight, uint32_t aColor)
{

   [local count: 118111600]:
  _48 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.z;
  _49 = _48 + 1.0e+0;
  _50 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.y;
  _51 = _50 + 0.0;
  _52 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.x;
  _53 = _52 + 0.0;
  _80 = _53 * _53;
  _82 = _51 * _51;
  _83 = _80 + _82;
  _85 = _49 * _49;
  _86 = _83 + _85;
  if (_86 u>= 0.0)
goto ; [99.95%]
  else
goto ; [0.05%]

   [local count: 118052545]:
  _87 = .SQRT (_86);
  goto ; [100.00%]

   [local count: 59055]:
  _29 = __builtin_sqrtf (_86);

   [local count: 118111600]:
  # _30 = PHI <_29(4), _87(3)>
  _88 = _53 / _30;
  _89 = _51 / _30;
  _90 = _49 / _30;
  _41 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.x;
  _39 = _41 * _88;
  _37 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.y;
  _33 = _37 * _89;
  _27 = _33 + _39;
  _45 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.z;
  _46 = _45 * _90;
  _47 = _27 + _46;
  if (_47 >= 0.0)
goto ; [59.00%]
  else
goto ; [41.00%]


With -Ofast it gets bit more streamlined:


   [local count: 118111600]:
  _48 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.z;
  _49 = _48 + 1.0e+0;
  _50 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.y;
  _51 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.x;
  powmult_78 = _51 * _51;
  powmult_80 = _50 * _50;
  _81 = powmult_78 + powmult_80;
  powmult_83 = _49 * _49;
  _84 = _81 + powmult_83;
  _85 = __builtin_sqrtf (_84);
  _86 = _51 / _85;
  _87 = _50 / _85;
  _88 = _49 / _85;
  _41 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.x;
  _39 = _41 * _86;
  _37 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.y;
  _33 = _37 * _87;
  _27 = _33 + _39;
  _45 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.z;
  _46 = _45 * _88;
  _47 = _27 + _46;
  if (_47 >= 0.0)
goto ; [59.00%]
  else
goto ; [41.00%]

But I do not quite see in the slp dump why this is not considered for
vectorization.

I attach the dump.
Honza

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread marxin at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #3 from Martin Liška  ---
(In reply to hubicka from comment #2)
> > Can you please attach a reduced test-case?
> Do you know how to produce one with a reasonable effort?

-E and remove not needed code.

> The
> declaratoins are quite convoluted, but the function is well isolated and
> easy to inspect from full one...

Do we speak about:
https://github.com/mozilla/gecko-dev/blob/bd25b1ca76dd5d323ffc69557f6cf759ba76ba23/gfx/2d/FilterNodeSoftware.cpp#L3670-L3691
?

It should be possible creating a synthetical test that does the same (and lives
in a loop, right?).

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #2 from hubicka at kam dot mff.cuni.cz ---
> Can you please attach a reduced test-case?
Do you know how to produce one with a reasonable effort?  The
declaratoins are quite convoluted, but the function is well isolated and
easy to inspect from full one...

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread marxin at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Martin Liška  changed:

   What|Removed |Added

 Status|UNCONFIRMED |WAITING
 Ever confirmed|0   |1
 CC||marxin at gcc dot gnu.org
   Last reconfirmed||2021-12-22

--- Comment #1 from Martin Liška  ---
Can you please attach a reduced test-case?

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-21 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Andrew Pinski  changed:

   What|Removed |Added

   Keywords||missed-optimization
   Severity|normal  |enhancement