[Bug target/91735] [9/10 Regression] Runtime regression for SPEC2000 177.mesa on Haswell around the end of August 2018

2020-03-12 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

Jakub Jelinek  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.3                         |9.4

--- Comment #12 from Jakub Jelinek  ---
GCC 9.3.0 has been released, adjusting target milestone.

2020-01-17 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2

2019-09-13 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

--- Comment #11 from Richard Biener  ---
Created attachment 46880
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46880&action=edit
prototype

This improves code generation to use pextrw where possible, but that makes no
measurable difference in runtime.  Maybe the example loop isn't representative
or the improvement isn't big enough.

2019-09-11 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

--- Comment #10 from Richard Biener  ---
Can't really decipher what clang does here.  It seems to handle even/odd
lanes separately, doing 24 vpextrb stores per loop iteration.  Possibly
simply an interleaving scheme...

2019-09-11 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

--- Comment #9 from Richard Biener  ---
(In reply to Richard Biener from comment #8)
> The most trivial improvement is likely to recognize the vector parts we can
> store via HImode.  There's already support for that but only if we can
> uniformly use HImode and not a mix of sizes.

While for loads we need N "same" pieces to be able to build the CONSTRUCTOR,
for stores we can do arbitrary extracts.  So the strided store code could
be refactored to decide on the piece size in the main loop walking over the
actual elements to store, rather than computing it upfront (I sort-of copied
the handling from the strided load code, retaining this restriction).  That
might get rid of 1/3 of the pextracts.
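
A minimal sketch of why mixing sizes could save about a third of the
extracts.  This is plain C that only models the idea; it is not the
vectorizer code.  It assumes a 16-byte vector holding packed 3-byte groups
that are stored at a stride of 4 bytes, and every name in it (VEC_BYTES,
GROUP, STRIDE, store_groups_bytewise, store_groups_mixed) is hypothetical.
A 2-byte copy stands in for a pextrw-style HImode extract, which is assumed
to need an even byte offset within the vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: packed 3-byte groups, stored 4 bytes apart. */
enum { VEC_BYTES = 16, GROUP = 3, STRIDE = 4 };

/* Baseline: one byte-sized extract per element (models vpextrb).
   Returns the number of extracts performed. */
static unsigned store_groups_bytewise(const uint8_t vec[VEC_BYTES],
                                      uint8_t *dst)
{
    unsigned extracts = 0;
    assert(vec && dst);
    for (size_t g = 0; (g + 1) * GROUP <= VEC_BYTES; g++)
        for (size_t k = 0; k < GROUP; k++) {
            dst[g * STRIDE + k] = vec[g * GROUP + k];
            extracts++;
        }
    return extracts;
}

/* Mixed sizes: pick the piece size while walking the elements.
   A 2-byte piece is used whenever the source offset is even and the
   piece stays inside the current group; otherwise fall back to 1 byte. */
static unsigned store_groups_mixed(const uint8_t vec[VEC_BYTES],
                                   uint8_t *dst)
{
    unsigned extracts = 0;
    assert(vec && dst);
    for (size_t g = 0; (g + 1) * GROUP <= VEC_BYTES; g++) {
        size_t src = g * GROUP;
        size_t end = src + GROUP;
        size_t out = g * STRIDE;
        while (src < end) {
            if (src % 2 == 0 && src + 2 <= end) {
                memcpy(dst + out, vec + src, 2);  /* models pextrw */
                src += 2;
                out += 2;
            } else {
                dst[out++] = vec[src++];          /* models pextrb */
            }
            extracts++;
        }
    }
    return extracts;
}
```

Under these assumptions every 3-byte group contains exactly one aligned
2-byte piece, so each group needs two extracts instead of three.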

2019-09-11 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

--- Comment #8 from Richard Biener  ---
The most trivial improvement is likely to recognize the vector parts we can
store via HImode.  There's already support for that but only if we can
uniformly use HImode and not a mix of sizes.

2019-09-11 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

--- Comment #7 from rguenther at suse dot de  ---
On Wed, 11 Sep 2019, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735
> 
> Jakub Jelinek  changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |jakub at gcc dot gnu.org
> 
> --- Comment #4 from Jakub Jelinek  ---
> The endless series of vpextrb looks terrible; can't that be handled by a
> (possibly masked) permutation?

Sure, it's just that nobody has implemented support for that in the strided
store code (likewise for strided loads).  I'm also not sure it is
really faster in the end.  Maybe VPMULTISHIFTQB can also help.

2019-09-11 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

--- Comment #6 from rguenther at suse dot de  ---
On Wed, 11 Sep 2019, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735
> 
> --- Comment #5 from Uroš Bizjak  ---
> (In reply to Richard Biener from comment #3)
> > Reducing the VF here should be the goal.  For the particular case "filling"
> > the holes with neutral data and blending in the original values at store time
> > will likely be optimal.  So do
> > 
> >   tem = vector load
> >   zero all [4] elements
> >   compute
> >   blend in 'tem' into the [4] elements
> >   vector store
> 
> MASKMOVDQU [1] should be an excellent fit here.

Yes, but it's probably slower.  It does avoid store data races, of course,
and it also (eventually) avoids epilogue peeling.

2019-09-11 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

--- Comment #5 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #3)
> Reducing the VF here should be the goal.  For the particular case "filling"
> the holes with neutral data and blending in the original values at store time
> will likely be optimal.  So do
> 
>   tem = vector load
>   zero all [4] elements
>   compute
>   blend in 'tem' into the [4] elements
>   vector store

MASKMOVDQU [1] should be an excellent fit here.

[1] https://www.felixcloutier.com/x86/maskmovdqu

2019-09-11 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

Jakub Jelinek  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #4 from Jakub Jelinek  ---
The endless series of vpextrb looks terrible; can't that be handled by a
(possibly masked) permutation?

2019-09-11 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

--- Comment #3 from Richard Biener  ---
Reducing the VF here should be the goal.  For the particular case "filling"
the holes with neutral data and blending in the original values at store time
will likely be optimal.  So do

  tem = vector load
  zero all [4] elements
  compute
  blend in 'tem' into the [4] elements
  vector store

eliding all the shuffling/striding.  Should end up at a VF of 4 (SSE) or 8
(AVX).
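
The load/zero/compute/blend/store steps can be modelled in plain C as a
rough sketch.  It assumes 16-byte "vectors" of byte elements in groups of
four, reads "the [4] elements" as the fourth channel (index 3) of each
group, and uses a placeholder computation; LANES, CHANNELS, KEEP, halve and
update_rgb_blend are all hypothetical names, not anything from mesa or GCC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: 16 byte lanes, 4 channels per element,
   channel index 3 must keep its original value. */
enum { LANES = 16, CHANNELS = 4, KEEP = 3 };

/* Placeholder for the real per-channel computation (an assumption). */
static uint8_t halve(uint8_t x) { return (uint8_t)(x / 2); }

/* Update channels 0..2 of every element in place, leaving channel 3
   untouched, by working on whole contiguous "vectors". */
static void update_rgb_blend(uint8_t *color, size_t n_bytes)
{
    assert(n_bytes % LANES == 0);  /* no epilogue handling in this sketch */
    for (size_t i = 0; i < n_bytes; i += LANES) {
        uint8_t tem[LANES], work[LANES];

        memcpy(tem, color + i, LANES);           /* tem = vector load */
        memcpy(work, tem, LANES);

        for (size_t l = 0; l < LANES; l++)       /* zero the kept lanes */
            if (l % CHANNELS == KEEP)
                work[l] = 0;

        for (size_t l = 0; l < LANES; l++)       /* compute on all lanes */
            work[l] = halve(work[l]);

        for (size_t l = 0; l < LANES; l++)       /* blend 'tem' back in */
            if (l % CHANNELS == KEEP)
                work[l] = tem[l];

        memcpy(color + i, work, LANES);          /* vector store */
    }
}
```

No element is shuffled or extracted here: every load and store is
contiguous, at the price of also writing back the lanes that were not
meant to change.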

The blend approach doesn't fit very well into the current vectorizer
architecture, so currently we can only address this from the costing side.

Arm can probably leverage load/store-lanes here.

With char elements and an SLP size of 3 it's probably the worst case we can
think of.

2019-09-11 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

--- Comment #2 from Richard Biener  ---
Errr, before we _don't_ vectorize.

2019-09-11 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
             Status|UNCONFIRMED                 |NEW
      Known to work|                            |8.3.1
            Version|unknown                     |9.1.0
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2019-09-11
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org
             Blocks|                            |53947, 26163
     Ever confirmed|0                           |1
            Summary|Runtime regression for      |[9/10 Regression] Runtime
                   |SPEC2000 177.mesa on        |regression for SPEC2000
                   |Haswell around the end of   |177.mesa on Haswell around
                   |August 2018                 |the end of August 2018
   Target Milestone|---                         |9.3

--- Comment #1 from Richard Biener  ---
From the tester's data: last good r263752, first bad r263787.

Bisecting points to Richard's vectorizer series r26377[1-4], more specifically
r263772.  Perf shows nothing conclusive, but all functions are slower by the
same percentage.

SPEC 2000 build scripts are oddly redirecting and mangling output so -fopt-info
output is illegible.  Huh, or rather the mangling is even in the dumps when
dumping with -optimized:

polygon.c:140:4: note: polygon.c:140:4: note:  polygon.c:140:4: note:  
polygon.c:140:4: note:  polygon.c:140:4: note:  polygon.c:140:4: note: 
polygon.c:140:4: note:  polygon.c:140:4: note:  polygon.c:140:4: note: 
polygon.c:140:4: note:  polygon.c:140:4: note:  polygon.c:140:4: note: 
polygon.c:140:4: note:  polygon.c:140:4: note:  polygon.c:140:4: note: 
polygon.c:140:4: note:  polygon.c:140:4: note:  polygon.c:140:4: note:  
polygon.c:140:4: note:   polygon.c:140:4: note:   polygon.c:140:4: note:  
polygon.c:140:4: note:   polygon.c:140:4: note:   polygon.c:140:4: note:  
polygon.c:140:4: note:   polygon.c:140:4: note:   polygon.c:140:4: note:  
polygon.c:140:4: note:   polygon.c:140:4: note:   polygon.c:140:4: note: loop
vectorized using 32 byte vectors

Anyhow, the differences are, for example:

fog.c:157:10: note: loop vectorized using 32 byte vectors
+fog.c:157:10: note: fog.c:157:10: note:  loop versioned for vectorization
because of possible aliasing

the above is

void gl_fog_color_vertices( GLcontext *ctx,
GLuint n, GLfloat v[][4], GLubyte color[][4] )
...

  case GL_EXP:
 d = -ctx->Fog.Density;
 for (i=0;i
...

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations