[Bug target/91735] [9/10 Regression] Runtime regression for SPEC2000 177.mesa on Haswell around the end of August 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

Jakub Jelinek changed:
           What    |Removed |Added
----------------------------------------
  Target Milestone |9.3     |9.4

--- Comment #12 from Jakub Jelinek ---
GCC 9.3.0 has been released, adjusting target milestone.
Richard Biener changed:
           What    |Removed |Added
----------------------------------------
          Priority |P3      |P2
--- Comment #11 from Richard Biener ---
Created attachment 46880
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46880&action=edit
prototype

This improves code generation to use pextrw where possible, but that doesn't
make any measurable difference at runtime. Maybe the example loop isn't
representative, or the improvement isn't big enough.
--- Comment #10 from Richard Biener ---
I can't really decipher what clang does here. It seems to handle even and odd
lanes separately, doing 24 vpextrb stores per loop iteration. Possibly simply
an interleaving scheme...
--- Comment #9 from Richard Biener ---
(In reply to Richard Biener from comment #8)
> The most trivial improvement is likely to recognize the vector parts we can
> store via HImode. There's already support for that but only if we can
> uniformly use HImode and not a mix of sizes.

While for loads we need N "same" pieces to be able to build the CONSTRUCTOR,
for stores we can do arbitrary extracts. So the strided store code could be
refactored to decide on the access size in the main loop walking over the
actual elements to store, rather than computing it upfront (I sort-of copied
the handling from the strided load code, retaining this restriction). That
might get rid of a third of the pextracts.
--- Comment #8 from Richard Biener ---
The most trivial improvement is likely to recognize the vector parts we can
store via HImode. There's already support for that, but only if we can
uniformly use HImode and not a mix of sizes.
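The idea in comments #8/#9 can be sketched with intrinsics: where two of the
bytes in a strided-store group are adjacent in the vector, a single pextrw
(`_mm_extract_epi16`) can replace two pextrb extracts. This is a minimal
illustration, not GCC's actual strided-store code; the helper name is made up.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <string.h>

/* Store lanes 0..2 of v with two extracts instead of three:
   pextrw covers bytes 0-1 in one go, and byte 2 comes from the low half of a
   second pextrw. (With SSE4.1 available, byte 2 could be a single pextrb.) */
static void store_group3(uint8_t *dst, __m128i v)
{
    uint16_t b01 = (uint16_t)_mm_extract_epi16(v, 0); /* bytes 0 and 1 */
    uint16_t b23 = (uint16_t)_mm_extract_epi16(v, 1); /* bytes 2 and 3 */
    memcpy(dst, &b01, 2);       /* one 16-bit store instead of two 8-bit */
    dst[2] = (uint8_t)(b23 & 0xFF);
}
```

For a group of 3 chars this saves one extract per group, roughly matching the
"get rid of a third of the pextracts" estimate in comment #9.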
--- Comment #7 from rguenther at suse dot de ---
On Wed, 11 Sep 2019, jakub at gcc dot gnu.org wrote:
> --- Comment #4 from Jakub Jelinek ---
> The endless series of vpextrb look terrible, can't that be handled by
> possibly masked permutation?

Sure, just nobody implemented support for that in the strided store code
(likewise for strided loads). I'm also not sure it would really be faster in
the end. Maybe VPMULTISHIFTQB can also help.
--- Comment #6 from rguenther at suse dot de ---
On Wed, 11 Sep 2019, ubizjak at gmail dot com wrote:
> --- Comment #5 from Uroš Bizjak ---
> (In reply to Richard Biener from comment #3)
> > Reducing the VF here should be the goal. [...]
> MASKMOVDQU [1] should be an excellent fit here.

Yes, but it's probably slower. It does avoid store data races, of course, and
it avoids epilogue peeling (eventually).
--- Comment #5 from Uroš Bizjak ---
(In reply to Richard Biener from comment #3)
> Reducing the VF here should be the goal. For the particular case "filling"
> the holes with neutral data and blending in the original values at store
> time will likely be optimal. So do
>
>   tem = vector load
>   zero all [4] elements
>   compute
>   blend in 'tem' into the [4] elements
>   vector store

MASKMOVDQU [1] should be an excellent fit here.

[1] https://www.felixcloutier.com/x86/maskmovdqu
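For reference, MASKMOVDQU is exposed as the SSE2 intrinsic
`_mm_maskmoveu_si128`, which stores each byte of the data operand whose
corresponding mask byte has its high bit set. A minimal sketch of using it to
write only the first 3 bytes of each 4-byte pixel (the helper name is made
up). Note the instruction carries a non-temporal hint, which is one reason it
may well be slower in practice, as the follow-up comment suspects.

```c
#include <emmintrin.h>  /* SSE2: _mm_maskmoveu_si128 maps to maskmovdqu */
#include <stdint.h>

/* Store only the RGB bytes of each 4-byte pixel, leaving the fourth (alpha)
   byte in memory untouched. Mask bytes with the high bit set are written. */
static void masked_store_rgb(uint8_t *color, __m128i data)
{
    /* per pixel (little endian): 0xFF,0xFF,0xFF,0x00 -> write RGB, skip A */
    const __m128i mask = _mm_set1_epi32(0x00FFFFFF);
    _mm_maskmoveu_si128(data, mask, (char *)color);
}
```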
Jakub Jelinek changed:
           What    |Removed |Added
----------------------------------------
                CC |        |jakub at gcc dot gnu.org

--- Comment #4 from Jakub Jelinek ---
The endless series of vpextrb look terrible; can't that be handled by a
possibly masked permutation?
--- Comment #3 from Richard Biener ---
Reducing the VF here should be the goal. For the particular case, "filling"
the holes with neutral data and blending in the original values at store time
will likely be optimal. So do

  tem = vector load
  zero all [4] elements
  compute
  blend in 'tem' into the [4] elements
  vector store

eliding all the shuffling/striding. That should end up at a VF of 4 (SSE) or
8 (AVX). It doesn't fit very well into the current vectorizer architecture,
so currently we can only address this from the costing side. arm can probably
leverage load/store-lanes here. With char elements and an SLP size of 3 it's
probably the worst case we can think of.
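The load/blend/store scheme above can be sketched with SSE2 intrinsics. This
is a hypothetical illustration (helper name and lane layout are assumptions):
each pixel is 4 bytes, the computed values fill all lanes, and the original
[4] lane (here: the high byte of each 32-bit group) is blended back before a
full vector store.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Blend-at-store: take the first 3 bytes of each pixel from the computed
   vector and the 4th byte from the original memory contents, then do one
   full 16-byte store instead of 12 byte-wise extracts. */
static void blend_store_rgba(uint8_t *color, const uint8_t *computed)
{
    /* selects byte 3 of each 32-bit pixel (little endian) */
    const __m128i keep_a = _mm_set1_epi32((int)0xFF000000);
    __m128i tem = _mm_loadu_si128((const __m128i *)color);    /* vector load */
    __m128i val = _mm_loadu_si128((const __m128i *)computed); /* "compute"   */
    /* blend in 'tem' into the [4] elements */
    __m128i out = _mm_or_si128(_mm_andnot_si128(keep_a, val),
                               _mm_and_si128(keep_a, tem));
    _mm_storeu_si128((__m128i *)color, out);                  /* vector store */
}
```

With SSE4.1 the and/andnot/or sequence would collapse into a single
`_mm_blendv_epi8`; the SSE2 form is used here to stay within the baseline ISA.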
--- Comment #2 from Richard Biener ---
Errr, before we _don't_ vectorize.
Richard Biener changed:
           What      |Removed                  |Added
------------------------------------------------------------------------
            Target   |                         |x86_64-*-*
            Status   |UNCONFIRMED              |NEW
     Known to work   |                         |8.3.1
           Version   |unknown                  |9.1.0
          Keywords   |                         |missed-optimization
  Last reconfirmed   |                         |2019-09-11
                CC   |                         |rguenth at gcc dot gnu.org,
                     |                         |rsandifo at gcc dot gnu.org
            Blocks   |                         |53947, 26163
    Ever confirmed   |0                        |1
           Summary   |Runtime regression for   |[9/10 Regression] Runtime
                     |SPEC2000 177.mesa on     |regression for SPEC2000
                     |Haswell around the end   |177.mesa on Haswell around
                     |of August 2018           |the end of August 2018
  Target Milestone   |---                      |9.3

--- Comment #1 from Richard Biener ---
From the testers' data: last good r263752, first bad r263787. Bisecting
points to Richard's vectorizer series r26377[1-4], more specifically r263772.
Perf shows nothing conclusive; all functions are slower by the same
percentage. The SPEC 2000 build scripts are oddly redirecting and mangling
output, so -fopt-info output is illegible. Huh, or rather it's even in the
dumps when dumping with -optimized:

polygon.c:140:4: note: polygon.c:140:4: note: [... repeated ...]
polygon.c:140:4: note: loop vectorized using 32 byte vectors

Anyhow, differences are for example:

 fog.c:157:10: note: loop vectorized using 32 byte vectors
+fog.c:157:10: note: loop versioned for vectorization because of possible
aliasing

The above is

void gl_fog_color_vertices( GLcontext *ctx, GLuint n,
                            GLfloat v[][4], GLubyte color[][4] )
...
      case GL_EXP:
         d = -ctx->Fog.Density;
         for (i=0;i ...

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
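Since the quoted fog.c loop is truncated in the report, here is a hypothetical
reconstruction of the loop *shape* at issue, not the actual Mesa source: per
vertex, only the first 3 of 4 GLubyte channels are rewritten, i.e. an SLP
group of 3 chars inside a stride of 4, which is exactly the strided-store
pattern the comments discuss. The function name and blend formula are
illustrative assumptions.

```c
#include <stddef.h>

/* For each vertex, blend a fog color into the RGB channels while leaving
   the alpha channel (index 3) untouched: group size 3, stride 4 in chars. */
void fog_blend_colors(size_t n, const float fog[], unsigned char color[][4],
                      const unsigned char fogcolor[3])
{
    for (size_t i = 0; i < n; i++) {
        float f = fog[i];               /* per-vertex fog factor in [0,1] */
        for (int c = 0; c < 3; c++)     /* color[i][3] is never stored */
            color[i][c] = (unsigned char)(f * color[i][c]
                            + (1.0f - f) * fogcolor[c]);
    }
}
```

Vectorizing the char stores here forces either element-wise extracts
(the vpextrb series) or a masked/blended full-vector store, which is the
trade-off debated in comments #3 through #7.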