[Bug target/27855] [6/7/8 regression] reassociation causes the RA to be confused
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855 Aldy Hernandez changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #58 from Aldy Hernandez --- (In reply to rguent...@suse.de from comment #57) > > IMO, "manually optimized" doesn't seem to merit spending more time on this. > > As > > you mention, not real world enough :). May I suggest WONTFIX, but willing > > to > > reopen if someone cares enough to un-obfuscate the code? > > I'd say FIXED - that sounds nicer ;) My pleasure!
[Bug target/27855] [6/7/8 regression] reassociation causes the RA to be confused
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855 --- Comment #57 from rguenther at suse dot de --- On Mon, 5 Feb 2018, aldyh at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855 > > --- Comment #56 from Aldy Hernandez --- > (In reply to Richard Biener from comment #55) > > > Note the original report used -O and Aldhy used -O2 but we are talking > > about a benchmark and when you use -ffast-math you also use -O3. > > Thanks, I will keep a note of this for next time. FWIW, I used the same flags > as the last time this was reconfirmed, which was comment #52. I also saw a > bunch of runs at -O2 by Uros, which likely influenced me. > > > The benchmark is somewhat badly written (manually "optimized") so our > > vectorization attempts fail. > > > > Overall conclusion is I'm unsure if it's worth pursuing this bug further? > > There is a register pressure issue left but the testcase maybe not > > real-world enough? That is, I would usually recommend to first un-obfuscate > > the manually optimized code. > > IMO, "manually optimized" doesn't seem to merit spending more time on this. > As > you mention, not real world enough :). May I suggest WONTFIX, but willing to > reopen if someone cares enough to un-obfuscate the code? I'd say FIXED - that sounds nicer ;)
[Bug target/27855] [6/7/8 regression] reassociation causes the RA to be confused
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855 --- Comment #56 from Aldy Hernandez --- (In reply to Richard Biener from comment #55) > Note the original report used -O and Aldhy used -O2 but we are talking > about a benchmark and when you use -ffast-math you also use -O3. Thanks, I will keep a note of this for next time. FWIW, I used the same flags as the last time this was reconfirmed, which was comment #52. I also saw a bunch of runs at -O2 by Uros, which likely influenced me. > The benchmark is somewhat badly written (manually "optimized") so our > vectorization attempts fail. > > Overall conclusion is I'm unsure if it's worth pursuing this bug further? > There is a register pressure issue left but the testcase maybe not > real-world enough? That is, I would usually recommend to first un-obfuscate > the manually optimized code. IMO, "manually optimized" doesn't seem to merit spending more time on this. As you mention, not real world enough :). May I suggest WONTFIX, but willing to reopen if someone cares enough to un-obfuscate the code?
[Bug target/27855] [6/7/8 regression] reassociation causes the RA to be confused
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855 --- Comment #55 from Richard Biener --- I think the original report is about x87 math vs. SSE math. It's a bit hard to benchmark this through the releases given changes in tuning and vector feature sets (-march=native is out of the question). So I use -O3 -ffast-math -DREPS=10 -m32 as base and see ISA4.3.6 4.6.4 4.8.5 7.2 -mno-sse 1855 6930 4618 5623 -msse2 -mfpmath=sse 1967 6945 4744 6472 -m64 2977 6917 4935 6205 note I edited the benchmark and put noinline,noclone attributes on the gemm_atlas function. I benchmarked on a broadwell system with minimal CPU frequency boosting but still varying REPS varies the reported MFLOPS _a lot_ (but individual runs are somewhat stable, for the last reported number 6205 I also can get 6331 or 6186). I used the attached benchmark, the cited URL doesn't work anymore. So there's still an appearant regression, the trunk numbers aren't very different from the 7.2 results, the 4.6.4 variant is still fastest and we recovered to current levels with 4.9.4 already (just checked -m64 across all releases). With -march=native I get to new heights obviously because we use things like FMA, AVX, etc. if I add just -mavx to 4.6.4 it's not faster than without but 7.2 improves to 6628 for example (4.6.4 doesn't know AVX2 and -mfma results in bogus assembler being generated...). If I look at the generated code for -m64 (with just SSE) we no longer spill a lot in the inner loop (only once) and we don't vectorize. 4.6.4 manages to avoid any spilling in the computation (even in the outer loop). So the original analysis (RA sucks) still holds. Note the original report used -O and Aldhy used -O2 but we are talking about a benchmark and when you use -ffast-math you also use -O3. Note the biggest regression we still see is with x87 math - I think we can reasonably disregard that now. The benchmark is somewhat badly written (manually "optimized") so our vectorization attempts fail. Overall conclusion is I'm unsure if it's worth pursuing this bug further? There is a register pressure issue left but the testcase maybe not real-world enough? That is, I would usually recommend to first un-obfuscate the manually optimized code.
[Bug target/27855] [6/7/8 regression] reassociation causes the RA to be confused
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855 Aldy Hernandez changed: What|Removed |Added CC||aldyh at gcc dot gnu.org --- Comment #54 from Aldy Hernandez --- Results from comparing trunk and gcc-4_6-branch on (average of 7 runs): Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz 4.6.x -O2 -ffast-math -msse4.1 -DTYPE=double -fno-tree-reassoc atlasmm 60 1000 0.100 5415.4 -O2 -ffast-math -msse4.1 -DTYPE=double atlasmm 60 1000 0.082 4797.4 Performance penalty: 12.9% trunk: -O2 -ffast-math -msse4.1 -DTYPE=double -fno-tree-reassoc atlasmm 60 1000 0.105 5217.0 -O2 -ffast-math -msse4.1 -DTYPE=double atlasmm 60 1000 0.108 4773.0 Performance change: 9.3% It looks like the penalty for using fno-tree-reassoc now is even less than it was in 4.6.x. The original report was for a 30% drop. Has this been fixed, or am I missing something? (different flags for my 4.6 run? Is the 4.6 branch not a suitable representation of the 4.6.2 compiler against which this bug was originally filed against?)