[Bug target/27855] [6/7/8 regression] reassociation causes the RA to be confused

2018-02-05 Thread aldyh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855

Aldy Hernandez  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #58 from Aldy Hernandez  ---
(In reply to rguent...@suse.de from comment #57)

> > IMO, "manually optimized" doesn't seem to merit spending more time on this. 
> >  As
> > you mention, not real world enough :).  May I suggest WONTFIX, but willing 
> > to
> > reopen if someone cares enough to un-obfuscate the code?
> 
> I'd say FIXED - that sounds nicer ;)

My pleasure!

[Bug target/27855] [6/7/8 regression] reassociation causes the RA to be confused

2018-02-05 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855

--- Comment #57 from rguenther at suse dot de  ---
On Mon, 5 Feb 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855
> 
> --- Comment #56 from Aldy Hernandez  ---
> (In reply to Richard Biener from comment #55)
> 
> > Note the original report used -O and Aldhy used -O2 but we are talking
> > about a benchmark and when you use -ffast-math you also use -O3.
> 
> Thanks, I will keep a note of this for next time.  FWIW, I used the same flags
> as the last time this was reconfirmed, which was comment #52.  I also saw a
> bunch of runs at -O2 by Uros, which likely influenced me.
> 
> > The benchmark is somewhat badly written (manually "optimized") so our
> > vectorization attempts fail.
> > 
> > Overall conclusion is I'm unsure if it's worth pursuing this bug further?
> > There is a register pressure issue left but the testcase maybe not
> > real-world enough?  That is, I would usually recommend to first un-obfuscate
> > the manually optimized code.
> 
> IMO, "manually optimized" doesn't seem to merit spending more time on this.  
> As
> you mention, not real world enough :).  May I suggest WONTFIX, but willing to
> reopen if someone cares enough to un-obfuscate the code?

I'd say FIXED - that sounds nicer ;)

[Bug target/27855] [6/7/8 regression] reassociation causes the RA to be confused

2018-02-05 Thread aldyh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855

--- Comment #56 from Aldy Hernandez  ---
(In reply to Richard Biener from comment #55)

> Note the original report used -O and Aldhy used -O2 but we are talking
> about a benchmark and when you use -ffast-math you also use -O3.

Thanks, I will keep a note of this for next time.  FWIW, I used the same flags
as the last time this was reconfirmed, which was comment #52.  I also saw a
bunch of runs at -O2 by Uros, which likely influenced me.

> The benchmark is somewhat badly written (manually "optimized") so our
> vectorization attempts fail.
> 
> Overall conclusion is I'm unsure if it's worth pursuing this bug further?
> There is a register pressure issue left but the testcase maybe not
> real-world enough?  That is, I would usually recommend to first un-obfuscate
> the manually optimized code.

IMO, "manually optimized" doesn't seem to merit spending more time on this.  As
you mention, not real world enough :).  May I suggest WONTFIX, but willing to
reopen if someone cares enough to un-obfuscate the code?

[Bug target/27855] [6/7/8 regression] reassociation causes the RA to be confused

2018-02-05 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855

--- Comment #55 from Richard Biener  ---
I think the original report is about x87 math vs. SSE math.  It's a bit hard to
benchmark this through the releases given changes in tuning and vector feature
sets (-march=native is out of the question).

So I use -O3 -ffast-math -DREPS=10 -m32 as base and see

  ISA4.3.6  4.6.4  4.8.5  7.2
-mno-sse  1855   6930   4618   5623
-msse2 -mfpmath=sse   1967   6945   4744   6472
-m64  2977   6917   4935   6205

note I edited the benchmark and put noinline,noclone attributes on the
gemm_atlas function.  I benchmarked on a broadwell system with minimal
CPU frequency boosting but still varying REPS varies the reported MFLOPS
_a lot_ (but individual runs are somewhat stable, for the last reported
number 6205 I also can get 6331 or 6186).  I used the attached
benchmark, the cited URL doesn't work anymore.

So there's still an appearant regression, the trunk numbers aren't very
different
from the 7.2 results, the 4.6.4 variant is still fastest and we recovered to
current levels with 4.9.4 already (just checked -m64 across all releases).

With -march=native I get to new heights obviously because we use things like
FMA, AVX, etc. if I add just -mavx to 4.6.4 it's not faster than without
but 7.2 improves to 6628 for example (4.6.4 doesn't know AVX2 and -mfma
results in bogus assembler being generated...).

If I look at the generated code for -m64 (with just SSE) we no longer
spill a lot in the inner loop (only once) and we don't vectorize.  4.6.4
manages to avoid any spilling in the computation (even in the outer loop).
So the original analysis (RA sucks) still holds.

Note the original report used -O and Aldhy used -O2 but we are talking
about a benchmark and when you use -ffast-math you also use -O3.

Note the biggest regression we still see is with x87 math - I think we
can reasonably disregard that now.

The benchmark is somewhat badly written (manually "optimized") so our
vectorization attempts fail.

Overall conclusion is I'm unsure if it's worth pursuing this bug further?
There is a register pressure issue left but the testcase maybe not
real-world enough?  That is, I would usually recommend to first un-obfuscate
the manually optimized code.

[Bug target/27855] [6/7/8 regression] reassociation causes the RA to be confused

2018-02-01 Thread aldyh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855

Aldy Hernandez  changed:

   What|Removed |Added

 CC||aldyh at gcc dot gnu.org

--- Comment #54 from Aldy Hernandez  ---
Results from comparing trunk and gcc-4_6-branch on (average of 7 runs):

Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz

4.6.x
-O2 -ffast-math -msse4.1 -DTYPE=double -fno-tree-reassoc
atlasmm   60   1000   0.100 5415.4

-O2 -ffast-math -msse4.1 -DTYPE=double
atlasmm   60   1000   0.082 4797.4

Performance penalty: 12.9%

trunk:
-O2 -ffast-math -msse4.1 -DTYPE=double -fno-tree-reassoc
atlasmm   60   1000   0.105 5217.0

-O2 -ffast-math -msse4.1 -DTYPE=double
atlasmm   60   1000   0.108 4773.0

Performance change: 9.3%

It looks like the penalty for using fno-tree-reassoc now is even less than it
was in 4.6.x.  The original report was for a 30% drop.

Has this been fixed, or am I missing something? (different flags for my 4.6
run?  Is the 4.6 branch not a suitable representation of the 4.6.2 compiler
against which this bug was originally filed against?)