[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-11 Thread uros at kss-loka dot si


--- Comment #64 from uros at kss-loka dot si  2006-08-11 09:18 ---
Slightly off-topic, but to put some numbers to comment #8 and comment #11,
equivalent SSE code now reaches only 50% of x87 single performance and 60% of
x87 double performance on AMD x86_64:


ALGORITHM NB   REPS   TIME   MFLOPS
========= ==   ====   =====  =======

[float] -O2 -mfpmath=sse -march=k8:
atlasmm   60   1000   0.273 1582.66
[float] -O2 -mfpmath=387 -march=k8:
atlasmm   60   1000   0.138 3130.91

[double] -O2 -mfpmath=sse -march=k8:
atlasmm   60   1000   0.252 1714.54
[double] -O2 -mfpmath=387 -march=k8:
atlasmm   60   1000   0.152 2842.55

This effect was first observed in PR19780.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-11 Thread bonzini at gcc dot gnu dot org


--- Comment #65 from bonzini at gnu dot org  2006-08-11 13:26 ---
Subject: Bug 27827

Author: bonzini
Date: Fri Aug 11 13:25:58 2006
New Revision: 116082

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=116082
Log:
2006-08-11  Paolo Bonzini  [EMAIL PROTECTED]

PR target/27827
* config/i386/i386.md: Add peephole2 to avoid fld %st
instructions.

testsuite:
2006-08-11  Paolo Bonzini  [EMAIL PROTECTED]

PR target/27827
* gcc.target/i386/pr27827.c: New testcase.


Added:
branches/gcc-4_1-branch/gcc/testsuite/gcc.target/i386/pr27827.c
  - copied unchanged from r115969,
trunk/gcc/testsuite/gcc.target/i386/pr27827.c
Modified:
branches/gcc-4_1-branch/gcc/ChangeLog
branches/gcc-4_1-branch/gcc/config/i386/i386.md
branches/gcc-4_1-branch/gcc/testsuite/ChangeLog


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-10 Thread paolo dot bonzini at lu dot unisi dot ch


--- Comment #59 from paolo dot bonzini at lu dot unisi dot ch  2006-08-10 
06:52 ---
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> Thanks for the response, but I believe you are conflating two issues (as is
> this flag, which is why this is bad news).  Different answers to the question
> "what is this sum" do not ruin IEEE compliance.  I am referring to IEEE 754,
> which is a standard set of rules for storage and arithmetic for floating point
> (fp) on modern hardware.
You are also confusing -funsafe-math-optimizations with -ffast-math.  
The latter is a catch-all flag that compiles as if there were no 
FP traps, infinities, NaNs, and so on.  The former instead enables 
unsafe optimizations but not catastrophic optimizations -- if you 
consider meaningless results on badly conditioned matrices to not be 
catastrophic...

A more or less complete list of things enabled by 
-funsafe-math-optimizations includes:

Reassociation:
- reassociation of operations, not only for the vectorizer's sake but 
also in the unroller (see around line 1600 of loop-unroll.c)
- other simplifications like a/(b*c) for a/b/c
- expansion of pow (a, b) to multiplications if b is integer

Compile-time evaluation:
- doing more aggressive compile-time evaluation of floating-point 
expressions (e.g. cabs)
- less accurate modeling of overflow in compile-time expressions, for 
formats such as 106-bit mantissa long doubles

Math identities:
- expansion of cabs to sqrt (a*a + b*b)
- simplifications involving transcendental functions, e.g. exp (0.5*x) 
for sqrt (exp (x)), or x for tan(atan(x))
- moving terms to the other side of a comparison, e.g. a < 4 for a + 4 < 
8, or x > -1 for 1 - x < 2
- assuming in-domain arguments of sqrt, log, etc., e.g. x for 
sqrt(x)*sqrt(x)
- in turn, this enables removing math functions from comparisons, e.g. x 
< 4 for sqrt (x) < 2

Optimization:
- strength reduction of a/b to a*(1/b), both as loop invariants and in 
code like vector normalization
- eliminating recursion for accumulator-like functions, i.e. f (n) = n 
+ f(n-1)

Back-end operation:
- using x87 builtins for transcendental functions

There may be bugs, but in general these optimizations are safe for 
infinities and NaNs, but not for signed zeros or (as I said) for very 
badly conditioned data.
> I am unaware of there being any rules on compilation.
Rules are determined by the language standards.  I believe that C 
mandates no reassociation; Fortran allows reassociation unless explicit 
parentheses are present in the source, but this is not (yet) implemented 
by GCC.

Paolo


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-10 Thread whaley at cs dot utsa dot edu


--- Comment #60 from whaley at cs dot utsa dot edu  2006-08-10 14:08 ---
Paolo,

Thanks for the explanation of what -funsafe is presently doing.

> You are also confusing -funsafe-math-optimizations with -ffast-math.

No, what I'm doing is reading the man page (the closest thing to a contract
between gcc and me on what it is doing with my code):
|  -funsafe-math-optimizations
|  Allow optimizations for floating-point arithmetic that (a) assume
|  that arguments and results are valid and (b) may violate IEEE or
|  ANSI standards.

The (b) in this statement prevents me, as a library provider that *must* be
able to reassure my users that I have done nothing to violate IEEE fp standard
(don't get me wrong, there's plenty of violations of the standard that occur in
hardware, but typically in well-understood ways by the scientists of those
platforms, and in the less important parts of the standard), from using this
flag.  I can't even use it after verifying that no optimization has hurt the
present code, because an optimization that violates IEEE could be added at a
later date, or used on a system that I'm not testing on (eg., on some systems,
could cause 3DNow! vectorization).

> Rules are determined by the language standards.  I believe that C
> mandates no reassociation; Fortran allows reassociation unless explicit
> parentheses are present in the source, but this is not (yet) implemented
> by GCC.

My precise point.  There are *lots* of C rules that a fp guy could give a crap
about (for certain types of fp kernels), but IEEE is pretty much inviolate. 
Since this flag conflates language violations (don't care) with IEEE
(catastrophic) I can't use it.  I cannot stress enough just how important IEEE
is: it is the only contract that tells us what it means to do a flop, and gives
us any way of understanding what our answer will be.

Making vectorization depend on a flag that says it is allowed to violate IEEE
is therefore a killer for me (and most knowledgeable fp guys).  This is ironic,
since vectorization of sums (as in GEMM) is usually implemented as scalar
expansion on the accumulators, and this not only produces an IEEE-compliant
answer, but it is *more* accurate for almost all data.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-10 Thread paolo dot bonzini at lu dot unisi dot ch


--- Comment #61 from paolo dot bonzini at lu dot unisi dot ch  2006-08-10 
14:28 ---
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> Making vectorization depend on a flag that says it is allowed to violate IEEE
> is therefore a killer for me (and most knowledgeable fp guys).  This is ironic,
> since vectorization of sums (as in GEMM) is usually implemented as scalar
> expansion on the accumulators
In the case of GCC, it performs the transformation that Dorit explained.  It 
may not produce an IEEE-compliant answer if there are zeros and you 
expect to see a particular sign for the zero.
> and this not only produces an IEEE-compliant answer
The IEEE standard mandates particular rules for performing operations on 
infinities, NaNs, signed zeros, denormals, ...  The C standard, by 
mandating no reassociation, ensures that you don't mess with NaNs, 
infinities, and signed zeros.  As soon as you perform reassociation, 
there is *no way* you can be sure that you get IEEE-compliant math.

  +Inf + (1 / +0) = Inf, +Inf + (1 / -0) = NaN.
> but it is *more* accurate for almost all data.
http://citeseer.ist.psu.edu/589698.html is an example of a paper that 
shows FP code that avoids accuracy problems.  Any kind of reassociation 
will break that code, and lower its accuracy.  That's why reassociation 
is an unsafe math optimization.

If you want a -freassociate-fp math, open an enhancement PR and somebody 
might be more than happy to separate reassociation from the other 
effects of -funsafe-math-optimizations.

(Independent of this, you should also open a separate PR for ATLAS 
vectorization, because that would not be a regression and would not be 
on x87) :-)

Paolo


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-10 Thread whaley at cs dot utsa dot edu


--- Comment #62 from whaley at cs dot utsa dot edu  2006-08-10 15:15 ---
Paolo,

> The IEEE standard mandates particular rules for performing operations on
> infinities, NaNs, signed zeros, denormals, ...  The C standard, by
> mandating no reassociation, ensures that you don't mess with NaNs,
> infinities, and signed zeros.  As soon as you perform reassociation,
> there is *no way* you can be sure that you get IEEE-compliant math.

No, again this is a conflation of the issues.  You have IEEE-compliant math,
but the differing orderings provide different summations of those values.  It
is an ANSI/ISO C rule being violated, not an IEEE one.  Each individual operation is
IEEE, and therefore both results are IEEE-compliant, but since the C rule
requiring order has been broken, some codes will break.  However, they break
not because of a violation of IEEE, but because of a violation of ANSI/ISO C. 
I can certify whether my code can take this violation of ANSI/ISO C by
examining my code.  I cannot certify my code works w/o IEEE by examining it,
since that means a+b is now essentially undefined.

> http://citeseer.ist.psu.edu/589698.html is an example of a paper that
> shows FP code that avoids accuracy problems.  Any kind of reassociation
> will break that code, and lower its accuracy.  That's why reassociation
> is an unsafe math optimization.

Please note I never argued it was safe.  Violating the C usage rules is
always unsafe.  However, as explained above, I can certify my code for
reordering by examination, but nothing helps an IEEE violation.  My problem is
lumping in IEEE violations (such as 3dNow vectorization, or turning on non-IEEE
mode in SSE) with C violations.

> If you want a -freassociate-fp math, open an enhancement PR and somebody

Ah, you mean like I asked about in end of 2nd paragraph of Comment #56?

> might be more than happy to separate reassociation from the other
> effects of -funsafe-math-optimizations.

What I'm arguing for is not lumping in violations of ISO/ANSI C with IEEE
violations, but you are right that this would fix my particular case.  From
what I see, -funsafe ought to be redefined as violating ANSI/ISO alone, and not
mention IEEE at all.

> (Independent of this, you should also open a separate PR for ATLAS
> vectorization, because that would not be a regression and would not be
> on x87) :-)

You mean like I pleaded for in the last paragraph of Comment #38, but
reluctantly shoved in here because that's what people seemed to want? :)

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-10 Thread paolo dot bonzini at lu dot unisi dot ch


--- Comment #63 from paolo dot bonzini at lu dot unisi dot ch  2006-08-10 
15:22 ---
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


>> If you want a -freassociate-fp math, open an enhancement PR and somebody
>
> Ah, you mean like I asked about in end of 2nd paragraph of Comment #56?
>> (Independent of this, you should also open a separate PR for ATLAS
>> vectorization, because that would not be a regression and would not be
>> on x87) :-)
>
> You mean like I pleaded for in the last paragraph of Comment #38
Be bold.  Don't ask, just open PRs if you feel an issue is separate.  Go 
ahead now if you wish.  Having them closed or marked as duplicate is not 
a problem, and it is much easier to track than cluttering an existing PR.

All these issues with ATLAS will not be visible to somebody looking for 
bugs known to fail in 4.2.0, because the original problem is now 
fixed in that version, and will soon be in 4.1.1 too.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-09 Thread whaley at cs dot utsa dot edu


--- Comment #52 from whaley at cs dot utsa dot edu  2006-08-09 14:33 ---
Paolo,

> In some sense, this is the peephole I would rather *not* do.  But the answer
> is yes. :-)

Ahh, got it :)

> So, do you now agree that the bug would be fixed if the patch that is in GCC
> 4.2 was backported to GCC 4.1 (so that your users can use that)?

Well, much as I might like to deny it, yes I must agree the bug is fixed :)  I
think there might still be more performance to get, and initial timings show
that 4 may be slower than 3 on some systems.  However, it will also clearly be
faster than 3 on some (so far, most) systems, and so far, is competitive
everywhere, so not even I can call that a performance bug :)

And yes, getting it into the next gcc release would be very helpful for ATLAS.

> And do you still see the abysmal x87 single-precision FP performance?

No, the problems were the same for both precisions.  I haven't retimed all the
systems, but here's the numbers I do have for the benchmark:

                    DOUBLE           SINGLE
            PEAK    gcc3/gccS/gcc4   gcc3/gccS/gcc4
            =====   ==============   ==============
Pentium-D :  2800   2359/2417/2067   2685/2684/2362
Ath64-X2  :  5600   3681/4011/2102   3716/4256/2207
Opteron   :  3200   2590/2517/1507   2625/2800/1580
P4E       :  2800   1767/1754/1480   1914/1954/1609
PentiumIII:   500    239/238/225      407/393/283

As you can see, on the benchmark, the single precision numbers are better than
the double now.  I cannot get single precision to run at quite the impressive
93% of peak as double when exercising the code generator on the Ath64-X2, but
it gets a respectable 85% of peak (at these levels of performance, it takes
only very minor differences to drop from 93 to 85, so that's not that
unexpected: I am still investigating this).

Thanks for all the help,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-09 Thread whaley at cs dot utsa dot edu


--- Comment #53 from whaley at cs dot utsa dot edu  2006-08-09 15:52 ---
Created an attachment (id=12047)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=12047&action=view)
benchmark wt vectorizable kernel


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-09 Thread whaley at cs dot utsa dot edu


--- Comment #54 from whaley at cs dot utsa dot edu  2006-08-09 16:08 ---
Dorit,

OK, I've posted a new tarfile with a safe kernel code where the loop is not
unrolled, so that the vectorizer has a chance.  With this kernel, I can make it
vectorize code, but only if I throw the -funsafe-math-optimizations flag.  This
kernel doesn't use a lot of registers, so it should work for both x86-32 and
x86-64 archs.

I would expect for the vectorized code to beat the x87 in both precisions on
the P4E (vector SSE has two and four times the peak of x87 respectively), and
beat the x87 code in single on the Ath64 (twice the peak).  So far,
vectorization is never a win on the P4e, but I can make single win on Ath64. 
On both platforms, editing the assembly confirms that there are loops in there
that use the vector instructions.  Once I understand better what's going on,
maybe I can improve this . . .

Here's some questions I need to figure out:
(1) Why do I have to throw the -funsafe-math-optimizations flag to enable this?
   -- I see where the .vect file warns of it, but it refers to an SSA line,
  so I'm not sure what's going on.
   -- ATLAS cannot throw this flag, because it enables non-IEEE fp arithmetic,
  and ATLAS must maintain IEEE compliance.  SSE itself does *not* require
  ruining IEEE compliance.
   -- Let me know if there is some way in the code that I can avoid this prob
   -- If it cannot be avoided, is there a way to make this optimization
  controlled by a flag that does not mean a loss of IEEE compliance?
(2) Is there any pragma or assertion, etc, that I can put in the code to
notify the compiler that certain pointers point to 16-byte aligned data?
-- Only the output array (C) is possibly misaligned in ATLAS

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827




[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-09 Thread dorit at il dot ibm dot com


--- Comment #55 from dorit at il dot ibm dot com  2006-08-09 19:10 ---
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse x87 code
 on all platforms than gcc 3


> Here's some questions I need to figure out:
> (1) Why do I have to throw the -funsafe-math-optimizations flag to
> enable this?
>    -- I see where the .vect file warns of it, but it refers to an SSA line,
>       so I'm not sure what's going on.

This flag is needed in order to allow vectorization of reduction (summation
in your case) of floating-point data. This is because vectorization of
reduction changes the order of the computation, which may result in
different behavior (instead of summing this way:
((((((a0+a1)+a2)+a3)+a4)+a5)+a6)+a7, we sum this way:
(((a0+a2)+a4)+a6)+(((a1+a3)+a5)+a7)).

> (2) Is there any pragma or assertion, etc, that I can put in the code to
> notify the compiler that certain pointers point to 16-byte aligned
> data?
>     -- Only the output array (C) is possibly misaligned in ATLAS


Not really, I'm afraid - there is something that's not entirely supported
in gcc yet - see details in PR20794.

dorit

 Thanks,
 Clint


 --


 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-09 Thread whaley at cs dot utsa dot edu


--- Comment #56 from whaley at cs dot utsa dot edu  2006-08-09 21:33 ---
Dorit,

> This flag is needed in order to allow vectorization of reduction (summation
> in your case) of floating-point data.

OK, but this is a bad flag to require.  From the computational scientist's
point of view, there is a *vast* difference between reordering (which many
aggressive optimizations imply) and failing to have IEEE compliance.  Almost no
computational scientist will use non-IEEE code (because you have essentially no
idea if your answer is correct), but almost all will allow reordering.  So, it
is really important to separate the non-IEEE optimizations from the IEEE
compliant ones.

If vectorization requires me to throw a flag that says it causes non-IEEE
arithmetic, I can't use it, and neither can anyone other than, AFAIK, some
graphics guys.  IEEE is the contract between the user and the computer, that
bounds how much error there can be, and allows the programmer to know if a
given algorithm will produce a usable result.  Non-IEEE is therefore the
death-knell for having any theoretical or a priori understanding of accuracy. 
So, while reordering and non-IEEE may both seem unsafe, a reordering just gives
different results, which are still known to be within normal fp error, while
non-IEEE means there is no contract between the programmer at all, and indeed
the answer may be arbitrarily bad.  Further, behavior under exceptional
conditions is not maintained, and so the answer may actually be undetectably
nonsensical, not merely inaccurate.  Having an oddly colored pixel doesn't hurt
the graphics guy, but sending a satellite into the atmosphere, or registering
cancer in a clean MRI are rather more serious . . .  So, mixing the two
transformation types on one flag means that vectorization is unusable to what
must be the majority of its audience.  Maybe I should open this as another bug
report: "flag mixes normal and catastrophic optimizations"?

> Not really, I'm afraid - there is something that's not entirely supported
> in gcc yet - see details in PR20794

Hmm.  I'd tried the __attribute__, but I must have mistyped it, because it
didn't work before on pointers.  However, it just did in the MMBENCHV tarfile. 
However, the code still didn't use aligned load to access the vectors (using
multiple movlpd/movhpd instead) . . .  Even more scary, having the attribute
calls does not change the genned assembly at all.  Does the vectorization phase
get this alignment info passed to it?

Aligned loads can be as much as twice as fast as unaligned, and if you have to
choose amongst loops in the midst of a deep loop nest, these factors can
actually make vectorization a loser . . .

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827





[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-09 Thread pinskia at physics dot uc dot edu


--- Comment #57 from pinskia at physics dot uc dot edu  2006-08-09 21:46 
---
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse x87 code on all
platforms than gcc 3

 
 
 
> --- Comment #56 from whaley at cs dot utsa dot edu  2006-08-09 21:33 ---
> Dorit,
>
>> This flag is needed in order to allow vectorization of reduction (summation
>> in your case) of floating-point data.
>
> OK, but this is a bad flag to require.  From the computational scientist's
> point of view, there is a *vast* difference between reordering (which many
> aggressive optimizations imply) and failing to have IEEE compliance.  Almost no
> computational scientist will use non-IEEE code (because you have essentially no
> idea if your answer is correct), but almost all will allow reordering.  So, it
> is really important to separate the non-IEEE optimizations from the IEEE
> compliant ones.
Except for the fact IEEE-compliant fp does not allow for reordering at all
except in some small cases.  For example, (a + b) + (-a) is not the same as
(a + (-a)) + b, so reordering will invalidate IEEE fp for large a and small b.
Yes, maybe we should split out the option for unsafe math fp reordering, but
that is a different issue.

-- Pinski


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-09 Thread whaley at cs dot utsa dot edu


--- Comment #58 from whaley at cs dot utsa dot edu  2006-08-09 23:01 ---
Andrew,

> Except for the fact IEEE-compliant fp does not allow for reordering at all
> except in some small cases.  For example, (a + b) + (-a) is not the same as
> (a + (-a)) + b, so reordering will invalidate IEEE fp for large a and small b.
> Yes, maybe we should split out the option for unsafe math fp reordering, but
> that is a different issue.

Thanks for the response, but I believe you are conflating two issues (as is
this flag, which is why this is bad news).  Different answers to the question
"what is this sum" do not ruin IEEE compliance.  I am referring to IEEE 754,
which is a standard set of rules for storage and arithmetic for floating point
(fp) on modern hardware.  I am unaware of there being any rules on compilation.
I.e., whether re-orderings are allowed is beyond the standard.  It rather is a
set of rules that discusses for floating point operations (FLOPS) how rounding
must be done, how overflow/underflow must be handled, etc.  Perhaps there is
another IEEE standard concerning compilation that you are referring to?

Now of course, floating point arithmetic in general (and IEEE-compliant fp in
specific) is not associative, so indeed (a+b+c) != (c+b+a).  However, both
sequences are valid answers to "what are these 3 things summed up", and both
are IEEE compliant if each addition is compliant.

What non-IEEE means is that the individual flops are no longer IEEE compliant. 
This means that overflow may not be handled, or exceptional conditions may
cause unknown results (eg., divide by zero), and indeed we have no way at all
of knowing what an fp add even means.  An example of a non-IEEE optimization is
using 3DNow! vectorization, because 3DNow! does not follow the IEEE standard
(for instance, it handles overflow only by saturation, which violates the
standard).  SSE (unless you turn IEEE compliance off manually) is IEEE
compliant, and this is why you see computational guys like myself using it, and
not using 3DNow!.

To a computational scientist, non-IEEE is catastrophic, and "may change the
answer" is not.  "May change the answer" in this case simply means that I've
got a different ordering, which is also a valid IEEE fp answer, and indeed may
be a better answer than the original ordering (depending on the data; no way
to know this w/o looking at the data).  Non-IEEE means that I have no way of
knowing what kind of rounding was done, how a flop was done, if underflow (or
gradual overflow!) occurred, etc.  It is for this reason that optimizations
which are non-IEEE are a killer for computational scientists, and reorders are
no big deal.  In the first you have no idea what has happened with the data,
and in the second you have an IEEE-compliant answer, which has known
properties.

It has been my experience that most compiler people (and I have some experience
there, as I got my PhD in compilation) are more concerned with integer work,
and thus not experts on fp computation.  I've done fp computational work for
the majority of my research for the last decade, so I thought I might be able
to provide useful input to bridge the camps, so to speak.  In this case, I
think that by lumping "cause different IEEE-compliant answers" in with "use
non-IEEE arithmetic" you are preventing all serious fp users from utilizing the
optimizations.  Since vectorization is of great importance on modern machines,
this is bad news.  Obviously, I may be wrong in what I say, but if reordering
makes something non-IEEE I'm going to have some students mad at me for teaching
them the wrong stuff :)

Has this made my point any clearer, or do you still think I am wrong?  If I'm
wrong, maybe you can point to the part of the IEEE standard that discusses
orderings violating the standard (as opposed to the well-known fact that all
implemented fp arithmetic is non-associative)?  After you do this, I'll have
to dig up my copy of the thing, which I don't think I've seen in the last 2
years (but I did scope some of books that cover it, and didn't find anything
about compilation).

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-08 Thread hubicka at gcc dot gnu dot org


--- Comment #46 from hubicka at gcc dot gnu dot org  2006-08-08 06:15 
---
In x86/x86-64 world one can be almost sure that the load+execute instruction
pair will execute (marginally to noticeably) faster than move+load-and-execute
instruction pair as the more complex instructions are harder for on-chip
scheduling (they retire later).
Perhaps we can move such a transformation somewhere more generically perhaps to
post-reload copyprop?

Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827





[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-08 Thread hubicka at ucw dot cz


--- Comment #47 from hubicka at ucw dot cz  2006-08-08 06:28 ---
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse x87 code on all
platforms than gcc 3

> In x86/x86-64 world one can be almost sure that the load+execute instruction
> pair will execute (marginally to noticeably) faster than move+load-and-execute
> instruction pair as the more complex instructions are harder for on-chip
> scheduling (they retire later).
                ^^^ retirement filling up the scheduler easily.
> Perhaps we can move such a transformation somewhere more generically, perhaps
> to post-reload copyprop?
>
> Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-08 Thread paolo dot bonzini at lu dot unisi dot ch


--- Comment #48 from paolo dot bonzini at lu dot unisi dot ch  2006-08-08 
07:05 ---
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> In x86/x86-64 world one can be almost sure that the load+execute instruction
> pair will execute (marginally to noticeably) faster than move+load-and-execute
> instruction pair as the more complex instructions are harder for on-chip
> scheduling (they retire later).
Yes, so far so good and this part has already been committed.  But does 
a *single* load-and-execute instruction execute faster than the two 
instructions in a load+execute sequence?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-08 Thread whaley at cs dot utsa dot edu


--- Comment #49 from whaley at cs dot utsa dot edu  2006-08-08 16:43 ---
Paolo,

> Yes, so far so good and this part has already been committed.  But does
> a *single* load-and-execute instruction execute faster than the two
> instructions in a load+execute sequence?

As I said, in my hand-tuned SSE assembly experience, which is faster depends on
the architecture.  In particular, netburst or Core do well with the final
fmul[ls], and other archs do not.  My guess is that netburst and Core probably
crack this single instruction in two during decode, which allows the implicit
load to be advanced, but with less instruction load.  I think other
architectures do not split the inst during decode, which means that tomasulo's
cannot advance the load due to dependencies, which makes the separate
instructions faster, even in the face of the extra instruction.

If you can give me a patch that makes gcc call a new peephole opt getting rid
of the final mul[sl] only when a certain flag is thrown, I will see if I can't
post timings across a variety of architectures using both ways, so we can see
if my SSE experience is true for x87, and how strong the performance benefit
for various architectures.  This will allow us to evaluate how important
getting this choice is, what should be the default state, and how we should
vary it according to architecture.  My own theoretical guess is that if you
*have* to pick a behavior, surely separate instructions are better: on systems
with the cracking, this extra inst at worst eats up some mem and a bit of
decode bandwidth, which on most machines is not critical.  On the other hand,
having a non-advancable load is pretty bad news on systems w/o the cracking
ability.  The proposed timings could demonstrate the accuracy of this guess.

As I mentioned, and I *think* Jan echoed, for the case you have already fixed,
the peephole's way should be the default way, even at low optimization: there's
no extra instruction to this peephole, and it is better everywhere we've timed,
and I see no way in theory for the first sequence to be better.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-08 Thread whaley at cs dot utsa dot edu


--- Comment #50 from whaley at cs dot utsa dot edu  2006-08-08 18:36 ---
Guys,

I've been scoping this a little closer on the Athlon64X2.  I have found that
the patched gcc can achieve as much as 93% of theoretical peak (5218Mflop on a
2800Mhz Athlon64X2!) for in-cache matmul when the code generator is allowed to
go to town.  That at least ties the best I've ever seen for an x86 chip, and
what it means is that on this architecture, the x87 unit can be coaxed into
beating the SSE unit *even when the SSE instructions are fully vectorized* (for
double precision only, of course: vector single prec SSE has twice theoretical
peak of x87).  This also means that ATLAS should get a real speed boost when
the new gcc is released, and other fp packages have the potential to do so as
well.  So, with this motivation, I edited the genned assembly, and made the
following changes by hand in ~30 different places in the kernel assembly:

#ifdef FMULL
        fmull   1440(%rcx)
#else
        fldl    1440(%rcx)
        fmulp   %st,%st(1)
#endif

To my surprise, on this arch, using the fldl/fmulp pair caused a performance
drop.  So, either my SSE experience does not necessarily translate to x87, or
the Opteron (where I did the SSE tuning) is subtly different than the
Athlon64X2, or my memory of the tuning is faulty.  Just as a check, Paolo: is
this the peephole you would do?

Anyway, doing this by hand is too burdensome to make widespread timings
feasible, so if you'd like to see that, I'll need a gcc patch to do it
automatically . . .

Cheers,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-08 Thread paolo dot bonzini at lu dot unisi dot ch


--- Comment #51 from paolo dot bonzini at lu dot unisi dot ch  2006-08-09 
04:33 ---
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> I've been scoping this a little closer on the Athlon64X2.  I have found that
> the patched gcc can achieve as much as 93% of theoretical peak (5218Mflop on a
> 2800Mhz Athlon64X2!) for in-cache matmul when the code generator is allowed to
> go to town.
Not unexpected.  Code was so tightly tuned for GCC 3, and so big were 
the changes between GCC 3 and 4, that you were comparing sort of apples 
to oranges.  It could be interesting to see which different 
optimizations are performed by your code generator for GCC 3 vs. GCC 4.
> #ifdef FMULL
>         fmull   1440(%rcx)
> #else
>         fldl    1440(%rcx)
>         fmulp   %st,%st(1)
> #endif
>
> To my surprise, on this arch, using the fldl/fmulp pair caused a performance
> drop.  So, either my SSE experience does not necessarily translate to x87, or
> the Opteron (where I did the SSE tuning) is subtly different than the
> Athlon64X2, or my memory of the tuning is faulty.  Just as a check, Paolo: is
> this the peephole you would do?
In some sense, this is the peephole I would rather *not* do.  But the 
answer is yes. :-)

So, do you now agree that the bug would be fixed if the patch that is in 
GCC 4.2 was backported to GCC 4.1 (so that your users can use that)?

And do you still see the abysmal x87 single-precision FP performance?

Thanks!


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-07 Thread bonzini at gnu dot org


--- Comment #37 from bonzini at gnu dot org  2006-08-07 06:19 ---
I don't see how the last fmul[sl] can be removed without increasing code size. 
The only way to fix it would be to change the machine description to say that
this processor does not like FP operations with a memory operand.  With a
peephole, this is as good as we can get it.  The last fmul is not coupled with
a fld %st because it consumes the stack entry.  See in comment #30, where
there is still a fmull b.

Can you please try re-running the tests?  It takes skill^W^W seems quite weird
to have a 100x slow-down, also because my tests were run on a similar Prescott
(P4e).

It also would be interesting to re-run your code generator on a compiler built
from svn trunk.  If it can provide higher performance, you'd be satisfied I
guess even if it comes from a different kernel.  Also, I strongly believe that
you should implement vectorization, or at least find out *why* GCC does not
vectorize your code.  It may be simply that it does not have any guarantee on
the alignment.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-07 Thread whaley at cs dot utsa dot edu


--- Comment #38 from whaley at cs dot utsa dot edu  2006-08-07 15:32 ---
Paolo,

Thanks for all the help.  I'm not sure I understand everything perfectly
though, so there's some questions below . . .

> I don't see how the last fmul[sl] can be removed without increasing code size.

Since the flags are asking for performance, not size optimization, this should
only be an argument if the fmul[s,l]'s are performance-neutral.  A lot of
performance optimizations increase code size, after all . . .  Obviously, no
fmul[sl] is possible, since gcc 3 achieves it.  However, I can see that the
peephole phase might not be able to change the register usage.

> Can you please try re-running the tests?  It takes skill^W^W

Yes, I found the results confusing as well, which is why I reran them 50 times
before posting.  I also posted the tarfile (wt Makefile and assemblies) that
built them, so that my mistakes could be caught by someone with more skill. 
Just as a check, maybe you can confirm the .s you posted is the right one?  I
can't find the loads of the matrix C anywhere in its assembly, and I can find
them in the double version  . . .  Anyway, I like your suggestion (below) of
getting the compiler so we won't have to worry about assemblies, so that's
probably the way to go.  On this front, is there some reason you cannot post
the patch(es) as attachments, just to rule out copy problems, as I've asked in
last several messages?  Note there's no need if I can grab your stuff from SVN,
as below . . .

> because my tests were run on a similar Prescott (P4e)

You didn't post the gcc 3 performance numbers.  What were those like?  If
you beat/tied gcc 3, then the remaining fmul[l,s] are probably not a big
deal.  If gcc 3 is still winning, on the other hand . . .

> It also would be interesting to re-run your code generator on a compiler built
> from svn trunk.

Are your changes on a branch I could check out?  If so, give me the commands to
get that branch, as we are scoping assemblies only because of the patching
problem.  Having a full compiler would indeed enable more detailed
investigations, including loosing the full code generator on the improved
compiler.

> Also, I strongly believe that you should implement vectorization,

ATLAS implements vectorization, by writing the entire GEMM kernel in assembly
and directly using SSE.  However, there are cases where generated C code must
be called, and that's where gcc comes in . . .

> or at least find out *why* GCC does not vectorize your code. It may be simply
> that it does not have any guarantee on the alignment.

I'm all for this.  info gcc says that w/o a guarantee of alignment, loops are
duped, with an if selecting between vector and scalar loops, is this not
accurate?  I spent a day trying to get gcc to vectorize any of the generator's
loops, and did not succeed (can you make it vectorize the provided benchmark
code?).  I also tried various unrollings of the inner loop, particularly no
unrolling and unroll=2 (vector length).  I was unable to truly decipher the
warning messages explaining the lack of vectorization, and I would truly
welcome some help in fixing this.

This is a separate issue from the x87 code, and this tracker item is already
fairly complex :)  I'm assuming if I attempted to open a bug report of "gcc
will not vectorize atlas's generated code" it would be closed pretty quickly.
Maybe you can recommend how to approach this, or open another report that we
can exchange info on?  I would truly appreciate the opportunity to get some
feedback from gcc authors to help guide me to solving this problem.

Thanks for all the info,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-07 Thread whaley at cs dot utsa dot edu


--- Comment #39 from whaley at cs dot utsa dot edu  2006-08-07 16:47 ---
Paolo,

OK, never mind about all the questions on assembly/patches/SVN/gcc3 perf: I
checked out the main branch, and vi'd the patched file, and I see that your
patch is there.  I am presently building the SVN gcc on several machines, and
will be posting results/issues as they come in . . .

I would still be very interested in advice on approaching the vectorization
problem as discussed at the end of the mail.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-07 Thread paolo dot bonzini at lu dot unisi dot ch


--- Comment #40 from paolo dot bonzini at lu dot unisi dot ch  2006-08-07 
16:58 ---
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


>> I don't see how the last fmul[sl] can be removed without increasing code
>> size.
>
> However, I can see that the
> peephole phase might not be able to change the register usage.
Actually, the peephole phase may not change the register usage, but it 
could peruse a scratch register if available.  But it would be much more 
controversial (even if backed by your hard numbers on ATLAS) to state 
that splitting fmul[sl] to fld[sl]+fmul is always beneficial, unless 
there is some manual telling us exactly that... for example it would be 
a different story if it could give higher scheduling freedom (stuff like 
VectorPath vs. DirectPath on Athlons), and if we could figure out on 
which platforms it improves performance.
> On this front, is there some reason you cannot post
> the patch(es) as attachments, just to rule out copy problems, as I've asked in
> last several messages?  Note there's no need if I can grab your stuff from
> SVN, as below . . .
You already found about this :-P

Unfortunately I mistyped the PR number when I committed the patch; I 
meant the commit to appear in the audit trail, so that you'd have seen 
that I had committed it.
>> because my tests were run on a similar Prescott (P4e)
>
> You didn't post the gcc 3 performance numbers.  What were those like?  If
> you beat/tied gcc 3, then the remaining fmul[l,s] are probably not a big
> deal.  If gcc 3 is still winning, on the other hand . . .
I don't have GCC 3 on that machine.

Paolo


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-07 Thread whaley at cs dot utsa dot edu


--- Comment #41 from whaley at cs dot utsa dot edu  2006-08-07 17:19 ---
Paolo,

> Actually, the peephole phase may not change the register usage, but it
> could peruse a scratch register if available.  But it would be much more
> controversial (even if backed by your hard numbers on ATLAS) to state
> that splitting fmul[sl] to fld[sl]+fmul is always beneficial, unless

We'll have to see how this is in x87 code.  I have experience with it in SSE,
where doing it is fully a target issue.  For instance, the P4E likes you to
avoid the explicit load on the end, where the Hammer prefers the explicit load.
 If I recall right, there is a *slight* advantage on the intel to the from-mem
instruction, but I can't remember how much difference doing the separate
load/use made on the AMD.  We should get some idea by comparing gcc3 vs. your
patched compiler on the various platforms, though other gcc3/4 changes will
cloud the picture somewhat . . .

If this kind of machine difference in optimality holds true for x87 as well, I
assume a new peephole phase that looks for the scratch register could be called
if the appropriate -march were thrown?

Speaking of -march issues, when I get a compiler build that gens your new code,
I will pull the assembly trick to try it on the CoreDuo as well.  If the new
code is worse, you can probably not call your present peephole if that -march
is thrown?

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-07 Thread paolo dot bonzini at lu dot unisi dot ch


--- Comment #42 from paolo dot bonzini at lu dot unisi dot ch  2006-08-07 
18:19 ---
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> We should get some idea by comparing gcc3 vs. your
> patched compiler on the various platforms, though other gcc3/4 changes will
> cloud the picture somewhat . . .
That's why you should compare 4.2 before and after my patch, instead.
> If this kind of machine difference in optimality holds true for x87 as well, I
> assume a new peephole phase that looks for the scratch register could be
> called if the appropriate -march were thrown?
Or you can disable the fmul[sl] instructions altogether.
 Speaking of -march issues, when I get a compiler build that gens your new 
 code,
 I will pull the assembly trick to try it on the CoreDuo as well.  If the new
 code is worse, you can probably not call your present peephole if that -march
 is thrown?
   
I'd find it very strange.  It is more likely that the Core Duo has a 
more powerful scheduler (maybe the micro-op fusion thing?) that does not 
dislike fmul[sl].


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-07 Thread dorit at il dot ibm dot com


--- Comment #43 from dorit at il dot ibm dot com  2006-08-07 20:35 ---
> I'm all for this.  info gcc says that w/o a guarantee of alignment, loops are
> duped, with an if selecting between vector and scalar loops, is this not
> accurate?

yes

> I spent a day trying to get gcc to vectorize any of the generator's
> loops, and did not succeed (can you make it vectorize the provided benchmark
> code?).

The aggressive unrolling in the provided example seems to be the first obstacle
to vectorizing the code.

> I also tried various unrollings of the inner loop, particularly no
> unrolling and unroll=2 (vector length).  I was unable to truly decipher the
> warning messages explaining the lack of vectorization, and I would truly
> welcome some help in fixing this.

I'd be happy to help decipher the vectorizer's dump file. please send the
un-unrolled version and the dump file generated by -fdump-tree-vect-details,
and I'll see if I can help.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-07 Thread whaley at cs dot utsa dot edu


--- Comment #44 from whaley at cs dot utsa dot edu  2006-08-07 21:56 ---
Guys,

OK, the mystery of why my hand-patched gcc didn't work is now cleared up.  My
first clue was that neither did the SVN-build gcc!  Turns out, your peephole
opt is only done if I throw the flag -O3 rather than -O, which is what my
tarfile used.  Any reason it's done at only the high levels, since it makes
such a performance difference?

FYI, in gcc3 -O gets better performance than -O3, which is why that's my
default flags.  However, it appears that gcc4 gets very nice performance with
-O3.  It's fairly common for -O to give better performance than -O3, however
(since the ATLAS code is already aggressively optimized, gcc's max optimizations
often de-optimize already-optimal code), so turning this on at the default
level, or being able to turn it off and on manually, would be ideal . . .

> That's why you should compare 4.2 before and after my patch, instead.

Yeah, except 4.2 w/o your patch has horrible performance.  Our goal is not to
beat horrible performance, but rather to get good performance!  Gcc 3 provides
a measure of good performance.  However, I take your point that it'd be nice to
see the new stuff put a headlock on the crap performance, so I include that
below as well :)

Here's some initial data.  I report MFLOPS achieved by the kernel as compiled
by : gcc3 (usually gcc 3.2 or 3.4.3), gccS (current SVN gcc), and gcc4 (usually
gcc 4.1.1).  I will try to get more data later, but this is pretty suggestive,
IMHO.

                     DOUBLE          SINGLE
            PEAK  gcc3/gccS/gcc4  gcc3/gccS/gcc4
            ====  ==============  ==============
Pentium-D : 2800  2359/2417/2067  2685/2684/2362
Ath64-X2  : 5600  3677/3585/2102  3680/3914/2207
Opteron   : 3200  2590/2517/1507  2625/2800/1580

So, it appears to me we are seeing the same pattern I previously saw in my
hand-tuned SSE code: Intel likes the new pattern of doing the last load as part
of the FMUL instruction, but AMD is hampered by it.  Note that gccS is the best
compiler for both single & double on the Intel.  On both AMD machines, however,
it wins only for single, where the cost of the load is lower.  It loses to gcc3
for double, where load performance more completely determines matmul
performance.  This is consistent with the view that gcc 4 does some other
optimizations better than gcc 3, and so if we got the fldl removed, gcc 4 would
win for all precisions . . .

Don't get me wrong, your patch has already removed the emergency: in the worst
case so far you are less than 3% slower.  However, I suspect if we added the
optional (for amd chips only) peephole step to get rid of all possible
fmul[s,l], then we'd win for double, and win even more for single on AMD chips
. . .  So, any chance of an AMD-only or flag-controlled peephole step to get
rid of the last fmul[s,l]?

> Or you can disable the fmul[sl] instructions altogether.

As I mentioned, my own hand-tuning has indicated that the final fmul[sl] is
good for Intel netburst archs, but bad for AMD hammer archs.

I'll see about posting some vectorization data ASAP.  Can someone create a new
bug report so that the two threads of inquiry don't get mixed up, or do you
want to just intermix them here?

Thanks,
Clint

P.S.: I tried to run this on the Core by hand-translating gccS-genned assembly
to OS X assembly.  The double precision gccS runs at the same speed as apple's
gcc.  However, the single precision is an order of magnitude slower, as I
experienced this morning on the P4E.  This is almost certainly an error in my
makefile, but damned if I can find it.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-07 Thread whaley at cs dot utsa dot edu


--- Comment #45 from whaley at cs dot utsa dot edu  2006-08-08 02:59 ---
Guys,

OK, with Dorit's -fdump-tree-vect-details, I made a little progress on
vectorization.  In order to get vectorization to work, I had to add the flag
'-funsafe-math-optimizations'.  I will try to create a tarfile with everything
tomorrow so you guys can see all the output, but is it normal to need to throw
this to get vectorization?  SSE is IEEE compliant (unless you turn it off), and
ATLAS needs to stay IEEE, so I can't turn on unsafe-math-opt in general . . .

With these flags, gcc can vectorize the kernel if I do no unrolling at all.  I
have not yet run the full search on with these flags, but I've done quite a few
hand-called cases, and the performance is lower than either the x87 (best) or
scalar SSE for double on both the P4E and Ath64X2.  For single precision, there
is a modest speedup over the x87 code on both systems, but the total is *way*
below my assembly SSE kernels.

I just quickly glanced at the code, and I see that it never uses movapd from
memory, which is a key to getting decent performance.  ATLAS ensures that the
input matrices (A & B) are 16-byte aligned.  Is there any pragma/flag/etc I can
set that says pointer X points to data that is 16-byte aligned?

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827



[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

2006-08-06 Thread whaley at cs dot utsa dot edu


--- Comment #36 from whaley at cs dot utsa dot edu  2006-08-06 15:03 ---
Paolo,

Thanks for working on this.  We are making progress, but I have some mixed
results.  I timed the assemblies you provided directly.  I added a target
"asgexe" that builds the same benchmark, assuming assembly source instead of C,
to make this more reproducible.  I ran on the Athlon-64X2, where your new
assembly ran *faster* than gcc 3 for double precision.  However, you still lost
for single precision.  I believe the reason is that you still have more
fmuls/fmull (fmul from memory) than does gcc 3:

animal> fgrep -i fmuls smm_4.s | wc
    240     480    4051
animal> fgrep -i fmuls smm_asg.s | wc
     60     120    1020
animal> fgrep -i fmuls smm_3.s | wc
      0       0       0
animal> fgrep -i fmull dmm_4.s | wc
    100     200    1739
animal> fgrep -i fmull dmm_asg.s | wc
     20      40     360
animal> fgrep -i fmuls dmm_3.s | wc
      0       0       0


I haven't really scoped out the dmm diff, but in single prec anyway, these
dreaded fmuls are in the inner loop, and this is probably why you are still
losing.  I'm guessing your peephole is missing some cases, and for some reason
is missing more under single.  Any ideas?

As for your assembly actually beating gcc 3 for double, my guess is that it is
some other optimization that gcc 4 has, and you will beat by even more once the
final fmull are removed . . .

On the P4e, your double precision code is faster than stock gcc 4, but still
slower than gcc3.  again, I suspect the remaining fmull.  Then comes the thing
I cannot explain at all.  Your single precision results are horrible.  gcc 3
gets 1991MFLOPS, gcc 4 gets 1664, and the assembly you sent gets 34!  No chance
the mixed fld/fmuls is causing stack overflow, I guess?  I think this might
account for such a catastrophic drop  . . .  That's about the only WAG I've got
for this behavior.

Anyway, I think the first order of business may be to get your peephole to
grabbing all the cases, and see if that makes you win everywhere on Athlon, and
if it makes single precision P4e better, and we can go from there . . .

If you do that, attach the assemblies  again, and I'll redo timings.  Also, if
you could attach (not put in comment) the patch, it'd be nice to get the
compiler, so I could test x86-64 code on Athlon, etc.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827