Re: [gentoo-user] gcc optimizations

Javier Villavicencio Tue, 28 Oct 2003 14:22:41 -0800

These are the results of benchmarking gcc optimizations by compiling povray
(www.povray.org) with different CFLAGS and rendering skyvase.pov with the benchmark.ini
from the unofficial benchmark pages.
Of course the timings aren't very accurate (I should have used a more time-consuming
render, but I liked this one). To be fairer with the results, I ran each build again and
again (more than 20 times) and posted here the -fastest- of those timings for each build.
I used the "time" command because I didn't like the precision of povray's own timing
(it shows only seconds, not milliseconds).
(Also read the part about branch probabilities at the end; if there is a way to add this
to gentoo, then gentoo will run faster than WARP13 :+)


Commandline: "time nice -n -20 povray skyvase.pov" (using benchmark.ini)
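
The best-of-N procedure described above could be scripted roughly like this. It is only a
sketch, not the script used for these numbers: "best_of" is a made-up helper name, and it
takes its timestamps from "date +%s%N", which assumes GNU date (the results in this post
came from the "time" command instead). Spawning date around every run also adds a little
overhead to each measurement.

```shell
#!/bin/sh
# best_of N CMD [ARGS...]
# Runs CMD N times and prints the fastest wall-clock time in milliseconds.
# Hypothetical helper; assumes GNU date (%N = nanoseconds).
best_of() (
    runs=$1
    shift
    best=
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s%N)
        "$@" >/dev/null 2>&1
        end=$(date +%s%N)
        ms=$(( (end - start) / 1000000 ))
        # keep the smallest time seen so far
        if [ -z "$best" ] || [ "$ms" -lt "$best" ]; then
            best=$ms
        fi
        i=$((i + 1))
    done
    echo "$best"
)

# Example (a negative nice value needs root):
#   best_of 20 nice -n -20 povray skyvase.pov
```

Taking the minimum instead of the average discards the runs that were disturbed by other
processes.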

CFLAGS= -O3 -march=athlon-xp -fomit-frame-pointer
real    0m3.156s
user    0m2.996s
sys     0m0.161s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -fomit-frame-pointer
real    0m3.002s
user    0m2.846s
sys     0m0.157s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -finline-functions -fomit-frame-pointer   <- a flag that -O3 adds
real    0m3.197s
user    0m3.039s
sys     0m0.158s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -fomit-frame-pointer   <- a flag that -O3 adds! This is the fast one!
real    0m2.993s
user    0m2.834s
sys     0m0.159s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -mpreferred-stack-boundary=2 \ <- slower?
        -fomit-frame-pointer
real    0m3.326s
user    0m3.158s
sys     0m0.168s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -mpreferred-stack-boundary=4 \ <- RTFM: this is the implied default
        -fomit-frame-pointer
real    0m2.996s
user    0m2.834s
sys     0m0.162s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -mpreferred-stack-boundary=8 \ <- I already RTFM: slower, ok.
        -fomit-frame-pointer
real    0m3.021s
user    0m2.860s
sys     0m0.162s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \        <- I didn't add -mpreferred... because it's implied
        -fomit-frame-pointer                                            <- now with -malign-double: FASTER!
real    0m2.959s
user    0m2.802s
sys     0m0.158s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \        <- almost the same as before; new flag is implied
        -m96bit-long-double -fomit-frame-pointer
real    0m2.982s
user    0m2.802s
sys     0m0.181s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \        <- 128bit long double: slower.
        -m128bit-long-double -fomit-frame-pointer
real    0m3.018s
user    0m2.858s
sys     0m0.161s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \        <- almost the same as without -mmmx, implied?
        -mmmx -fomit-frame-pointer
real    0m2.969s
user    0m2.802s
sys     0m0.167s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \        <- again, maybe implied?
        -mmmx -msse -fomit-frame-pointer
real    0m2.965s
user    0m2.803s
sys     0m0.162s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \        <- no noticeable effect yet,
        -mmmx -msse -m3dnow -fomit-frame-pointer                        <- maybe implied?
real    0m2.962s
user    0m2.803s
sys     0m0.159s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- what happens without mmx?
        -msse -m3dnow -fomit-frame-pointer                              <- nothing :+/
real    0m2.964s
user    0m2.802s
sys     0m0.162s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- and without sse?
        -m3dnow -fomit-frame-pointer                                    <- bah, nothing :+/
real    0m2.974s
user    0m2.805s
sys     0m0.169s
*********************************************************************
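
On the "implied?" question from the last few runs: the GCC manual describes athlon-xp as
an Athlon with MMX, 3DNow!, enhanced 3DNow! and full SSE support, so -march=athlon-xp
should already turn those on, which would explain the near-identical timings. One way to
check this on a given compiler is to dump its predefined macros. The sketch below is not
taken from the runs above: "isa_macros" is a made-up filter name, -m32 is only needed on a
64-bit toolchain, and older gcc versions may not define all of these macros.

```shell
#!/bin/sh
# isa_macros: filter a "gcc -dM -E" macro dump down to the ISA-extension macros.
# Hypothetical helper name; __MMX__, __SSE__ and __3dNOW__ are the macros GCC
# defines when the corresponding instruction sets are enabled.
isa_macros() {
    grep -E '^#define __(MMX|SSE|3dNOW)__ '
}

# Example (-m32 only needed on a 64-bit toolchain):
#   gcc -m32 -march=athlon-xp -dM -E - </dev/null | isa_macros
```

If a macro shows up with only -march=athlon-xp on the command line, adding the matching
-m flag by hand should change nothing.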
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- I was reading the info pages...
        -mno-push-args -fomit-frame-pointer                             <- and I found this... it doesn't do much, and I don't like it :+P.
real    0m2.972s
user    0m2.804s
sys     0m0.168s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- this implies the last one; well, let's see what happens...
        -maccumulate-outgoing-args -fomit-frame-pointer                 <- faster, but bigger code size (not a lot of space here).
real    0m2.969s
user    0m2.799s
sys     0m0.170s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- Huh, faster, huh.
        -maccumulate-outgoing-args -mno-align-stringops \
        -fomit-frame-pointer
real    0m2.948s
user    0m2.781s
sys     0m0.168s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- again, I'm reading the info pages...
        -maccumulate-outgoing-args -mno-align-stringops \               <- 17ms slower (user time). bah.
        -minline-all-stringops -fomit-frame-pointer
real    0m2.968s
user    0m2.798s
sys     0m0.170s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- -fforce-mem in -O2...
        -maccumulate-outgoing-args -mno-align-stringops \               <- what about -fforce-addr?
        -fforce-addr -fomit-frame-pointer                               <- mbu. slower.
real    0m3.132s
user    0m2.970s
sys     0m0.162s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- -fbranch-count-reg is enabled with -O2
        -maccumulate-outgoing-args -mno-align-stringops \               <- what happens if I disable this?
        -fno-branch-count-reg -fomit-frame-pointer                      <- uhm, it's enabled for a good reason (:+P)
real    0m2.958s
user    0m2.794s
sys     0m0.164s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- slow as hell.
        -maccumulate-outgoing-args -mno-align-stringops \
        -fmove-all-movables -freduce-all-givs -freduce-all-givs -fomit-frame-pointer
real    0m3.198s
user    0m3.038s
sys     0m0.160s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- this one generates imprecise math code
        -maccumulate-outgoing-args -mno-align-stringops \               <- but not so imprecise ;+P
        -ffast-math -fomit-frame-pointer
real    0m3.043s
user    0m2.881s
sys     0m0.162s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- let's play with -mfpmath
        -maccumulate-outgoing-args -mno-align-stringops \               <- sse: slower
        -mfpmath=sse -fomit-frame-pointer
real    0m3.048s
user    0m2.890s
sys     0m0.158s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- 387:
        -maccumulate-outgoing-args -mno-align-stringops \               <- mmm, better...
        -mfpmath=387 -fomit-frame-pointer
real    0m2.952s
user    0m2.788s
sys     0m0.164s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- sse,387:
        -maccumulate-outgoing-args -mno-align-stringops \               <- b00, slower...
        -mfpmath=sse,387 -fomit-frame-pointer
real    0m3.104s
user    0m2.941s
sys     0m0.163s
*********************************************************************
************************branch probabilities*****************************
*********************************************************************
This is the end of the CFLAGS that gentoo can take; what follows works in a different way.
You first compile the program with -fprofile-arcs, then run it for a while. While you do
this the program runs slow as hell, but don't worry: it is recording information about
branch probabilities alongside your compiled code. (Without this, GCC can only guess
branch probabilities from heuristics; with it, the running program writes the branch flow
it actually took to a .da file with the same name as each .c/.o file, so DON'T delete the
directory with the source code.)
After compiling with -fprofile-arcs and running the program, you recompile it with
-fbranch-probabilities. The compiler reads the branch data from the generated .da files
and arranges the code to favor the most commonly taken, and most time-consuming, paths.
Just look at what happens in the last two results below:
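
As a sketch only, the two passes described above could be driven like this. "pgo_flags"
and BASE are made-up names, BASE is shortened to a few of the flags from this post, and
the make invocations in the comments assume a build system that honors CFLAGS passed on
the command line.

```shell
#!/bin/sh
# Flags shared by both passes (shortened; substitute your own).
BASE="-O2 -march=athlon-xp -frename-registers -fomit-frame-pointer"

# pgo_flags PASS: print the CFLAGS for the given pass.
#   profile -> instrumented build that writes the profile data when run
#   use     -> final build that reads the profile data back
pgo_flags() {
    case $1 in
        profile) echo "$BASE -fprofile-arcs" ;;
        use)     echo "$BASE -fbranch-probabilities" ;;
        *)       echo "unknown pass: $1" >&2; return 1 ;;
    esac
}

# Intended sequence (hypothetical make-based build):
#   make CFLAGS="$(pgo_flags profile)"   # pass 1: instrumented build
#   ./povray skyvase.pov                 # run a representative workload
#   (remove the object files, but keep the source tree and the profile data)
#   make CFLAGS="$(pgo_flags use)"       # pass 2: rebuild using the profile
```

The workload run between the two passes should resemble real use, since the second pass
optimizes for whatever branches that run actually took.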
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- now, the real part: profiling.
        -maccumulate-outgoing-args -mno-align-stringops \               <- first we compile with -fprofile-arcs
        -mfpmath=387 -fprofile-arcs -fomit-frame-pointer                <- (compile with -p and use gprof to see nice stats)
real    0m4.048s
user    0m3.882s
sys     0m0.166s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double  \       <- now gcc is using the profiled data
        -maccumulate-outgoing-args -mno-align-stringops \               <- what can be faster than this?? :+)
        -mfpmath=387 -fbranch-probabilities -fomit-frame-pointer
real    0m2.900s
user    0m2.733s
sys     0m0.167s
*********************************************************************
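
To put the last result in perspective, the relative gain over an earlier build is
(old - new) / old. A small sketch using awk for the floating-point arithmetic ("gain" is a
made-up helper name; the numbers are the "real" times posted above):

```shell
#!/bin/sh
# gain OLD NEW: print how much faster NEW is than OLD, as a percentage of OLD.
gain() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f%%\n", (old - new) / old * 100 }'
}

gain 3.156 2.900   # plain -O3 build vs. the profiled build
gain 3.002 2.900   # plain -O2 build vs. the profiled build
```

With the timings above this prints 8.1% and 3.4%.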
