These are the results of benchmarking gcc optimizations compiling povray
(www.povray.org) using the benchmark.ini and the skyvase.pov from the unofficial
benchmarks pages.
Of course these timings aren't very accurate (I should have used a more time-consuming
render, but I liked this one). To make the results fairer, I ran each build again and
again (more than 20 times) and posted here the -fastest- of those timings (the best of
the 20 runs for each build).
I used the "time" command because I didn't like the precision of povray's own timing
(it shows only whole seconds, not milliseconds).
(Also read the part about branch probabilities: if there is a way to add this to gentoo,
then gentoo will run faster than WARP13 :+)
Commandline: "time nice -n -20 povray skyvase.pov" (using benchmark.ini)
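The best-of-20 approach described above can be scripted instead of repeated by hand. This is only a minimal sketch: the helper name fastest_run is made up for illustration, it measures wall-clock time only (the "real" line of "time", not user/sys), and the stand-in command would have to be replaced with the actual povray invocation.

```python
import subprocess
import sys
import time


def fastest_run(cmd, runs=20):
    """Run cmd `runs` times and return the shortest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the program's output so terminal I/O does not skew the timing.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; for the benchmark in this
    # post it would be something like ["povray", "skyvase.pov"].
    cmd = [sys.executable, "-c", "pass"]
    print(f"fastest of 20 runs: {fastest_run(cmd):.3f}s")
```

Taking the minimum rather than the average matches what was done here: the fastest run is the one least disturbed by whatever else the machine was doing.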
CFLAGS= -O3 -march=athlon-xp -fomit-frame-pointer
real 0m3.156s
user 0m2.996s
sys 0m0.161s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -fomit-frame-pointer
real 0m3.002s
user 0m2.846s
sys 0m0.157s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -finline-functions -fomit-frame-pointer <- -finline-functions is added by -O3
real 0m3.197s
user 0m3.039s
sys 0m0.158s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -fomit-frame-pointer <- also added by -O3; this is the fast one!
real 0m2.993s
user 0m2.834s
sys 0m0.159s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -mpreferred-stack-boundary=2 \ <- slower?
-fomit-frame-pointer
real 0m3.326s
user 0m3.158s
sys 0m0.168s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -mpreferred-stack-boundary=4 \ <- RTFM: this is the implied default
-fomit-frame-pointer
real 0m2.996s
user 0m2.834s
sys 0m0.162s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -mpreferred-stack-boundary=8 \ <- I already RTFM: slower, ok.
-fomit-frame-pointer
real 0m3.021s
user 0m2.860s
sys 0m0.162s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- I didn't add -mpreferred... because it's implied
-fomit-frame-pointer <- now with -malign-double: FASTER!
real 0m2.959s
user 0m2.802s
sys 0m0.158s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- almost the same as before; the new flag is implied
-m96bit-long-double -fomit-frame-pointer
real 0m2.982s
user 0m2.802s
sys 0m0.181s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- 128-bit long double: slower.
-m128bit-long-double -fomit-frame-pointer
real 0m3.018s
user 0m2.858s
sys 0m0.161s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- almost the same as without -mmmx, implied?
-mmmx -fomit-frame-pointer
real 0m2.969s
user 0m2.802s
sys 0m0.167s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- again, maybe implied?
-mmmx -msse -fomit-frame-pointer
real 0m2.965s
user 0m2.803s
sys 0m0.162s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- no noticeable effect yet,
-mmmx -msse -m3dnow -fomit-frame-pointer <- maybe implied?
real 0m2.962s
user 0m2.803s
sys 0m0.159s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- what happens without mmx?
-msse -m3dnow -fomit-frame-pointer <- nothing :+/
real 0m2.964s
user 0m2.802s
sys 0m0.162s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- and without sse?
-m3dnow -fomit-frame-pointer <- bah, nothing :+/
real 0m2.974s
user 0m2.805s
sys 0m0.169s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- I was reading the info...
-mno-push-args -fomit-frame-pointer <- and I found this... not much of a gain, and I don't like it :+P.
real 0m2.972s
user 0m2.804s
sys 0m0.168s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- this implies the last one; well, let's see what happens...
-maccumulate-outgoing-args -fomit-frame-pointer <- faster, but bigger code size (not a lot of space here)
real 0m2.969s
user 0m2.799s
sys 0m0.170s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- Huh, faster, huh.
-maccumulate-outgoing-args -mno-align-stringops \
-fomit-frame-pointer
real 0m2.948s
user 0m2.781s
sys 0m0.168s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- again I'm reading the info...
-maccumulate-outgoing-args -mno-align-stringops \ <- 17ms slower (user time). bah.
-minline-all-stringops -fomit-frame-pointer
real 0m2.968s
user 0m2.798s
sys 0m0.170s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- -fforce-mem is in -O2...
-maccumulate-outgoing-args -mno-align-stringops \ <- what about -fforce-addr?
-fforce-addr -fomit-frame-pointer <- mbu. slower.
real 0m3.132s
user 0m2.970s
sys 0m0.162s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- -fbranch-count-reg is enabled with -O2
-maccumulate-outgoing-args -mno-align-stringops \ <- what happens when disabling this?
-fno-branch-count-reg -fomit-frame-pointer <- uhm, it's enabled for a good reason (:+P)
real 0m2.958s
user 0m2.794s
sys 0m0.164s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- slow as hell.
-maccumulate-outgoing-args -mno-align-stringops \
-fmove-all-movables -freduce-all-givs -fomit-frame-pointer
real 0m3.198s
user 0m3.038s
sys 0m0.160s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- this one generates imprecise math code
-maccumulate-outgoing-args -mno-align-stringops \ <- but not so imprecise ;+P
-ffast-math -fomit-frame-pointer
real 0m3.043s
user 0m2.881s
sys 0m0.162s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- let's play with -mfpmath
-maccumulate-outgoing-args -mno-align-stringops \ <- sse: slower
-mfpmath=sse -fomit-frame-pointer
real 0m3.048s
user 0m2.890s
sys 0m0.158s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- 387:
-maccumulate-outgoing-args -mno-align-stringops \ <- mmm, better...
-mfpmath=387 -fomit-frame-pointer
real 0m2.952s
user 0m2.788s
sys 0m0.164s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- sse,387:
-maccumulate-outgoing-args -mno-align-stringops \ <- b00, slower...
-mfpmath=sse,387 -fomit-frame-pointer
real 0m3.104s
user 0m2.941s
sys 0m0.163s
*********************************************************************
************************branch probabilities*****************************
*********************************************************************
This is the end of the CFLAGS that gentoo can take. What follows works this way:
First you compile the program with -fprofile-arcs, then run it for a while. While you
do this the program runs slower than hell, but don't worry: it is recording branch
information next to your compiled code. Without this data GCC can only guess which way
each branch will go; with it, the instrumented program writes its branch flow to a .da
file (named after the .c/.o file being executed, so DON'T delete the directory with
the source code).
After running the -fprofile-arcs build, you recompile with -fbranch-probabilities. The
compiler then reads the branch data from the generated .da files and arranges the code
to favor the most commonly taken, and most time-consuming, paths. Just look at what
happens:
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- now, the real part: profiling.
-maccumulate-outgoing-args -mno-align-stringops \ <- first we compile with -fprofile-arcs
-mfpmath=387 -fprofile-arcs -fomit-frame-pointer <- (compile with -p and use gprof to see nice stats)
real 0m4.048s
user 0m3.882s
sys 0m0.166s
*********************************************************************
CFLAGS= -O2 -march=athlon-xp -frename-registers -malign-double \ <- now gcc is using the profiled data
-maccumulate-outgoing-args -mno-align-stringops \ <- what can be faster than this?? :+)
-mfpmath=387 -fbranch-probabilities -fomit-frame-pointer
real 0m2.900s
user 0m2.733s
sys 0m0.167s
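
To put the numbers in perspective, the relative gains can be worked out from the "real" timings posted above. The helper name percent_faster is made up for this sketch; the four values are copied from the -O3 run, the plain -O2 run, the fastest run without profiling, and the -fbranch-probabilities run.

```python
def percent_faster(baseline, candidate):
    """Percentage of wall-clock time saved by candidate relative to baseline."""
    return (baseline - candidate) / baseline * 100.0


# "real" timings in seconds, copied from the results in this post.
O3_BASELINE = 3.156      # -O3 -march=athlon-xp -fomit-frame-pointer
PLAIN_O2 = 3.002         # -O2 -march=athlon-xp -fomit-frame-pointer
BEST_FLAGS_ONLY = 2.948  # fastest run without profile data
PROFILED = 2.900         # rebuilt with -fbranch-probabilities

for label, seconds in [("plain -O2", PLAIN_O2),
                       ("best flags only", BEST_FLAGS_ONLY),
                       ("profiled build", PROFILED)]:
    print(f"{label}: {percent_faster(O3_BASELINE, seconds):.1f}% faster than -O3")
```

That works out to roughly 4.9%, 6.6% and 8.1% against -O3. With a render this short, differences of a few milliseconds between neighbouring runs are probably within the noise.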
*********************************************************************
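The two-pass workflow described in the branch probabilities section can be written down as three commands: build with -fprofile-arcs, run a typical workload, rebuild with -fbranch-probabilities. The sketch below only assembles those command lines. The helper pgo_steps and the file name demo.c are hypothetical, it assumes a single-file program compiled directly with gcc (povray itself goes through its own build system), and the shared flags are the ones from the two profiled runs above, with the option spelled -mfpmath as GCC expects.

```python
import shlex

# Flags shared by both passes (taken from the two profiled runs above).
BASE_CFLAGS = [
    "-O2", "-march=athlon-xp", "-frename-registers", "-malign-double",
    "-maccumulate-outgoing-args", "-mno-align-stringops",
    "-mfpmath=387", "-fomit-frame-pointer",
]


def pgo_steps(sources, output="prog", workload=None):
    """Return the three command lines of the two-pass profiling workflow."""
    # Default workload: just run the freshly built binary (an assumption;
    # in practice this should be a representative run of the program).
    workload = workload or ["./" + output]
    # Pass 1: instrumented build that records branch counts when it runs.
    instrumented = ["gcc", *BASE_CFLAGS, "-fprofile-arcs", *sources, "-o", output]
    # Pass 2: rebuild in the same directory so gcc can find the profile data.
    optimized = ["gcc", *BASE_CFLAGS, "-fbranch-probabilities", *sources, "-o", output]
    return [instrumented, workload, optimized]


if __name__ == "__main__":
    for step in pgo_steps(["demo.c"], output="demo"):
        print(shlex.join(step))
```

The important detail is that the second build happens in the same source directory as the first, since that is where the profile data files end up.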