On Oct 4, 2011, at 5:38 AM, IOhannes m zmoelnig wrote:
On 2011-10-04 09:06, katja wrote:
Yesterday I forgot to mention why it should definitely not be built
with -O0 (unless for debug purposes): PD_BIGORSMALL is defined as an
ah yes, this was indeed my fault.
since i don't feel comfortable with editing m_pd.h to get a different
build, i used CFLAGS="-DPD_FLOAT_PRECISION=64", which undid any
optimization flags (which by default are "-O6", which i find a bit
overdone; and "-g" is not set at all...)
the proper way is to use CPPFLAGS="-DPD_FLOAT_PRECISION=64", which
results in:
osc-delay-perftest with 400 instances:
debian           : 31%
original         : 29%
single           : 22%
single(O0)       : 64%
single(O2)       : 25%
single(O2+loop)  : 22%
single(pentium3) : 24%
single(pentium4) : 22%
single(prescott) : 22%
single(core2)    : 22%
single(core2+sse): 22%
double           : 25%
double(O0)       : 86%
double(O2)       : 27%
double(O2+loop)  : 26%
double(pentium3) : 25%
double(pentium4) : 24%
double(prescott) : 24%
double(core2)    : 24%
double(core2+sse): 25%
osc-delay-perftest with 1200 instances:
debian           : 94%
original         : 81%
single           : 65%
single(O0)       : ++%
single(O2)       : 72%
single(O2+loop)  : 66%
single(pentium3) : 70%
single(pentium4) : 66%
single(prescott) : 65%
single(core2)    : 59%
single(core2+sse): 64%
double           : 77%
double(O0)       : ++%
double(O2)       : 82%
double(O2+loop)  : 77%
double(pentium3) : 79%
double(pentium4) : 75%
double(prescott) : 75%
double(core2)    : 71%
double(core2+sse): 75%
which is more in line with katja's measurements.
this is (again) on an i5 650 @ 3.2GHz running in 32-bit mode
optimization flags (as far as they can be reconstructed :-))
debian: "-g -O2" (this is what is dictated by debian policy)
original: "-O6 -funroll-loops -fomit-frame-pointer" (seems to be the
default)
single/double: ->original
(O0): -O0
(O2): -g -O2
(O2+loop): -g -O2 -funroll-loops -fomit-frame-pointer
(prescott): ->original + "-march=prescott"
(core2): ->original + "-march=core2"
(core2+sse): ->original + "-march=core2 -mfpmath=sse -msse2"
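
For reference, the single and double builds differ only in a preprocessor define. A minimal sketch of how a header might select the sample type from PD_FLOAT_PRECISION, assuming single precision as the default. The names sketch_float and sketch_float_size are hypothetical, and the real m_pd.h may be organised differently:

```c
#include <stddef.h>

/* Assumption: default to single precision unless the build overrides it,
 * e.g. with CPPFLAGS="-DPD_FLOAT_PRECISION=64". */
#ifndef PD_FLOAT_PRECISION
#define PD_FLOAT_PRECISION 32
#endif

#if PD_FLOAT_PRECISION == 32
typedef float sketch_float;     /* hypothetical name, not taken from m_pd.h */
#elif PD_FLOAT_PRECISION == 64
typedef double sketch_float;
#else
#error "PD_FLOAT_PRECISION must be 32 or 64"
#endif

/* Size in bytes of one sample for the selected precision. */
static size_t sketch_float_size(void)
{
    return sizeof(sketch_float);
}
```

Since the define is consumed by the preprocessor, passing it through CPPFLAGS leaves whatever optimization flags are set in CFLAGS untouched.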
so it seems like the biggest performance boost is given (on the tested
platform) by compiling with "-g -O2 -funroll-loops
-fomit-frame-pointer" (which is cool because i think this can even make
it into debian, the way it is)
inline function (like it was already suggested by IOhannes a while
ago), but at -O0 nothing will be inlined. A benchmark howto would be
useful indeed.
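
To illustrate the point about inlining, here is a minimal sketch of a check in the spirit of PD_BIGORSMALL written as an inline function, assuming 32-bit IEEE 754 floats. The name sketch_bigorsmall32 and the use of memcpy are illustrative assumptions, not the actual definition in m_pd.h:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 when the exponent of f is very small
 * (including zero and denormals) or very large, and 0 otherwise.
 * It looks only at the two top exponent bits of a single-precision float. */
static inline int sketch_bigorsmall32(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* copy the bit pattern without aliasing issues */
    bits &= 0x60000000u;              /* keep the two top exponent bits */
    return (bits == 0u) | (bits == 0x60000000u);
}
```

With this sketch, 1.0f yields 0 while 0.0f, 1e-30f and 1e30f yield 1. At -O0 a compiler such as GCC will generally emit a real function call for each use instead of inlining it, which would be consistent with the much higher load in the (O0) rows.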
well, i usually just cram lots of the same object into a subpatch (until
i get approximately 80% in the slowest environment, in order to not max
out the CPU and get unknown side-effects), and measure it with the
built-in load-meter (for loads <100% it behaves quite the same as top).
nothing very dramatic.
Nice tests, thanks for that. I would be interested to see the effects
of auto-vectorization on these numbers. Have you tried that? If the
test patch doesn't include objects that have loops vectorized, it
won't make a difference.
.hc
----------------------------------------------------------------------------
If you are not part of the solution, you are part of the problem.