On Oct 4, 2011, at 5:38 AM, IOhannes m zmoelnig wrote:

On 2011-10-04 09:06, katja wrote:

Yesterday I forgot to mention why it should definitely not be built
with -O0 (unless for debug purposes): PD_BIGORSMALL is defined an

ah yes, this was indeed my fault.
since i don't feel comfortable with editing m_pd.h to get a different
build, i used CFLAGS="-DPD_FLOAT_PRECISION=64", which undid any
optimization flags (which by default are "-O6", which i find a bit
overdone; and "-g" is not set at all...)

the proper way is to use CPPFLAGS="-DPD_FLOAT_PRECISION=64", which
results in:

osc-delay-perftest with 400 instances:
debian           : 31%
original         : 29%
single           : 22%
single(O0)       : 64%
single(O2)       : 25%
single(O2+loop)  : 22%
single(pentium3) : 24%
single(pentium4) : 22%
single(prescott) : 22%
single(core2)    : 22%
single(core2+sse): 22%
double           : 25%
double(O0)       : 86%
double(O2)       : 27%
double(O2+loop)  : 26%
double(pentium3) : 25%
double(pentium4) : 24%
double(prescott) : 24%
double(core2)    : 24%
double(core2+sse): 25%

osc-delay-perftest with 1200 instances:
debian           : 94%
original         : 81%
single           : 65%
single(O0)       : ++%
single(O2)       : 72%
single(O2+loop)  : 66%
single(pentium3) : 70%
single(pentium4) : 66%
single(prescott) : 65%
single(core2)    : 59%
single(core2+sse): 64%
double           : 77%
double(O0)       : ++%
double(O2)       : 82%
double(O2+loop)  : 77%
double(pentium3) : 79%
double(pentium4) : 75%
double(prescott) : 75%
double(core2)    : 71%
double(core2+sse): 75%

which is more in line with katja's measurements.
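
a minimal sketch of why a -D define on the command line is enough to switch
precision without editing a header. the header only supplies a default when
the macro is not already defined. the names sketch_float and
sketch_float_bits are made up for illustration and are not taken from m_pd.h;
the real header may select its type differently.

```c
#include <stddef.h>

/* Default to single precision unless the build passes
 * -DPD_FLOAT_PRECISION=64 (e.g. via CPPFLAGS). */
#ifndef PD_FLOAT_PRECISION
#define PD_FLOAT_PRECISION 32
#endif

/* sketch_float is a hypothetical name used only in this example. */
#if PD_FLOAT_PRECISION == 32
typedef float sketch_float;
#elif PD_FLOAT_PRECISION == 64
typedef double sketch_float;
#else
#error "PD_FLOAT_PRECISION must be 32 or 64"
#endif

/* Width in bits of the selected sample type (assumes 8-bit bytes). */
static int sketch_float_bits(void)
{
    return (int)(sizeof(sketch_float) * 8);
}
```

since the define is a preprocessor option, putting it in CPPFLAGS leaves
whatever optimization flags the build system keeps in CFLAGS untouched.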

this is (again) on an i5 650 @ 3.2GHz running in 32bit mode
optimization flags (as far as they can be reconstructed :-))
debian: "-g -O2" (this is what is dictated by debian policy)
original: "-O6 -funroll-loops -fomit-frame-pointer"  (seems to be the
default)
single/double: ->original
(O0): -O0
(O2): -g -O2
(O2+loop): -g -O2 -funroll-loops -fomit-frame-pointer
(prescott): ->original + "-march=prescott"
(core2): ->original + "-march=core2"
(core2+sse): ->original + "-march=core2 -mfpmath=sse -msse2"


so it seems like the biggest performance boost (on the tested
platform) is given by compiling with "-g -O2 -funroll-loops
-fomit-frame-pointer" (which is cool because i think this can even make
it into debian, the way it is)


inline function (like it was already suggested by IOhannes a while
ago), but at -O0 nothing will be inlined. A benchmark howto would be
useful indeed.
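
To illustrate the inlining point, here is a rough sketch of a "big or small"
test written as an inline function. The bit mask follows the check as I
remember it from m_pd.h, so treat the constant as an assumption, and the name
sketch_bigorsmall is invented for this example. It assumes 32-bit IEEE 754
floats.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 when the two highest exponent bits of f are both clear
 * (magnitude below about 2^-63, including zero and denormals) or both
 * set (magnitude of about 2^65 and above, including inf and NaN). */
static inline int sketch_bigorsmall(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* copy the bit pattern, no aliasing tricks */
    bits &= 0x60000000u;             /* keep the two highest exponent bits */
    return bits == 0u || bits == 0x60000000u;
}
```

With optimization enabled the compiler can fold this into the calling DSP
loop. At -O0 it stays a real function call per sample, which is one plausible
reason for the large -O0 numbers above.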


well, i usually just cram lots of the same object into a subpatch (until
i get approximately 80% in the slowest environment, in order to not max
out the CPU and get unknown side-effects), and measure it with the
built-in load-meter (for loads <100% it behaves quite the same as top).
nothing very dramatic.


Nice tests, thanks for that. I would be interested to see the effects of auto-vectorization on these numbers. Have you tried that? If the test patch doesn't include objects that have loops vectorized, it won't make a difference.
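
For reference, a sketch of the kind of loop the auto-vectorizer can usually
handle: fixed stride, no branches, and no overlap between input and output.
The function name sketch_scale_add is hypothetical and this is not code from
Pd. Real perform routines often receive pointers that may alias, which can
prevent vectorization.

```c
#include <stddef.h>

/* out[i] = in1[i] * gain + in2[i] for n samples.
 * 'restrict' promises the compiler that the buffers do not overlap,
 * which makes it easier for it to emit SIMD code. */
static void sketch_scale_add(float *restrict out,
                             const float *restrict in1,
                             const float *restrict in2,
                             float gain, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in1[i] * gain + in2[i];
}
```

With GCC, -O3 turns on -ftree-vectorize, and it can also be added to -O2 by
hand. On a 32-bit x86 build it probably also needs something like -msse2
before any vector instructions are generated.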

.hc


----------------------------------------------------------------------------

If you are not part of the solution, you are part of the problem.



_______________________________________________
Pd-dev mailing list
[email protected]
http://lists.puredata.info/listinfo/pd-dev
