Hi David et al,

Very interesting. I thought that 64-bit gcc automatically aligned memory on 16-byte (or 32-byte) boundaries. But apparently not, because running your code certainly made the intrinsic code quite a bit faster.

However, another thing that I noticed was that the "simple" code was _much_ faster using gcc-4.3 with -O3 than with -O2. I've noticed this with some other code recently as well -- the auto loop-unrolling really helps for this type of code.
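(For anyone following along, here is a minimal sketch of the kind of loop where this shows up -- a plain elementwise kernel like the "simple" path in the benchmark. The function name is made up, not from vec_bench.c. With gcc-4.3, -O3 enables the auto-unrolling/vectorization that -O2 does not.)

```c
#include <stddef.h>

/* A simple elementwise loop: with gcc-4.3 -O2 this stays a rolled
 * scalar loop, while -O3 unrolls and (on x86) auto-vectorizes it.
 * Hypothetical name, for illustration only. */
void vec_add(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```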
You can see my benchmarks here (posted there to avoid line-wrap issues):

http://www.cv.nrao.edu/~sransom/vec_results.txt

Scott

On Sun, Mar 23, 2008 at 01:59:39PM +0900, David Cournapeau wrote:
> Charles R Harris wrote:
> >
> > It looks like memory access is the bottleneck, otherwise running 4
> > floats through in parallel should go a lot faster. I need to modify
> > the program a bit and see how it works for doubles.
>
> I am not sure the benchmark is really meaningful: it does not use
> aligned buffers (16-byte alignment), and because of that, does not
> give a good idea of what can be expected from SSE. It shows why it is
> not so easy to get good performance, and why just throwing in a few
> optimized loops won't work, though. Using sse/sse2 on unaligned
> buffers is a waste of time. Without this alignment, you need to take
> the alignment into account (using _mm_loadu_ps vs _mm_load_ps), and
> that's extremely slow, basically killing most of the speed increase
> you can expect from using sse.
>
> Here is what I get with the above benchmark:
>
> Problem size     Simple               Intrin               Inline
>          100   0.0002ms (100.0%)   0.0001ms ( 71.5%)   0.0001ms ( 85.0%)
>         1000   0.0014ms (100.0%)   0.0010ms ( 70.6%)   0.0013ms ( 96.8%)
>        10000   0.0162ms (100.0%)   0.0095ms ( 58.2%)   0.0128ms ( 78.7%)
>       100000   0.4189ms (100.0%)   0.4135ms ( 98.7%)   0.4149ms ( 99.0%)
>      1000000   5.9523ms (100.0%)   5.8933ms ( 99.0%)   5.8910ms ( 99.0%)
>     10000000  58.9645ms (100.0%)  58.2620ms ( 98.8%)  58.7443ms ( 99.6%)
>
> Basically, no help at all: this is on a P4, whose FPU is extremely
> slow if not used with optimized SSE.
>
> Now, if I use posix_memalign, replace the intrinsics with their
> aligned-access variants, and use an accurate cycle counter (cycle.h,
> provided by fftw).
>
> Compiled as is:
>
> Testing methods...
> All OK
>
> Problem size     Simple                      Intrin                      Inline
>          100   4.16e+02 cycles (100.0%)   4.04e+02 cycles ( 97.1%)   4.92e+02 cycles (118.3%)
>         1000   3.66e+03 cycles (100.0%)   3.11e+03 cycles ( 84.8%)   4.10e+03 cycles (111.9%)
>        10000   3.47e+04 cycles (100.0%)   3.01e+04 cycles ( 86.7%)   4.06e+04 cycles (116.8%)
>       100000   1.36e+06 cycles (100.0%)   1.34e+06 cycles ( 98.7%)   1.45e+06 cycles (106.7%)
>      1000000   1.92e+07 cycles (100.0%)   1.87e+07 cycles ( 97.1%)   1.89e+07 cycles ( 98.2%)
>     10000000   1.86e+08 cycles (100.0%)   1.80e+08 cycles ( 96.8%)   1.81e+08 cycles ( 97.4%)
>
> Compiled with -DALIGNED, which uses aligned-access intrinsics:
>
> Testing methods...
> All OK
>
> Problem size     Simple                      Intrin                      Inline
>          100   4.16e+02 cycles (100.0%)   1.96e+02 cycles ( 47.1%)   4.92e+02 cycles (118.3%)
>         1000   3.82e+03 cycles (100.0%)   1.56e+03 cycles ( 40.8%)   4.22e+03 cycles (110.4%)
>        10000   3.46e+04 cycles (100.0%)   1.92e+04 cycles ( 55.5%)   4.13e+04 cycles (119.4%)
>       100000   1.32e+06 cycles (100.0%)   1.12e+06 cycles ( 85.0%)   1.16e+06 cycles ( 87.8%)
>      1000000   1.95e+07 cycles (100.0%)   1.92e+07 cycles ( 98.3%)   1.95e+07 cycles (100.2%)
>     10000000   1.82e+08 cycles (100.0%)   1.79e+08 cycles ( 98.4%)   1.81e+08 cycles ( 99.3%)
>
> This gives a drastic difference (I did not touch the inline code,
> because it is Sunday and I am lazy). If I use this on a sane CPU
> (core 2 duo, macbook) instead of my pentium4, I get better results
> (in particular, the sse code is never slower, and I get a twofold
> speed increase as long as the buffer fits in cache).
>
> It looks like using prefetch also gives some improvement when on the
> edge of the cache size (my P4 has a 512 kB L2 cache):
>
> Testing methods...
> All OK
>
> Problem size     Simple                      Intrin                      Inline
>          100   4.16e+02 cycles (100.0%)   2.52e+02 cycles ( 60.6%)   4.92e+02 cycles (118.3%)
>         1000   3.55e+03 cycles (100.0%)   1.85e+03 cycles ( 52.2%)   4.21e+03 cycles (118.7%)
>        10000   3.48e+04 cycles (100.0%)   1.76e+04 cycles ( 50.6%)   4.13e+04 cycles (118.9%)
>       100000   1.11e+06 cycles (100.0%)   7.20e+05 cycles ( 64.8%)   1.12e+06 cycles (101.3%)
>      1000000   1.91e+07 cycles (100.0%)   1.98e+07 cycles (103.4%)   1.91e+07 cycles (100.0%)
>     10000000   1.83e+08 cycles (100.0%)   1.90e+08 cycles (103.9%)   1.82e+08 cycles ( 99.3%)
>
> The code can be seen here:
>
> http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/vec_bench.c
> http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/Makefile
> http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/cycle.h
>
> Another thing that I have not seen mentioned but may be worth
> pursuing is using SSE in element-wise operations: you can get
> extremely fast exp, sin, cos and co using sse. Those are much easier
> to include in numpy (but much more difficult to implement...). See
> for example:
>
> http://www.pixelglow.com/macstl/
>
> cheers,
>
> David
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion

--
Scott M. Ransom            Address:  NRAO
Phone:  (434) 296-0320               520 Edgemont Rd.
email:  [EMAIL PROTECTED]            Charlottesville, VA 22903 USA
GPG Fingerprint: 06A9 9553 78BE 16DB 407B FFCA 9BFA B6FF FFD3 2989
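(Editor's note: for readers who want to try the aligned-buffer trick David describes, here is a minimal sketch -- not his actual vec_bench.c, and the function names are made up. posix_memalign returns 16-byte-aligned buffers, which lets the loop use _mm_load_ps/_mm_store_ps instead of the much slower unaligned _mm_loadu_ps/_mm_storeu_ps.)

```c
#include <stdlib.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* 16-byte-aligned allocation, as SSE aligned loads require. */
static float *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* dst = a + b using aligned SSE loads/stores; all three buffers
 * must be 16-byte aligned (e.g. from alloc_aligned above). */
static void add_sse(float *dst, const float *a, const float *b, size_t n)
{
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {        /* 4 floats per SSE op */
        __m128 va = _mm_load_ps(a + i);      /* aligned load */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(dst + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)                       /* scalar tail */
        dst[i] = a[i] + b[i];
}
```

On an unaligned buffer the _mm_load_ps calls would fault, which is exactly why the original benchmark had to fall back to _mm_loadu_ps and lost most of the speedup.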