Hi David et al,

Very interesting.  I thought that 64-bit gcc automatically
aligned memory on 16-byte (or 32-byte) boundaries.  But apparently
not, because running your code certainly made the intrinsic code
quite a bit faster.  Another thing I noticed was that the "simple"
code was _much_ faster using gcc-4.3 with -O3 than with -O2.  I've
noticed this with some other code recently as well -- the automatic
loop unrolling and vectorization that -O3 enables really helps for
this type of code.
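
For illustration, here is the kind of loop I mean (a minimal sketch,
not my actual benchmark code; gcc-4.3 turns on -ftree-vectorize at
-O3 but not at -O2, which is likely what makes the difference):

    /* Compile with gcc-4.3 -O2 vs gcc-4.3 -O3: the -O3 build can
       unroll and auto-vectorize this loop on its own. */
    void vec_add(float *z, const float *x, const float *y, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            z[i] = x[i] + y[i];
    }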

You can see my benchmarks here (posted there to avoid line-wrap
issues):
http://www.cv.nrao.edu/~sransom/vec_results.txt

Scott


On Sun, Mar 23, 2008 at 01:59:39PM +0900, David Cournapeau wrote:
> Charles R Harris wrote:
> >
> > It looks like memory access is the bottleneck, otherwise running 4 
> > floats through in parallel should go a lot faster. I need to modify 
> > the program a bit and see how it works for doubles.
> 
> I am not sure the benchmark is really meaningful: it does not use 
> aligned buffers (16-byte alignment), and because of that it does not 
> give a good idea of what can be expected from SSE. It does show why it 
> is not so easy to get good performance, and why just throwing in a few 
> optimized loops won't work, though. Using sse/sse2 on unaligned 
> buffers is a waste of time: without 16-byte alignment, you have to use 
> the unaligned access intrinsics (_mm_loadu_ps instead of _mm_load_ps), 
> and those are extremely slow, killing most of the speed increase you 
> can expect from using sse.
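> 
> As a minimal sketch of the difference (illustrative names only, not 
> the benchmark code; the buffers are assumed to hold at least n floats, 
> with any n % 4 tail elements handled in a scalar loop):
> 
>     #include <xmmintrin.h>
> 
>     /* Unaligned access: works for any pointer, but _mm_loadu_ps is 
>        much slower than _mm_load_ps, especially on a P4. */
>     void add_unaligned(float *z, const float *x, const float *y, int n)
>     {
>         int i;
>         for (i = 0; i <= n - 4; i += 4) {
>             __m128 a = _mm_loadu_ps(x + i);
>             __m128 b = _mm_loadu_ps(y + i);
>             _mm_storeu_ps(z + i, _mm_add_ps(a, b));
>         }
>     }
> 
>     /* Aligned access: requires x, y and z to be 16-byte aligned. */
>     void add_aligned(float *z, const float *x, const float *y, int n)
>     {
>         int i;
>         for (i = 0; i <= n - 4; i += 4) {
>             __m128 a = _mm_load_ps(x + i);
>             __m128 b = _mm_load_ps(y + i);
>             _mm_store_ps(z + i, _mm_add_ps(a, b));
>         }
>     }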
> 
> Here is what I get with the above benchmark:
> 
>         Problem size              Simple               Intrin               Inline
>                  100    0.0002ms (100.0%)    0.0001ms ( 71.5%)    0.0001ms ( 85.0%)
>                 1000    0.0014ms (100.0%)    0.0010ms ( 70.6%)    0.0013ms ( 96.8%)
>                10000    0.0162ms (100.0%)    0.0095ms ( 58.2%)    0.0128ms ( 78.7%)
>               100000    0.4189ms (100.0%)    0.4135ms ( 98.7%)    0.4149ms ( 99.0%)
>              1000000    5.9523ms (100.0%)    5.8933ms ( 99.0%)    5.8910ms ( 99.0%)
>             10000000   58.9645ms (100.0%)   58.2620ms ( 98.8%)   58.7443ms ( 99.6%)
> 
> Basically, no help at all: this is on a P4, whose FPU is extremely slow 
> when not fed with optimized SSE.
> 
> Now, here is what I get if I use posix_memalign, replace the intrinsics 
> with their aligned-access versions, and use an accurate cycle counter 
> (cycle.h, provided by FFTW).
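> 
> For reference, a minimal sketch of the aligned allocation (buf and n 
> are illustrative names, not the ones in vec_bench.c):
> 
>     #include <stdlib.h>
> 
>     float *buf = NULL;
>     /* posix_memalign returns a 16-byte aligned buffer, so the aligned 
>        load/store intrinsics can safely be used on it. */
>     if (posix_memalign((void **)&buf, 16, n * sizeof(float)) != 0)
>         abort();  /* allocation failed */
>     /* ... run the benchmark with _mm_load_ps/_mm_store_ps on buf ... */
>     free(buf);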
> 
> Compiled as is:
> 
> Testing methods...
> All OK
> 
>         Problem size                    Simple                      Intrin                      Inline
>                  100    4.16e+02 cycles (100.0%)    4.04e+02 cycles ( 97.1%)    4.92e+02 cycles (118.3%)
>                 1000    3.66e+03 cycles (100.0%)    3.11e+03 cycles ( 84.8%)    4.10e+03 cycles (111.9%)
>                10000    3.47e+04 cycles (100.0%)    3.01e+04 cycles ( 86.7%)    4.06e+04 cycles (116.8%)
>               100000    1.36e+06 cycles (100.0%)    1.34e+06 cycles ( 98.7%)    1.45e+06 cycles (106.7%)
>              1000000    1.92e+07 cycles (100.0%)    1.87e+07 cycles ( 97.1%)    1.89e+07 cycles ( 98.2%)
>             10000000    1.86e+08 cycles (100.0%)    1.80e+08 cycles ( 96.8%)    1.81e+08 cycles ( 97.4%)
> 
> Compiled with -DALIGNED, which uses the aligned access intrinsics:
> 
> Testing methods...
> All OK
> 
>         Problem size                    Simple                      Intrin                      Inline
>                  100    4.16e+02 cycles (100.0%)    1.96e+02 cycles ( 47.1%)    4.92e+02 cycles (118.3%)
>                 1000    3.82e+03 cycles (100.0%)    1.56e+03 cycles ( 40.8%)    4.22e+03 cycles (110.4%)
>                10000    3.46e+04 cycles (100.0%)    1.92e+04 cycles ( 55.5%)    4.13e+04 cycles (119.4%)
>               100000    1.32e+06 cycles (100.0%)    1.12e+06 cycles ( 85.0%)    1.16e+06 cycles ( 87.8%)
>              1000000    1.95e+07 cycles (100.0%)    1.92e+07 cycles ( 98.3%)    1.95e+07 cycles (100.2%)
>             10000000    1.82e+08 cycles (100.0%)    1.79e+08 cycles ( 98.4%)    1.81e+08 cycles ( 99.3%)
> 
> This gives a drastic difference (I did not touch the inline code, 
> because it is Sunday and I am lazy). If I use this on a sane CPU (Core 2 
> Duo macbook) instead of my Pentium 4, I get better results (in 
> particular, the SSE code is never slower, and I get about a 2x speed 
> increase as long as the buffer fits in the cache).
> 
> It looks like using prefetch also gives some improvement when on the 
> edge of the cache size (my P4 has a 512 KB L2 cache):
> 
> Testing methods...
> All OK
> 
>         Problem size                    Simple                      Intrin                      Inline
>                  100    4.16e+02 cycles (100.0%)    2.52e+02 cycles ( 60.6%)    4.92e+02 cycles (118.3%)
>                 1000    3.55e+03 cycles (100.0%)    1.85e+03 cycles ( 52.2%)    4.21e+03 cycles (118.7%)
>                10000    3.48e+04 cycles (100.0%)    1.76e+04 cycles ( 50.6%)    4.13e+04 cycles (118.9%)
>               100000    1.11e+06 cycles (100.0%)    7.20e+05 cycles ( 64.8%)    1.12e+06 cycles (101.3%)
>              1000000    1.91e+07 cycles (100.0%)    1.98e+07 cycles (103.4%)    1.91e+07 cycles (100.0%)
>             10000000    1.83e+08 cycles (100.0%)    1.90e+08 cycles (103.9%)    1.82e+08 cycles ( 99.3%)
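> 
> The kind of prefetch I mean, as a minimal sketch (the prefetch 
> distance and the _MM_HINT_T0 hint are things to tune; the names are 
> illustrative, not the exact vec_bench.c code):
> 
>     #include <xmmintrin.h>
> 
>     /* x, y and z assumed 16-byte aligned, n a multiple of 4. 
>        Prefetching past the end of the buffers is harmless: the 
>        prefetch instructions never fault. */
>     void add_prefetch(float *z, const float *x, const float *y, int n)
>     {
>         int i;
>         for (i = 0; i < n; i += 4) {
>             /* Hint the data a few cache lines ahead into the cache. */
>             _mm_prefetch((const char *)(x + i + 16), _MM_HINT_T0);
>             _mm_prefetch((const char *)(y + i + 16), _MM_HINT_T0);
>             _mm_store_ps(z + i, _mm_add_ps(_mm_load_ps(x + i),
>                                            _mm_load_ps(y + i)));
>         }
>     }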
> 
> The code can be seen here:
> 
> http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/vec_bench.c
> http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/Makefile
> http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/cycle.h
> 
> Another thing that I have not seen mentioned, but which may be worth 
> pursuing, is using SSE in element-wise operations: you can get extremely 
> fast exp, sin, cos and the like using SSE. Those are much easier to 
> include in numpy (but much more difficult to implement...). See for 
> example:
> 
> http://www.pixelglow.com/macstl/
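> 
> To give the flavor (a toy sketch only: a real vectorized exp or sin 
> needs a polynomial approximation, which is what macstl implements; 
> _mm_rcp_ps below just stands in as a cheap element-wise SSE op):
> 
>     #include <xmmintrin.h>
> 
>     /* Approximate 1/x[i] for each element; x and y assumed 16-byte 
>        aligned, n a multiple of 4. */
>     void vec_recip(float *y, const float *x, int n)
>     {
>         int i;
>         for (i = 0; i < n; i += 4)
>             _mm_store_ps(y + i, _mm_rcp_ps(_mm_load_ps(x + i)));
>     }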
> 
> cheers,
> 
> David

-- 
Scott M. Ransom            Address:  NRAO
Phone:  (434) 296-0320               520 Edgemont Rd.
email:  [EMAIL PROTECTED]             Charlottesville, VA 22903 USA
GPG Fingerprint: 06A9 9553 78BE 16DB 407B  FFCA 9BFA B6FF FFD3 2989