On Sat, Mar 22, 2008 at 10:59 PM, David Cournapeau <[EMAIL PROTECTED]> wrote:
> Charles R Harris wrote:
>>
>> It looks like memory access is the bottleneck, otherwise running 4
>> floats through in parallel should go a lot faster. I need to modify
>> the program a bit and see how it works for doubles.
>
> I am not sure the benchmark is really meaningful: it does not use
> aligned buffers (16-byte alignment), and because of that it does not
> give a good idea of what can be expected from SSE. It does show why it
> is not so easy to get good performance, and why just throwing in a few
> optimized loops won't work, though. Using sse/sse2 on unaligned buffers
> is a waste of time. Without that alignment you need to take it into
> account (using _mm_loadu_ps instead of _mm_load_ps), and that is
> extremely slow, basically killing most of the speed increase you can
> expect from using sse.

Yep, but I expect the compilers to take care of alignment, say by inserting a few scalar ops when needed. So I would just as soon leave vectorization to the compilers and wait until they get that good. The only thing needed then is contiguous data.

Chuck
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion