On Sat, Mar 22, 2008 at 10:59 PM, David Cournapeau <
[EMAIL PROTECTED]> wrote:

> Charles R Harris wrote:
> >
> > It looks like memory access is the bottleneck, otherwise running 4
> > floats through in parallel should go a lot faster. I need to modify
> > the program a bit and see how it works for doubles.
>
> I am not sure the benchmark is really meaningful: it does not use
> aligned buffers (16-byte alignment), and because of that, does not
> give a good idea of what can be expected from SSE. It does show why it
> is not so easy to get good performance, though, and why just throwing
> in a few optimized loops won't work. Using sse/sse2 on unaligned
> buffers is a waste of time: without that alignment you have to use the
> unaligned load (_mm_loadu_ps instead of _mm_load_ps), and that is
> extremely slow, basically killing most of the speed increase you can
> expect from using sse.
>

Yep, but I expect the compilers to take care of alignment, say by inserting
a few scalar ops where needed. So I would just as soon leave vectorization to
the compilers and wait until they get that good. The only thing needed then
is contiguous data.

Chuck
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
