On 22/03/2008, Thomas Grill <[EMAIL PROTECTED]> wrote:

> I've experimented with branching the ufuncs into different constant
>  strides and aligned/unaligned cases to be able to use SSE using
>  compiler intrinsics.
>  I expected a considerable gain, as I was using float32 with stride 1
>  most of the time.
>  However, profiling revealed that hardly anything was gained because of
>  1) non-alignment of the vectors; this _could_ be handled by
>  shuffled loading of the values, though
>  2) the fact that my application used relatively large vectors that
>  wouldn't fit into the CPU cache, hence the memory transfer slowed down
>  the CPU.
>
>  I found the latter to be a real showstopper for most of my experiments
>  with SIMD. It's especially a problem for numpy because smaller vectors
>  have a lot of Python/numpy overhead, and larger ones don't really
>  benefit due to cache exhaustion.
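For concreteness, the stride-1 branching Thomas describes might look roughly like the sketch below: check alignment at runtime, take the SSE path when all pointers are 16-byte aligned, and fall back to a scalar loop otherwise. The function name `add_f32` and the structure are illustrative, not NumPy's actual ufunc inner-loop code.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical float32 ufunc inner loop: branch on alignment, use
 * aligned 16-byte SSE loads when possible, scalar loop otherwise. */
static void add_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    /* SSE path only when all three pointers are 16-byte aligned */
    if ((((uintptr_t)a | (uintptr_t)b | (uintptr_t)out) & 15) == 0) {
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_load_ps(a + i);   /* aligned loads */
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(va, vb));
        }
    }
    for (; i < n; ++i)   /* scalar tail / unaligned fallback */
        out[i] = a[i] + b[i];
}
```

The scalar fallback also serves as the tail loop for lengths that are not a multiple of four, which keeps the branching logic small.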

This particular issue can sometimes be mitigated by clever use of the
prefetching intrinsics. I'm not sure it will help inside most ufuncs,
though, since their runtime is so dominated by memory reads. In a
program I was writing, I had time to do a 128-point real FFT in the
time it took to load the next 64 floats...
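A minimal sketch of what such prefetching could look like: issue `_mm_prefetch` hints a fixed distance ahead of the current position so cache-line fills overlap with the arithmetic. The distance (`PF_DIST` here) is a made-up tuning parameter; the right value depends entirely on the machine, and on memory-bound loops like this the gain may be small.

```c
#include <stddef.h>
#include <xmmintrin.h>

#define PF_DIST 64  /* floats ahead = 256 bytes = 4 cache lines; a guess */

/* Same add loop, but hinting the hardware to start fetching the cache
 * lines we will need PF_DIST elements from now. */
static void add_f32_prefetch(const float *a, const float *b,
                             float *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        if (i + PF_DIST < n) {
            _mm_prefetch((const char *)(a + i + PF_DIST), _MM_HINT_T0);
            _mm_prefetch((const char *)(b + i + PF_DIST), _MM_HINT_T0);
        }
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned loads, to stay general */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)   /* scalar tail */
        out[i] = a[i] + b[i];
}
```

Prefetch hints are advisory only, so this is always safe to run; whether it pays off has to be measured per platform.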

Anne
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion