On 07/26/2011 03:15 AM, Maurizio De Cecco wrote:
So are you now considering using some #ifdef to select float/4 instead of
double/8 vectors in jMax, or just changing all of them?

Well, at the moment on gcc the performance with vector types is the same
as without vector types, so I'll leave the Linux version without vector
types (the code is #ifdef'ed).

When I was playing around with this last night... the best performance came from your non-optimized, non-vectorized code.

Why?

Because GCC translated it to optimized, vectorized code.
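
To make that concrete, here is a minimal sketch of the kind of plain loop GCC can vectorize by itself. The name `add3_scalar` and the parameter names are made up for illustration; this is not the actual jMax test code. At `-O3` GCC enables `-ftree-vectorize`, and the `restrict` qualifiers tell it the buffers do not overlap.

```c
#include <stddef.h>

/* Plain scalar version: no vector types, no manual unrolling.
   `restrict` promises the compiler that the three buffers do not
   overlap, so the vectorizer does not need runtime overlap checks.
   (Hypothetical name and signature, for illustration only.) */
void add3_scalar(const float *restrict a, const float *restrict b,
                 float *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Whether the loop really ends up as SIMD instructions depends on the compiler version, the flags and the target, so it is worth checking the generated assembly rather than assuming.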

By the way, I forgot to mention that all my tests were at 64 bits;
I'll try later on a 32-bit Ubuntu.

I was on 32-bit Ubuntu. Also, GCC's optimizer is known to do a better job on SIMD code when targeting 64 bits.

Because I'm a sucker for these kinds of diversions, I came up with a scheme that shaved about 1 second off your test (on my machine). It assumes that `vecsize` is a power of two (and at least 16, since the loop handles 16 floats per pass). The idea is to keep the data in processor registers and access each buffer one cache line at a time (a cache line is 64 bytes on x86... 16 floats).

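/* Assumed definition of v4sf (GCC vector extension): four packed floats. */
typedef float v4sf __attribute__((vector_size(16)));

/* arg2[i] = arg0[i] + arg1[i]. The buffers must be 16-byte aligned and
   vecsize must be a multiple of 16; any remainder is silently skipped. */
static inline void add3_vec(float * restrict arg0, float * restrict arg1, float * restrict arg2, unsigned int vecsize)
{
  v4sf *v0, *v1, *v2;
  v4sf c0, c1, c2, c3, c4, c5, c6, c7;
  const unsigned cache_size = 4;  /* v4sf vectors per 64-byte cache line */

  v0 = (v4sf*)arg0;
  v1 = (v4sf*)arg1;
  v2 = (v4sf*)arg2;
  vecsize /= 4*cache_size;  /* number of 16-float blocks */

  while(vecsize--) {
          /* load one cache line from each input into registers */
          c0 = *v0++;
          c1 = *v0++;
          c2 = *v0++;
          c3 = *v0++;
          c4 = *v1++;
          c5 = *v1++;
          c6 = *v1++;
          c7 = *v1++;
          /* add and store one cache line of output */
          *v2++ = c0 + c4;
          *v2++ = c1 + c5;
          *v2++ = c2 + c6;
          *v2++ = c3 + c7;
  }

}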
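
If the size and alignment assumptions are a problem, one possible way around them is sketched below. This is only an illustration, not something I timed: `add3_any`, `load4` and `store4` are made-up names, and it uses the same GCC vector extension. It does the same 16-floats-per-pass loop, but loads and stores through `memcpy` (so the buffers need no particular alignment) and finishes any leftover samples with a scalar loop.

```c
#include <stddef.h>
#include <string.h>

/* GCC/Clang vector extension: four packed floats (16 bytes). */
typedef float v4sf __attribute__((vector_size(16)));

/* Load/store through memcpy so the buffers need no 16-byte alignment;
   compilers typically turn each call into one unaligned vector move. */
static inline v4sf load4(const float *p)
{
    v4sf v;
    memcpy(&v, p, sizeof v);
    return v;
}

static inline void store4(float *p, v4sf v)
{
    memcpy(p, &v, sizeof v);
}

/* Hypothetical wrapper: out[i] = a[i] + b[i] for any n. */
void add3_any(const float *restrict a, const float *restrict b,
              float *restrict out, size_t n)
{
    size_t i = 0;

    /* Main loop: 16 floats (64 bytes) from each input per pass. */
    for (; i + 16 <= n; i += 16) {
        v4sf a0 = load4(a + i),     a1 = load4(a + i + 4);
        v4sf a2 = load4(a + i + 8), a3 = load4(a + i + 12);
        v4sf b0 = load4(b + i),     b1 = load4(b + i + 4);
        v4sf b2 = load4(b + i + 8), b3 = load4(b + i + 12);
        store4(out + i,      a0 + b0);
        store4(out + i + 4,  a1 + b1);
        store4(out + i + 8,  a2 + b2);
        store4(out + i + 12, a3 + b3);
    }

    /* Scalar tail: the 0..15 samples left over. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The unaligned accesses may cost something compared with the aligned version, so this trades a little speed for not having to guarantee how the buffers were allocated.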

-gabriel
_______________________________________________
Linux-audio-dev mailing list
[email protected]
http://lists.linuxaudio.org/listinfo/linux-audio-dev
