Re: [LAD] GCC Vector extensions

Gabriel Beddingfield Tue, 26 Jul 2011 05:30:29 -0700

On 07/26/2011 03:15 AM, Maurizio De Cecco wrote:

So are you now considering using an #ifdef to select float/4 instead of
double/8 vectors in jMax, or will you just change all of them?


Well, at the moment on gcc the performance with vector types is the same
as without vector types, so I'll leave the Linux version without vector
types (the code is #ifdef'ed).

When I was playing around with this last night... the best performance came from your non-optimized, non-vectored code.


Why?

Because GCC translated it to optimized, vectored code.
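To illustrate the point, here is a minimal sketch of the kind of plain loop that GCC's auto-vectorizer can turn into SIMD code by itself. It is not the code from the original test; the name `add3_scalar` and the argument order are assumptions made for the example. Auto-vectorization is enabled at -O3 (or explicitly with -ftree-vectorize), and on 32-bit x86 SSE typically has to be enabled as well (for example with -msse2).

```c
#include <stddef.h>

/* Plain scalar loop: no vector types, no intrinsics.
 * `restrict` promises the compiler that the three buffers do not
 * overlap, which removes the aliasing doubt that would otherwise
 * force a runtime overlap check or block vectorization.
 * add3_scalar is a hypothetical name used only for this sketch. */
void add3_scalar(float *restrict out, const float *restrict a,
                 const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Whether the loop was actually vectorized depends on the compiler version and flags, so it is worth checking the generated assembly (gcc -S) rather than assuming it.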

By the way, I forgot to mention that all my tests were at 64 bits;
I'll try later on a 32 bit Ubuntu.

I was on 32 bit Ubuntu. Also, with GCC the 64-bit optimizer is known to be better at optimising SIMD code.

Because I'm a sucker for these kinds of diversions, I came up with a scheme that shaved about 1 second off your test (on my machine). It assumes that `vecsize` is a power-of-two (16 or larger, since each loop iteration consumes 16 floats). The idea is to store stuff in the processor registers, and access each buffer one cache line at a time (a cache line is 64 bytes on x86... 16 floats).

static inline void add3_vec(float * restrict arg0, float * restrict arg1,
                            float * restrict arg2, unsigned int vecsize)
{
  unsigned int i;
  v4sf *v0, *v1, *v2;
  v4sf c0, c1, c2, c3, c4, c5, c6, c7;
  const unsigned cache_size = 4;

  v0 = (v4sf*)arg0;
  v1 = (v4sf*)arg1;
  v2 = (v4sf*)arg2;
  vecsize /= 4*cache_size;

  while(vecsize--) {
    c0 = *v0++;
    c1 = *v0++;
    c2 = *v0++;
    c3 = *v0++;
    c4 = *v1++;
    c5 = *v1++;
    c6 = *v1++;
    c7 = *v1++;
    *v2++ = c0 + c4;
    *v2++ = c1 + c5;
    *v2++ = c2 + c6;
    *v2++ = c3 + c7;
  }

}
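The function above relies on a `v4sf` type defined elsewhere in the thread and on buffers that are 16-byte aligned with a length that is a multiple of 16. A self-contained sketch of the same idea, with those assumptions made explicit, might look like the following. The `v4sf` typedef, the name `add3_vec_checked`, the alignment assert and the scalar tail loop are all additions for illustration, not part of the original post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed definition: four packed floats (16 bytes), using the GCC
 * vector extension. may_alias lets v4sf pointers view float buffers. */
typedef float v4sf __attribute__((vector_size(16), may_alias));

/* out[i] = a[i] + b[i] for i in [0, n).
 * Buffers must be 16-byte aligned and must not overlap.
 * Hypothetical variant of add3_vec that also handles any n. */
static inline void add3_vec_checked(const float *restrict a,
                                    const float *restrict b,
                                    float *restrict out, size_t n)
{
    /* Dereferencing a v4sf pointer requires 16-byte alignment. */
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    const v4sf *va = (const v4sf *)a;
    const v4sf *vb = (const v4sf *)b;
    v4sf *vo = (v4sf *)out;
    size_t blocks = n / 16;   /* 16 floats = one 64-byte cache line */
    size_t i;

    for (i = 0; i < blocks; i++) {
        /* Load one cache line from each input into locals first... */
        v4sf a0 = va[0], a1 = va[1], a2 = va[2], a3 = va[3];
        v4sf b0 = vb[0], b1 = vb[1], b2 = vb[2], b3 = vb[3];
        /* ...then write one cache line of results. */
        vo[0] = a0 + b0;
        vo[1] = a1 + b1;
        vo[2] = a2 + b2;
        vo[3] = a3 + b3;
        va += 4; vb += 4; vo += 4;
    }

    /* Scalar tail for lengths that are not a multiple of 16. */
    for (i = blocks * 16; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Buffers can be aligned with `__attribute__((aligned(16)))` on static or stack arrays, or with `posix_memalign` for heap allocations. Whether the eight locals really stay in registers is up to the compiler, so the speedup is not guaranteed to carry over to other machines.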

-gabriel
_______________________________________________
Linux-audio-dev mailing list
[email protected]
http://lists.linuxaudio.org/listinfo/linux-audio-dev
