----- Original Message -----
> On 03/11/2013 07:56 AM, Jose Fonseca wrote:
> > I'm surprised this is faster.
> >
> > In particular, for big things we'll be touching memory twice.
> >
> > Did you measure the speed up?
> 
> The second hit is cache-hot, so it may not be too expensive.  

Yes, but the size in question is 1900x1200, i.e., about 9MB at 4 bytes per 
pixel, which will thrash the L1/L2 caches, and won't even fit in the L3 cache 
of several processors.

I'm afraid we'd be optimizing some cases at the expense of others.

I think that at the very least we should do this in 16KB/32KB or so chunks to 
avoid thrashing the lower level caches.
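
A rough sketch of the chunked approach, assuming 32-bit pixels with alpha in 
the top byte and a copy followed by an in-place OR, as discussed in this 
thread. The function name and the 16KB chunk size are illustrative 
assumptions, not taken from the patch; the right size would have to be 
measured:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed chunk size: 16KB worth of 32-bit pixels. */
#define CHUNK_PIXELS (16 * 1024 / sizeof(uint32_t))

/* Copy n pixels from src to dst, forcing alpha to 0xff, one chunk at a
 * time so the second pass runs while the chunk is likely still cached. */
static void
copy_set_alpha_chunked(uint32_t *dst, const uint32_t *src, size_t n)
{
   while (n > 0) {
      size_t count = n < CHUNK_PIXELS ? n : CHUNK_PIXELS;
      size_t i;

      /* First pass: bulk copy of this chunk only. */
      memcpy(dst, src, count * sizeof *dst);

      /* Second pass: set the alpha byte on the chunk just copied. */
      for (i = 0; i < count; i++)
         dst[i] |= 0xff000000u;

      src += count;
      dst += count;
      n -= count;
   }
}
```

Whether this actually beats a single pass over the whole surface would still 
need to be benchmarked.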

> I suspect
> memcpy is optimized to fill the cache in a more efficient manner than
> the old loop.  Since the old loop did a read and a bit-wise or, it's
> also possible the compiler generated some really dumb code.  We'd have
> to look at the assembly output to know.
> 
> As Patrick suggests, there's probably an SSE2 method to do this even
> faster.  That may be worth investigating.

An SSE2 version is quite easy with intrinsics:
 
  // needs <emmintrin.h>; could use _mm_load_si128 with some alignment checks
  __m128i pixels = _mm_loadu_si128((const __m128i *)src);
  pixels = _mm_or_si128(pixels, _mm_set1_epi32(0xff000000));
  _mm_storeu_si128((__m128i *)dst, pixels);
  src += sizeof(__m128i) / sizeof *src;
  dst += sizeof(__m128i) / sizeof *dst;

The hard part is the runtime check for SSE2 support...
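
One possible way to do that check, sketched under the assumption of GCC or 
Clang on x86, is the compiler builtin __builtin_cpu_supports(). The function 
names below are made up for illustration, and other compilers would need a 
different detection mechanism (e.g. CPUID):

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <emmintrin.h>
#define HAVE_SSE2_PATH 1
#else
#define HAVE_SSE2_PATH 0
#endif

/* Portable fallback: one pixel at a time. */
static void
copy_set_alpha_scalar(uint32_t *dst, const uint32_t *src, size_t n)
{
   size_t i;
   for (i = 0; i < n; i++)
      dst[i] = src[i] | 0xff000000u;
}

#if HAVE_SSE2_PATH
/* SSE2 path: four 32-bit pixels per iteration, unaligned loads/stores. */
__attribute__((target("sse2")))
static void
copy_set_alpha_sse2(uint32_t *dst, const uint32_t *src, size_t n)
{
   const __m128i alpha = _mm_set1_epi32((int)0xff000000u);

   while (n >= 4) {
      __m128i pixels = _mm_loadu_si128((const __m128i *)src);
      pixels = _mm_or_si128(pixels, alpha);
      _mm_storeu_si128((__m128i *)dst, pixels);
      src += 4;
      dst += 4;
      n -= 4;
   }

   /* Remaining 0..3 pixels. */
   copy_set_alpha_scalar(dst, src, n);
}
#endif

/* Pick the SSE2 path at runtime when the CPU reports support for it. */
static void
copy_set_alpha(uint32_t *dst, const uint32_t *src, size_t n)
{
#if HAVE_SSE2_PATH
   if (__builtin_cpu_supports("sse2")) {
      copy_set_alpha_sse2(dst, src, n);
      return;
   }
#endif
   copy_set_alpha_scalar(dst, src, n);
}
```

In real code the result of the check would probably be cached once at 
startup rather than queried on every call.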

Jose
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev
