Also, rep movsd will be slower on small counts. On most processors, fewer than about 8 iterations run faster as an explicit move loop than as a rep, because of the rep startup overhead.
This has changed lately:
http://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/51402/reply/34703/
With blocks larger than 512 bytes, SSE/FPU code will always be faster.

On 4-Aug-09, at 9:50 PM, Michael Steil wrote:

> On 4 Aug 2009, at 17:37, Jose Catena wrote:
>>> but how would you want to optimize "rep stosd" anyway?
>>
>> No way. That's what I said, possibly with the exception of using a
>> 64-bit equivalent if we could assume that the CPU is 64-bit capable.
>> But Alex knows better, he is calling me an ignorant. He says that
>>
>> L1:  mov [edi], eax
>>      add edi, 4
>>      dec ecx
>>      jnz L1
>>
>> is faster than
>>
>>      rep stosd
>>
>> Both do exactly the same thing, the latter much smaller AND FASTER in
>> any CPU from the 386 to the i7.
>
> I have done some tests on all generations of Intel CPUs since Yonah,
> and in all cases, rep stosd was faster than any loop I could craft or
> GCC would generate from my C code.
>
> But this does *not* mean that
> * rep stosd is by definition faster than a scalar loop
> * rep stosd is by definition faster than any kind of loop.
>
> Look at the test program at the end of this email. It compares rep
> stosd with a hand-crafted loop written with SSE instructions and SSE
> registers (parts borrowed from XNU).
>
> On all tested machines, the SSE version is significantly faster (for
> big loops):
>
> Yonah: Genuine Intel(R) CPU T2500 @ 2.00GHz
> SSE is 3.34x faster than stosl
>
> Merom: Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz
> SSE is 4.86x faster than stosl
>
> Penryn: Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
> SSE is 4.94x faster than stosl
>
> Nehalem: Intel(R) Xeon(R) CPU E5462 @ 2.80GHz
> SSE is 4.62x faster than stosl
>
> So one should not assume that it's a good idea to always just use rep
> stosd. Use memset(), and have an optimized implementation of memset()
> somewhere else.
> One that can be inlined, and checks the size and branches to the
> optimal implementation, like XNU does it, for example:
>
> http://fxr.watson.org/fxr/source/osfmk/i386/commpage/?v=xnu-1228
>
> Michael
>
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
>
> #define MIN(a,b) ((a)<(b)? (a):(b))
>
> #define DATASIZE (1024*1024)
> #define TIMES 10000
>
> static inline long long
> rdtsc64(void)
> {
>     long long ret;
>     __asm__ volatile("lfence; rdtsc; lfence" : "=A" (ret));
>     return ret;
> }
>
> static inline void
> sse(int *p) {
>     int c_new;
>     char *p_new;
>     asm volatile (
>         "1:                            \n"
>         "movdqa %%xmm0,(%%edi,%%ecx)   \n"
>         "movdqa %%xmm0,16(%%edi,%%ecx) \n"
>         "movdqa %%xmm0,32(%%edi,%%ecx) \n"
>         "movdqa %%xmm0,48(%%edi,%%ecx) \n"
>         "subl $64,%%ecx                \n"
>         "jns 1b                        \n"
>         : "=D"(p_new), "=c"(c_new)
>         : "D"(p), "c"(DATASIZE/sizeof(int))
>     );
> }
>
> static inline void
> stos(int *p) {
>     int c_new;
>     char *p_new;
>     asm volatile (
>         "rep stosl"
>         : "=D"(p_new), "=c"(c_new)
>         : "D"(p), "c"(DATASIZE/sizeof(int)), "a"(1)
>     );
> }
>
> int
> main() {
>     void *data = malloc(DATASIZE);
>     long long t1, t2, t3, m1, m2;
>     int i;
>
>     t1 = rdtsc64();
>
>     for (i = 0; i < TIMES; i++)
>         sse(data);
>
>     t2 = rdtsc64();
>
>     for (i = 0; i < TIMES; i++)
>         stos(data);
>
>     t3 = rdtsc64();
>
>     m1 = t2 - t1;
>     m2 = t3 - t2;
>
>     if (m1 > m2)
>         printf("stosl is %.2fx faster than SSE\n", (float)m1/m2);
>     else
>         printf("SSE is %.2fx faster than stosl\n", (float)m2/m1);
>
>     return 0;
> }
>
> _______________________________________________
> Ros-dev mailing list
> [email protected]
> http://www.reactos.org/mailman/listinfo/ros-dev

Best regards,
Alex Ionescu
