Also, rep movsd will be slower on small counts. On most processors, fewer than about 8 iterations run faster as an explicit move loop than as a rep, because of the rep startup overhead.
This has changed lately:
http://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/51402/reply/34703/
With blocks larger than 512 bytes, SSE/FPU code will always be faster.

On 4-Aug-09, at 9:50 PM, Michael Steil wrote:

> On 4 Aug 2009, at 17:37, Jose Catena wrote:
>>> but how would you want to optimize "rep stosd" anyway?
>>
>> No way. That's what I said, possibly with the exception of using a
>> 64-bit equivalent if we could assume that the CPU is 64-bit capable.
>> But Alex knows better, he is calling me an ignorant. He says that
>>
>> L1:  mov [edi], eax
>>      add edi, 4
>>      dec ecx
>>      jnz L1
>>
>> is faster than
>>
>>      rep stosd
>>
>> Both do exactly the same thing, the latter much smaller AND FASTER in
>> any CPU from the 386 to the i7.
>
> I have done some tests on all generations of Intel CPUs since Yonah,
> and in all cases, rep stosd was faster than any loop I could craft or
> GCC would generate from my C code.
>
> But this does *not* mean that
> * rep stosd is by definition faster than a scalar loop
> * rep stosd is by definition faster than any kind of loop.
>
> Look at the test program at the end of this email. It compares rep
> stosd with a hand-crafted loop written with SSE instructions and SSE
> registers (parts borrowed from XNU).
>
> On all tested machines, the SSE version is significantly faster (for
> big loops):
>
> Yonah: Genuine Intel(R) CPU T2500 @ 2.00GHz
> SSE is 3.34x faster than stosl
>
> Merom: Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz
> SSE is 4.86x faster than stosl
>
> Penryn: Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
> SSE is 4.94x faster than stosl
>
> Nehalem: Intel(R) Xeon(R) CPU E5462 @ 2.80GHz
> SSE is 4.62x faster than stosl
>
> So one should not assume that it's a good idea to always just use rep
> stosd. Use memset(), and have an optimized implementation of memset()
> somewhere else.
> One that can be inlined, and checks the size and branches to the
> optimal implementation, like XNU does it, for example:
>
> http://fxr.watson.org/fxr/source/osfmk/i386/commpage/?v=xnu-1228
>
> Michael
>
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
>
> #define MIN(a,b) ((a)<(b)? (a):(b))
>
> #define DATASIZE (1024*1024)
> #define TIMES 10000
>
> static inline long long
> rdtsc64(void)
> {
>     long long ret;
>     __asm__ volatile("lfence; rdtsc; lfence" : "=A" (ret));
>     return ret;
> }
>
> static inline void
> sse(int *p) {
>     int c_new;
>     char *p_new;
>     asm volatile (
>         "1:                            \n"
>         "movdqa %%xmm0,(%%edi,%%ecx)   \n"
>         "movdqa %%xmm0,16(%%edi,%%ecx) \n"
>         "movdqa %%xmm0,32(%%edi,%%ecx) \n"
>         "movdqa %%xmm0,48(%%edi,%%ecx) \n"
>         "subl $64,%%ecx                \n"
>         "jns 1b                        \n"
>         : "=D"(p_new), "=c"(c_new)
>         : "D"(p), "c"(DATASIZE/sizeof(int))
>     );
> }
>
> static inline void
> stos(int *p) {
>     int c_new;
>     char *p_new;
>     asm volatile (
>         "rep stosl"
>         : "=D"(p_new), "=c"(c_new)
>         : "D"(p), "c"(DATASIZE/sizeof(int)), "a"(1)
>     );
> }
>
> int
> main() {
>     void *data = malloc(DATASIZE);
>     long long t1, t2, t3, m1, m2;
>     int i;
>
>     t1 = rdtsc64();
>
>     for (i = 0; i < TIMES; i++)
>         sse(data);
>
>     t2 = rdtsc64();
>
>     for (i = 0; i < TIMES; i++)
>         stos(data);
>
>     t3 = rdtsc64();
>
>     m1 = t2 - t1;
>     m2 = t3 - t2;
>
>     if (m1 > m2)
>         printf("stosl is %.2fx faster than SSE\n", (float)m1/m2);
>     else
>         printf("SSE is %.2fx faster than stosl\n", (float)m2/m1);
>
>     return 0;
> }
>
> _______________________________________________
> Ros-dev mailing list
> [email protected]
> http://www.reactos.org/mailman/listinfo/ros-dev

Best regards,
Alex Ionescu
