On Sat, Aug 14, 2010 at 5:06 PM, Ben Kloosterman <[email protected]> wrote:
> Eg
>
> for ( xmm i = 0 ; i < loopCount ; i = i + 1)
>     RunLoopVariableDependentSIMDAlgorithm(i) ;
>
> Or this
>
> //pointers/data must be 16 byte aligned
> int blockMemCopy(void *destination, void *source, int32 size)
> {
>
>     xmm *dest = (xmm*)&destination;
>     xmm *sour = (xmm*)&source;
>     int c;
>
>     for(c=0;c< (size <<2) ;c++)
>         *dest++ = *sour++;
>
>     return c>>2 ;
> }

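For concreteness, here is how I read the quoted blockMemCopy: copy `size` bytes in 16-byte blocks. This is only a minimal portable C sketch of that reading, not the original author's code. The `block16` type is a hypothetical stand-in for the quoted `xmm` type, and returning the number of blocks copied is my assumption. I have also assumed that taking the address of the pointer parameters (`&destination`) and shifting `size` left rather than dividing it down were unintended.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the quoted `xmm` type: one 16-byte block. */
typedef struct { uint8_t bytes[16]; } block16;

/* Copies size bytes in 16-byte blocks and returns the number of blocks
   copied (an assumed return convention). Assumes size is a multiple of 16
   and that the buffers do not overlap; bytes past the last full block are
   left untouched. */
int32_t blockMemCopy(void *destination, const void *source, int32_t size)
{
    block16 *dest = (block16 *)destination;        /* the pointer itself, not its address */
    const block16 *sour = (const block16 *)source;
    int32_t blocks = size / 16;                    /* number of whole 16-byte blocks */
    int32_t c;

    for (c = 0; c < blocks; c++)
        *dest++ = *sour++;                         /* one 16-byte struct assignment per step */

    return c;
}
```

Whether the compiler turns each block assignment into a single vector load and store depends on the target and the optimisation flags; the sketch only pins down the intended behaviour.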
Just a quick comment: on ARM chips the NEON unit is deliberately run 5 cycles behind the main scalar pipeline. As such, using SIMD instructions is strongly discouraged unless you're actually using the full SIMD capabilities (ideally using the main pipeline just to do control flow), since otherwise you incur notable penalties moving data between the unit and the main pipeline.

Additionally, the NEON unit on ARM uses only the L2 cache, so the L1 cache must be explicitly made coherent with L2 before any of the data is accessed from the main part of the CPU:

http://forums.arm.com/lofiversion/index.php?t12665.html

This is a reasonable design for multimedia, where most of the time the scalar and SIMD data sets don't overlap.

(I'm interested in ARM as well as Intel because both of these chips turn up in smartphones, tablets and netbooks.)

Regards,
David Steven Tweed
