Thanks for pointing this out. I also note that on NEON, SIMD is limited to 64 bit, while mentally I'm focusing on the move to 256 bit this year on x86.
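For reference, here is a minimal portable C sketch of the 16 byte block copy idea discussed in this thread. The names `block16` and `block_mem_copy` are hypothetical, chosen only for illustration; they are not BitC syntax or any library API. It assumes the size is given in bytes and that the buffers do not overlap. Unlike the quoted pseudo-code, it derives the block count as size / 16 and copies through the pointers themselves. Whether each block becomes a single 128 bit load and store is up to the compiler and the target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the `xmm` type in the quoted code:
   one 16 byte block. */
typedef struct { uint64_t lo, hi; } block16;

/* Copies `size` bytes in whole 16 byte blocks and returns the number
   of blocks copied. Assumptions: the buffers do not overlap, and any
   tail shorter than 16 bytes is left untouched. */
size_t block_mem_copy(void *destination, const void *source, size_t size)
{
    unsigned char *dest = destination;
    const unsigned char *src = source;
    size_t blocks = size / sizeof(block16);

    for (size_t c = 0; c < blocks; c++) {
        block16 tmp;
        /* Fixed-size memcpy avoids aliasing problems; a compiler may
           lower each pair to one wide load and one wide store. */
        memcpy(&tmp, src + c * sizeof tmp, sizeof tmp);
        memcpy(dest + c * sizeof tmp, &tmp, sizeof tmp);
    }
    return blocks;
}
```

Called with a 40 byte buffer this copies two blocks (32 bytes) and leaves the last 8 bytes alone, so a caller would still need to handle the tail separately.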
Note that the loop code specifically allows SIMD not to be mixed with the GP registers (and is cleaner and easier than using all the intrinsics), and the copy code is not going to be affected by a 5 cycle stall.

It's also worth noting that the techniques I commented on, using the 128 bit registers more like normal registers to allow more work to be done in 1 cycle, are not really worth it for 64 bit on x86, since you can use the 64 bit GP registers.

You are right about the HW. The use of SIMD registers as GPs would be dependent on HW, as some HW cannot do these GP ops on SIMD registers (eg equality), and hence the compiler would produce an error. In theory (in phase 3) you could just communicate the intent and the compiler would decide, but we are a long way off from this.

I suppose logically we have registers here with limitations, eg for 64 bit you have:

  GP 64 bit reg
  Pure SIMD reg
  64 bit SIMD with GP full width functions

Rather than use different unions, I would think a compiler error is better, forcing intrinsics for appropriate platforms when needed.

Ben

>-----Original Message-----
>From: orthochronous [mailto:[email protected]]
>Sent: Sunday, August 15, 2010 1:51 AM
>To: [email protected]; Discussions about the BitC language
>Subject: Re: [bitc-dev] Bitc and Simd
>
>On Sat, Aug 14, 2010 at 5:06 PM, Ben Kloosterman <[email protected]>
>wrote:
>> Eg
>>
>> for ( xmm i = 0 ; i < loopCount ; i = i + 1)
>>     RunLoopVariableDependentSIMDAlgorithm(i) ;
>>
>> Or this
>>
>> //pointers/data must be 16 byte aligned
>> int blockMemCopy(void *destination, void *source, int32 size)
>> {
>>
>>     xmm *dest = (xmm*)&destination;
>>     xmm *sour = (xmm*)&source;
>>     int c;
>>
>>     for(c=0;c< (size <<2) ;c++)
>>         *dest++ = *sour++;
>>
>>     return c>>2 ;
>> }
>
>Just a quick comment: on ARM chips the NEON unit is deliberately run 5
>cycles behind the main scalar pipeline.
>As such, it is heavily advised
>against using SIMD instructions unless you're actually using the full
>SIMD capabilities (ideally using the main pipeline just to do control
>flow) since otherwise you incur notable penalties sending
>data to and from the unit from the main pipeline. Additionally the
>NEON unit on ARM uses only the L2 cache, requiring explicitly making
>the L1 cache coherent with L2 before accessing any of the data in the
>main part of the CPU:
>
>http://forums.arm.com/lofiversion/index.php?t12665.html
>
>This is a reasonable design for multimedia, where most of the time the
>scalar and SIMD data-sets don't overlap. (I'm interested in ARM as
>well as Intel because both of these chips turn up in smartphones,
>tablets and netbooks.)
>
>Regards,
>David Steven Tweed

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
