The current generation of Intel/AMD hardware supports the SSE3 extensions, which build on SSE2 and, before that, MMX.
To see what a huge advance these are, you have to know what they replace. Remember the 8087 numeric coprocessor? It sat off to the side of the 8086 and snooped the bus, catching data as it came along and doing numeric operations when told to. To minimize the number of signals needed between the 8086 and the 8087, the 8087 used a stack architecture: 8 internal registers configured as a loop-around stack, with the top of stack as the preferred operand. Well, that's what the Pentium has to this very day, if SSE is not used. The interface sucks, mostly because the stack organization makes it pretty much impossible for a compiler to assign variables to registers for any but the shortest-term uses. Heck, even with hand-written code it's tough to use the registers efficiently. On top of that, it's a one-operation-at-a-time instruction set.

SSE3 is better because it has (1) a real register file that you can assign variables to, and (2) a 128-bit interface and two floating-point units.

How much you will benefit from SSE3 depends on the extent to which your operands are in cache. If your operands have to be fetched from uncached memory, the 8087 instruction set, slow as it is, will almost keep up with the data (the Pentium takes heroic measures to keep data moving, including prefetch on up to four separate vector operands, once it sniffs out that you are working on vectors). But if your operands are cached, or if by clever coding you can keep an operand in cache, SSE3 will outperform the 8087: my pencil calculations indicate by up to a factor of 4x, though I haven't verified this on real code.

So I think that turning on SSE3 instructions would give a modest performance improvement on big arrays, and a bigger improvement on small operands that are usually hanging around in cache: 2x or better on the numeric part of short vector + short vector, which translates into a smaller overall improvement once overheads are taken into account. I think Jsoftware has taken the view so far that this level of improvement is not worth the hassle of supporting the extra builds.
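To make the register-file point concrete, here is a rough sketch in C with SSE2 intrinsics -- illustrative only, nothing like this is in the J engine, and it assumes 16-byte-aligned operands and an even length -- adding two double-precision vectors two lanes per instruction, with the operands sitting in named XMM registers instead of being shuffled through the x87 stack:

    #include <stddef.h>
    #include <emmintrin.h>   /* SSE2: __m128d, _mm_add_pd, ... */

    /* z = x + y elementwise. Assumes x, y, z are 16-byte aligned
       and n is a multiple of 2. */
    void vadd(double *z, const double *x, const double *y, size_t n)
    {
        for (size_t i = 0; i < n; i += 2) {
            __m128d a = _mm_load_pd(&x[i]);          /* 2 doubles -> one XMM register */
            __m128d b = _mm_load_pd(&y[i]);
            _mm_store_pd(&z[i], _mm_add_pd(a, b));   /* both lanes added in one instruction */
        }
    }

The compiler can keep a, b, and any accumulators in whichever XMM registers it likes; the x87 equivalent has to route every operand through the stack top, one operation at a time.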
Certain operations, notably matrix multiply, could get a bigger improvement factor than 2x, by using the SSE3 instructions in hand-tweaked code that maximizes cache usage.
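As a rough sketch of what "maximizes cache usage" means -- again illustrative only, not Jsoftware code, with BLK a tuning parameter you'd pick so three tiles fit in cache -- block the multiply so small tiles of the operands stay resident while the inner loop (the part you would hand-tweak with SSE3 instructions) runs over contiguous data:

    #include <stddef.h>

    enum { BLK = 64 };  /* tile size: pick so three BLK x BLK tiles fit in cache */

    /* C += A*B for n x n row-major matrices of doubles. */
    void matmul_blocked(double *c, const double *a, const double *b, size_t n)
    {
        for (size_t ii = 0; ii < n; ii += BLK)
         for (size_t kk = 0; kk < n; kk += BLK)
          for (size_t jj = 0; jj < n; jj += BLK)
           for (size_t i = ii; i < ii + BLK && i < n; i++)
            for (size_t k = kk; k < kk + BLK && k < n; k++) {
                double aik = a[i*n + k];               /* reused across the whole j loop */
                for (size_t j = jj; j < jj + BLK && j < n; j++)
                    c[i*n + j] += aik * b[k*n + j];    /* contiguous access: SIMD-friendly */
            }
    }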
Henry Rich

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Miller, Raul D
> Sent: Monday, October 30, 2006 11:24 AM
> To: General forum
> Subject: RE: [Jgeneral] J 6.01 on Intel Mac
>
> Alistair Tucker wrote:
> > I have read that Core2's specialised vector processor (SSE)
> > runs at fully twice the speed of the Core's. Presumably that
> > means that an array-centric application like J will run twice
> > as fast?
>
> Unlikely.
>
> SSE's optimizations work around bandwidth limitations in the CPU
> (such as would hit you in the context of large arrays) in a
> fashion which is most useful when dealing with small fixed-size
> arrays.
>
> SSE would probably be useful in the context of some of J's
> "special code" -- special-case algorithms which take advantage
> of the restrictions implied by certain sequences of operations.
> But I doubt very much that SSE would be much use for J's core
> operations.
>
> On top of that, SSE does not work on all PCs. This means that
> if SSE were used, we'd either see a different executable which
> supports SSE, or J would be larger (as it would need to incorporate
> an additional non-SSE implementation of every routine which
> supports SSE).
>
> Finally, if the ISI folks had gone to the effort of providing
> "special SSE" code, I think it would be documented in the
> release notes.
>
> --
> Raul