> -----Original Message-----
> From: Aaron Blosser [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, June 16, 1999 4:53 PM
> To: Mersenne@Base. Com
> Subject: Mersenne: Thoughts on Merced / IA-64
>
>
> I was just perusing the IA-64 docs that came out last month...I came up
> with a few thoughts on how it would be a GREAT mersenne prime CPU:
>
> - 128 FPU registers (126 usable)
> 96 of them are rotating (not stacked) which I imagine could be used to
> the code's advantage quite well, holding more data in registers during
> the FFT
>
Eh, it would only really help if you wanted to unroll quite a few loops... I
think that, as can be seen from the RISC processors out there, it really
doesn't help a _whole_ lot as far as the FFT goes. I suppose you could move
to a radix-8, but that's about the extent of it. Would going past radix-8
help a whole lot?
> - 82bit FPU (??)
> One document mentioned 82 bits for the FPU and registers. I imagine
> this would help with round-off problems vs. the 80 bit FPU core. The
> IA-32 processors had 80 bits, right? The 82 bits are: 64 bit
> significand, 17 bit exponent, 1 bit sign. The IEEE double extended only
> specifies 80, but there we are with 82.
>
> - Memory "speculation"
> Preload code and/or data...while the FPU is churning away, preload more
> data into L2/L1 cache so it's in the high-speed memory by the time it's
> needed (data prefetch/lfetch). That will REALLY help on these large FFT
> datasets!
>
Is this limited to MMX/SIMD data only? The 3DNow (and KNI?) instruction sets
have prefetch for their SIMD opcodes, but those of course are single
precision and really kinda useless. :P
> - Faster FPU
> On top of all this, the FPU core is supposedly redesigned to do more
> per clock cycle. Some of the "enhancements" I spotted were: having 4 FP
> multiplier accumulators (single precision), the fused multiply-add
> instruction enhancements, load-pair instruction to load 2 FPU registers
> simultaneously, etc.
>
> - 64 bit integer ops
> Integer unit with 64 bits...need I say more?
>
Doesn't really help a whole lot. Honest. :) Mainly cuz integer and single
precision operands are hardly ever used in the FFT code.
Well, maybe the fused multiply-add instructions would. But I haven't looked
to see exactly what they are...
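
For readers wondering what a fused multiply-add buys: it computes a*b + c with one rounding at the end instead of rounding the product first. The sketch below only emulates that behaviour in Python with exact rational arithmetic. The helper name fused_multiply_add and the test values are made up for illustration, and nothing here is IA-64 code.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate a*b + c with a single rounding at the very end."""
    # Fraction holds each double exactly, so the product and the sum
    # are computed without any intermediate rounding.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    # Converting back to float performs the one and only rounding.
    return float(exact)


# Illustrative values: the exact product is 1 - 2**-60, which needs more
# significand bits than a double has.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

separate = a * b + c                 # product is rounded, then the sum
fused = fused_multiply_add(a, b, c)  # only the final result is rounded

print(separate, fused)
```

With separate operations the product rounds to 1.0 and the small term is lost, while the fused form keeps it. Whether that matters for the FFT round-off budget is a separate question that this sketch does not answer.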
> - 128 64 bit general purpose registers
>
> - 64 one bit predicate registers
> Separate registers to control the conditionals branching/execution
>
> - 8 64 bit branch registers
> Finally some more registers to hold branch address locations
>
> - 128 "application registers"
> Don't know about these...some are earmarked "for future use". Hrmm...
>
> - Bunch of fun parallel arithmetic instructions
> Probably useful for large numbers...
>
Whatever that means...
>
> Anyway, that's just skimming the surface.
>
> I figure with 126 usable 82 bit FP registers, you can have A LOT of
> stuff done in the registers alone, speeding up stuff greatly and really
> trimming down on worrying about rounding errors once it comes out of
> the register. Prefetching data into the cache from main memory will
> also help quite a bit. The FPU instruction set has a few new goodies
> that I foresee could help out with FFT algorithms.
>
> Not being really on top of how the FFT code really works, I'll leave it
> to others to figure out how best this would all help George's code. And
> George...I hope you'll work on a nice IA-64 native program to use all
> this cool new stuff once it's available. Using all the EPIC "hints" in
> your assembly code might be tricky at first, but I think the payoff
> would be significant.
>
The EPIC hints are probably the biggest benefit. It's hard to convince the
CPU to take the right branches, and O-O-O execution can really mess up the
pipeline. (It's hard to tell what the heck the P6 is doing, dangit!)
> Aaron
>
I'm a bit curious about the K7, just from the minimal specs I've looked at.
Might be able to squeak a few % more out of it than a similarly clocked PIII.
What has me wondering is the 3DNow instructions they added for DSP work.
I'm sure they're single precision. But it seems kinda wacky they added 'em
in the first place.
Now, if Intel decided to put in some extra silicon and support double
precision FP ops in the SIMD instruction set (the registers support it, the
silicon doesn't), then you'd be able to get double the throughput in the FFT
code, plus I think the latency goes down (from 2 cycles to 1?) for multiplies.
Has me thinking a bit about an NTT algorithm for doing the FFTs with integers
instead of doubles and using MMX instructions to speed it up...
But then again, I'm working on a totally different algorithm right now
anyway that _should_ be fast. But then again, I'm probably forgetting
something, so until I work out some of the details on paper, I'll leave that
one in hiding. ;)
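
To make the NTT idea above concrete, here is a minimal sketch of an integer-only transform used to multiply two numbers exactly. It is not the algorithm hinted at in this message, and it has nothing to do with MMX. The modulus 998244353, the root 3, base-10 digits, and the names ntt and multiply are all assumptions chosen for illustration.

```python
# Sketch: exact integer multiplication through a number-theoretic
# transform (NTT). MOD and ROOT are illustrative, not from the thread.
MOD = 998244353   # prime equal to 119 * 2**23 + 1
ROOT = 3          # a primitive root modulo MOD


def ntt(values, invert=False):
    """Iterative radix-2 transform over the integers modulo MOD."""
    a = list(values)
    n = len(a)    # must be a power of two
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes; every operation is exact modular arithmetic.
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length *= 2
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a


def multiply(x, y, base=10):
    """Multiply two non-negative integers via an NTT of their digits."""
    def digits(v):
        out = []
        while v:
            v, d = divmod(v, base)
            out.append(d)
        return out or [0]

    dx, dy = digits(x), digits(y)
    n = 1
    while n < len(dx) + len(dy):
        n *= 2
    fx = ntt(dx + [0] * (n - len(dx)))
    fy = ntt(dy + [0] * (n - len(dy)))
    prod = ntt([p * q % MOD for p, q in zip(fx, fy)], invert=True)
    # Recombining the coefficients lets Python integers absorb the carries.
    return sum(c * base ** i for i, c in enumerate(prod))


m = 2 ** 61 - 1
print(multiply(m, m) == m * m)
```

There is no round-off to worry about, but every convolution coefficient has to stay below MOD, which limits how long the inputs can be for a given digit size. That limit is the integer counterpart of the precision budget in the floating-point FFT.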
-Jeremy
________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm