There have been some recent posts about the AMD K7 (Athlon), and
whether properly tuning the Prime95 code for the K7 could result in
better performance than on a similar-clock-rate Pentium. Of course
George W. can comment definitively, but in theory the K7 should be
capable of delivering rather better performance. Here's my reasoning:
1) The K7 has essentially the same ultra-high-performance bus as the
Alpha 21264 (EV6), which I believe is 64 bits wide and transfers data at an
effective 200MHz (100MHz clock, both edges), thus is capable of feeding the
hungry processor with significantly more data;
2) The K7 has 3 functional units in its FPU, compared to just two for
every other high-end microprocessor I am aware of. All high-end processors
have a floating adder and multiplier. I'm not sure what the third unit on
the K7 does, but I suspect it's either a second floating adder, or perhaps
capable of doing a fused multiply-add. In either case, the, um, added add
capability could be exploited to speed an FFT. That's because a typical
higher-radix FFT implementation is dominated by adds rather than multiplies.
For example, a typical complex radix-16 FFT pass in my Mlucas code
takes 16 complex data (32 8-byte floats) and, including multiplies by
"twiddle" factors (FFT sincos data), does 168 FADDs and 88 FMULs on them:
that's nearly twice as many adds as multiplies, i.e. the floating multiplier
is idle nearly half the time. If one could do two adds per cycle, one could
remedy that imbalance and (neglecting the fact that memory traffic is also
responsible for many processor stalls) in theory nearly double the speed.
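
The arithmetic behind that estimate can be sketched in a few lines of Python. This is only an idealized issue-rate bound: it assumes each unit can start one operation per cycle and ignores latencies, dependencies and memory stalls. The function name min_cycles is made up for illustration; the 168/88 counts are the ones quoted above.

```python
import math

# Operation counts quoted above for one complex radix-16 pass.
FADDS = 168
FMULS = 88


def min_cycles(n_add, n_mul, adders=1, multipliers=1):
    """Idealized lower bound on cycles for one pass.

    Assumes every unit starts one operation per cycle, so the busiest
    class of unit sets the pace. Latency, data dependencies and memory
    stalls are ignored.
    """
    add_cycles = math.ceil(n_add / adders)
    mul_cycles = math.ceil(n_mul / multipliers)
    return max(add_cycles, mul_cycles)


# One adder + one multiplier: the adder is the bottleneck.
one_adder = min_cycles(FADDS, FMULS, adders=1)
# Two adders + one multiplier: the multiplier becomes the bottleneck.
two_adders = min_cycles(FADDS, FMULS, adders=2)

# Fraction of cycles the multiplier sits idle in the one-adder case.
mul_idle = 1 - FMULS / one_adder
# Best-case speedup from the second adder.
speedup = one_adder / two_adders

print(one_adder, two_adders, round(mul_idle, 3), round(speedup, 3))
```

Under these assumptions the pass drops from 168 to 88 cycles, with the multiplier idle about 48% of the time in the one-adder case, which is where "nearly half" and "nearly double" come from.
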
I've always thought the cheapest way to improve performance of a transform-
based code (since floating hardware is very expensive) would be to design
a floating adder that could take two data x and y and return both the sum
and difference, x+y and x-y, in one clock. This would be a much cheaper way to
speed an FFT than adding an entire third unit to the FPU, since much of this
pair of operations is the same (for instance, the exponent compare and
mantissa shift are the same for the sum and difference, i.e. need only be
done once).
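
To illustrate why a sum-and-difference unit suits FFTs, here is a rough Python sketch of a twiddle-free complex radix-4 butterfly written entirely in terms of a hypothetical addsub(x, y) operation returning both x+y and x-y. The helper names (addsub, radix4_butterfly, dft) are invented for this example and are not taken from Mlucas or Prime95; the counter just tallies how many such paired operations are issued.

```python
import cmath


def addsub(x, y, counter):
    """Hypothetical fused add-subtract: returns (x + y, x - y) as one op."""
    counter[0] += 1
    return x + y, x - y


def radix4_butterfly(x, counter):
    """Forward radix-4 DFT of four complex inputs, no twiddle factors.

    Every real add is paired with a subtract of the same two operands,
    so the whole butterfly is expressed with addsub calls only.
    """
    x0, x1, x2, x3 = x
    t0r, t1r = addsub(x0.real, x2.real, counter)
    t0i, t1i = addsub(x0.imag, x2.imag, counter)
    t2r, t3r = addsub(x1.real, x3.real, counter)
    t2i, t3i = addsub(x1.imag, x3.imag, counter)
    y0r, y2r = addsub(t0r, t2r, counter)
    y0i, y2i = addsub(t0i, t2i, counter)
    # Multiplying t3 by -i only swaps its parts, so no FMUL is needed:
    # y1 = t1 - i*t3 and y3 = t1 + i*t3.
    y1r, y3r = addsub(t1r, t3i, counter)
    y3i, y1i = addsub(t1i, t3r, counter)
    return [complex(y0r, y0i), complex(y1r, y1i),
            complex(y2r, y2i), complex(y3r, y3i)]


def dft(x):
    """Direct O(n^2) DFT used only as a reference for checking."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]


counter = [0]
data = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 2 + 2j]
result = radix4_butterfly(data, counter)
reference = dft(data)
max_err = max(abs(a - b) for a, b in zip(result, reference))
print(counter[0], max_err)
```

In this sketch the 16 real additions and subtractions of the butterfly collapse into 8 addsub operations. The adds inside twiddle multiplies do not pair up this neatly, so the saving on a full pass would be smaller than a factor of two.
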
One would think that, e.g., a floating adder and a fused multiply-add unit (in
lieu of the pure multiplier) might be capable of giving a similar performance
boost, but in practice it appears difficult or impossible to schedule the
above operations so as to be doable in <= 100 clocks with such a combination.
Of course people don't just run FFTs, which is why fused multiply-add is
preferred by hardware designers over fused add-subtract.
-Ernst