> 2) The K7 has 3 functional units in its FPU, compared to just two for
> every other high-end microprocessor I am aware of. All high-end processors
> have a floating adder and multiplier. I'm not sure what the third unit on
> the K7 does, but I suspect it's either a second floating adder, or perhaps
> capable of doing a fused multiply-add. In either case, the, um, added add
> capability could be exploited to speed an FFT. That's because a typical
> higher-radix FFT implementation is dominated by adds rather than
> multiplies.
I had discussed the Athlon briefly with George, and his understanding was
that the Athlon could do an fadd and an fmul at the same time in its 2 FPU
units. According to him, this is not that useful unless you work really
hard to make sure both units are constantly fed with work.
I've confirmed this architecture on the Athlon (visit
http://www.amd.com/products/cpg/athlon/pdf/fpu_wp.pdf for a nice overview of
the Athlon FPU).
Sure enough, there are 3 FPU units: one that does fadd (and 3DNow!), one
that does fmul (and 3DNow!), and one that does fstore (loads/stores/etc).
I wonder if those benchmarks, like specfp, run in such a way that they do
mixes of fadds and fmuls, keeping all 3 pipes constantly busy. If so (and I
bet they do), no wonder the Athlon kicks butt on the FP benchmarks (~137%
faster for an Athlon 550 vs. a PIII-550 on specfp_base95).
But as we can see from the real world benchmarks, even though the Athlon
still does better, clock for clock, than a PIII, the difference isn't as
great (only ~107% faster on Winbench 99 FPU Winmark). The benchmark was
done with a PIII Xeon with 512KB L2 cache. I wonder if having a Xeon with a
2MB cache (or even 1MB) would have made this any closer since the Winbench
test is a lot larger than the specfp tests, and a larger cache would likely
have more impact. Hmmm...
So, the question I put out is: can the algorithm be tweaked in any way so
that fadds and fmuls can be done together, or is the simultaneous
fadd/fmul capability, as George initially suspects, worthless?
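
To make the question concrete, here is a rough sketch of a single radix-2
butterfly with the real arithmetic spelled out. It is only an illustration,
not the code George's program actually uses, and the name butterfly is
made up for this example. The point is that the 4 multiplies don't depend
on each other, so a scheduler could, in principle, overlap them with adds
from a neighboring butterfly.

```python
def butterfly(ar, ai, br, bi, wr, wi):
    """One radix-2 butterfly on a = ar + i*ai and b = br + i*bi
    with twiddle factor w = wr + i*wi.

    Returns the real/imaginary parts of (a + w*b) and (a - w*b).
    Hypothetical helper for illustration only.
    """
    # 4 independent multiplies: candidates for the fmul pipe
    m1 = br * wr
    m2 = bi * wi
    m3 = br * wi
    m4 = bi * wr
    # 2 adds/subtracts to finish the complex product w*b
    tr = m1 - m2
    ti = m3 + m4
    # 4 more adds/subtracts for the two outputs: candidates for the fadd pipe
    return (ar + tr, ai + ti, ar - tr, ai - ti)


# Operation counts for this particular formulation
FMULS_PER_BUTTERFLY = 4
FADDS_PER_BUTTERFLY = 6
```

With 6 adds to 4 multiplies in this formulation, the adder would be the
busier unit even if every fmul were paired with an fadd.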
Having a pipe just for fstore is probably where we'd see most of the
performance benefit. According to the AMD-supplied chart, the execution
time of fmul in cycles is lower on the Athlon (4=Athlon, 5=PIII), but fadd
takes longer (4=Athlon, 3=PIII). If you wanted to do extended double fdivs,
the Athlon would definitely kick butt (24=Athlon, 37=PIII), and you see the
same thing with fsqrt (extended) as another example.
If you were pipelining fmuls, the Athlon could spit one out every clock
cycle (after the 4 cycle latency), compared to one every 2 cycles (after
the 5 cycle latency) on the PIII, so it would be REAL important to get lots
of yummy pipelined fmuls to the Athlon to really let it strut its stuff.
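
As a back-of-the-envelope check on those numbers, here is a small sketch
of the cycle count for a run of pipelined fmuls. It assumes the fmuls are
all independent and nothing else stalls the pipe, which real code won't
quite manage. The function and constant names are mine, and the
latency/throughput figures are just the ones quoted above.

```python
def pipelined_cycles(n_ops, latency, issue_interval):
    """Cycles to finish n_ops independent ops through one pipelined unit.

    The first result appears after `latency` cycles; each later result
    follows `issue_interval` cycles after the previous one.
    """
    if n_ops <= 0:
        return 0
    return latency + (n_ops - 1) * issue_interval


# Figures quoted in this message (assumed, not measured here)
ATHLON_FMUL = {"latency": 4, "issue_interval": 1}
PIII_FMUL = {"latency": 5, "issue_interval": 2}

athlon = pipelined_cycles(100, **ATHLON_FMUL)
piii = pipelined_cycles(100, **PIII_FMUL)
print("100 fmuls: Athlon", athlon, "cycles, PIII", piii, "cycles")
```

For long runs the latency stops mattering and the ratio heads toward 2:1,
which is why keeping the fmul pipe full matters more than the 4 vs. 5
cycle latency.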
Aaron