On 22 Aug 99, at 21:30, Aaron Blosser wrote:
> But as we can see from the real world benchmarks, even though the Athlon
> still does better, clock for clock, than a PIII, the difference isn't as
> great (only ~107% faster on Winbench 99 FPU Winmark). The benchmark was
> done with a PIII Xeon with 512KB L2 cache.
This is _still_ remarkable, since the "consumer" Athlons starting to
trickle onto the market have 64-bit 100 MHz FSB and 512KB L2 cache,
like PII / PIII / Xeon, but run their L2 cache at only 1/3 clock
speed (cf. full clock speed for Xeon & 1/2 clock speed for PII/PIII).
The high performance Athlons with 128 bit FSB @ 200 MHz & larger,
relatively faster L2 cache should be really impressive. Maybe
starting to approach what Alpha has been doing for a while ;-) (The
critical difference here is that Athlon does run native IA32 code!)
> So, the question I put out is, can the algorithm be tweaked in any way so
> that fadds and fmuls can be done together, or is it, as George initially
> suspects, worthless?
Hmm. Is it not possible to use two "threads" (NOT in the OS sense - I
mean working on two streams of data, interleaved) & arrange them so
that the FMULs for stream 1 happen when stream 2 is doing FADDs on
independent data, & vice versa? I think you should be able to keep
the pipes pretty full ... though whether you can do a _lot_ better
than a decent compiler depends on how smart the instruction prefetch
/ decode / scheduling in the processor core is. You probably _won't_
be able to achieve 100% utilization of all the pipelines all the time
- I've a sneaky suspicion that George's Pentium code drives the
memory bus quite hard already; feeding a faster, more efficient
processor from the same bus bandwidth is going to make it more likely
that bus saturation, rather than processor throughput capability,
will limit the ultimate performance of the system as a whole.
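
To make the interleaving idea concrete, here is a rough sketch in C
rather than x87 assembler. It is NOT George's code: the function name
two_stream_mac and the multiply-accumulate workload are made up purely
for illustration. The point is only the ordering, where each multiply
for one stream sits next to an add for the other stream that does not
depend on it.

```c
#include <stddef.h>

/* Hypothetical helper (not from any real program): multiply-accumulate
 * over two independent streams, with the multiplies of one stream
 * placed next to the adds of the other. */
void two_stream_mac(const double *a1, const double *b1,
                    const double *a2, const double *b2,
                    size_t n, double *out1, double *out2)
{
    double acc1 = 0.0, acc2 = 0.0;
    double p1 = (n > 0) ? a1[0] * b1[0] : 0.0;  /* prime stream 1 */
    size_t i;

    for (i = 0; i < n; i++) {
        double p2 = a2[i] * b2[i];   /* multiply, stream 2 */
        acc1 += p1;                  /* add, stream 1 (independent of p2) */
        /* multiply, stream 1, for the next iteration */
        p1 = (i + 1 < n) ? a1[i + 1] * b1[i + 1] : 0.0;
        acc2 += p2;                  /* add, stream 2 (independent of p1) */
    }
    *out1 = acc1;
    *out2 = acc2;
}
```

A compiler or an out-of-order core is free to reorder these
operations, so the source order is only a hint; in hand-written
assembler the ordering would be explicit.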
>
> If you were pipelining fmuls, the Athlon could spit them out in 1 clock
> cycle (after the 4 cycle latency), compared to 2 cycles (after the 5 cycle
> latency) on the PIII, so it would be REAL important to get lots of yummy
> pipelined fmuls to the Athlon to really let it strut it's stuff.
Am I missing something here? I thought that the _throughput_ was 1
FMUL per clock, but there was a 4 clock period between the
instruction entering the execution unit (meaning that it has already
had to be prefetched, decoded and the operands made available) and
the result of the operation becoming available. So, provided there is
no delay in fetching instructions, there is capacity in the decoders
and there is no stall due to operands being unavailable, you _should_
get a throughput of 1 FMUL per clock (assuming that you aren't also
scheduling other instructions which block the multiplier execution
unit, or use its pipeline).
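
The latency vs. throughput distinction can be put as a small
back-of-envelope model. This is an idealised sketch that assumes every
operation is independent and nothing ever stalls; the function name
pipelined_cycles is hypothetical, and the latency / issue-interval
figures fed to it are simply the ones quoted above, not measurements.

```c
/* Idealised cycle count for n independent, back-to-back operations on
 * a pipelined unit: the first result appears after `latency` clocks,
 * and each later result follows `interval` clocks after the previous
 * one. Assumes no decode, operand or bus stalls. */
unsigned long pipelined_cycles(unsigned long n, unsigned long latency,
                               unsigned long interval)
{
    if (n == 0)
        return 0;
    return latency + (n - 1) * interval;
}
```

With the quoted figures, 100 FMULs come to pipelined_cycles(100, 4, 1)
= 103 clocks for the Athlon against pipelined_cycles(100, 5, 2) = 203
clocks for the PIII, so for long runs the issue interval matters far
more than the latency.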
Regards
Brian Beesley
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers