On 22 Aug 99, at 21:30, Aaron Blosser wrote:

> But as we can see from the real world benchmarks, even though the Athlon
> still does better, clock for clock, than a PIII, the difference isn't as
> great (only ~107% faster on Winbench 99 FPU Winmark).  The benchmark was
> done with a PIII Xeon with 512KB L2 cache.

This is _still_ remarkable, since the "consumer" Athlons starting to 
trickle onto the market have 64-bit 100 MHz FSB and 512KB L2 cache, 
like PII / PIII / Xeon, but run their L2 cache at only 1/3 clock 
speed (cf. full clock speed for Xeon & 1/2 clock speed for PII/PIII).

The high-performance Athlons with 128-bit FSB @ 200 MHz & larger, 
relatively faster L2 cache should be really impressive. Maybe 
starting to approach what Alpha has been doing for a while ;-) (The 
critical difference here is that Athlon does run native IA32 code!)

> So, the question I put out is, can the algorithm be tweaked in any way so
> that fadds and fmuls can be done together, or is it, as George initially
> suspects, worthless?

Hmm. Is it not possible to use two "threads" (NOT in the OS sense - I 
mean working on two streams of data, interleaved) & arrange them so 
that the FMULs for stream 1 happen when stream 2 is doing FADDs on 
independent data, & vice versa? I think you should be able to keep 
the pipes pretty full ... though whether you can do a _lot_ better 
than a decent compiler depends on how smart the instruction prefetch 
/ decode / scheduling in the processor core is. You probably _won't_ 
be able to achieve 100% utilization of all the pipelines all the time 
- I've a sneaky suspicion that George's Pentium code drives the 
memory bus quite hard already; feeding a faster, more efficient 
processor from the same bus bandwidth is going to make it more likely 
that bus saturation, rather than processor throughput capability, 
will limit the ultimate performance of the system as a whole.
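
To make the interleaving idea concrete, here is a rough toy model in Python. 
It is only a sketch under loud assumptions: one adder and one multiplier, 
each able to start one operation per clock, a latency of 4 clocks for both 
(only the FMUL figure comes from this thread; the FADD figure is assumed), 
and no memory bus or decoder limits at all. The function name 
cycles_needed is made up for illustration. Each stream is a dependent 
chain alternating FMUL and FADD, with odd-numbered streams starting on the 
FADD so the two streams use opposite units.

```python
def cycles_needed(num_streams, ops_per_stream, latency=4):
    """Clocks until every stream's last result is available (toy model).

    Assumptions (not from any real CPU): one FMUL unit and one FADD unit,
    each can start one op per clock; every op has the same latency; each
    op in a stream depends on the previous op of that same stream.
    """
    next_op = [0] * num_streams    # index of the next op to issue, per stream
    ready_at = [0] * num_streams   # clock at which the stream's last result is ready
    cycle = 0
    finish = 0
    while any(n < ops_per_stream for n in next_op):
        busy = set()               # units that have already started an op this clock
        for s in range(num_streams):
            if next_op[s] >= ops_per_stream or ready_at[s] > cycle:
                continue           # stream finished, or still waiting on its operand
            # Even streams go FMUL, FADD, ...; odd streams go FADD, FMUL, ...
            unit = "fmul" if (next_op[s] + s) % 2 == 0 else "fadd"
            if unit in busy:
                continue           # unit already taken this clock
            busy.add(unit)
            next_op[s] += 1
            ready_at[s] = cycle + latency
            finish = max(finish, ready_at[s])
        cycle += 1
    return finish


one = cycles_needed(1, 8)   # 8 dependent ops in a single stream
two = cycles_needed(2, 8)   # 16 ops in total, split over two interleaved streams
print(one, two)
```

In this model a single stream of 8 dependent operations takes 32 clocks, 
and two interleaved streams finish 16 operations in the same 32 clocks, 
because one stream's FMUL overlaps the other's FADD. Real hardware will 
do worse than this once the bus becomes the bottleneck.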
> 
> If you were pipelining fmuls, the Athlon could spit them out in 1 clock
> cycle (after the 4 cycle latency), compared to 2 cycles (after the 5 cycle
> latency) on the PIII, so it would be REAL important to get lots of yummy
> pipelined fmuls to the Athlon to really let it strut its stuff.

Am I missing something here? I thought that the _throughput_ was 1 
FMUL per clock, but there was a 4 clock period between the 
instruction entering the execution unit (meaning that it has already 
had to be prefetched, decoded and the operands made available) and 
the result of the operation becoming available. So, provided there is 
no delay in fetching instructions, there is capacity in the decoders 
and there is no stall due to operands being unavailable, you _should_ 
get a throughput of 1 FMUL per clock (assuming that you aren't also 
scheduling other instructions which block the multiplier execution 
unit, or use its pipeline).
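
The latency vs. throughput arithmetic can be written down in a couple of 
lines of Python. This is a sketch of the idealized case only: it assumes 
the FMULs are all independent and that fetch, decode and operand delivery 
never stall. The figures used are the ones quoted above (4 clock latency, 
1 per clock for Athlon; 5 clock latency, 1 per 2 clocks for PIII); the 
function name pipelined_cycles is hypothetical.

```python
def pipelined_cycles(n_ops, latency, issue_interval):
    """Clocks until the last of n independent ops delivers its result.

    The first result appears after `latency` clocks; each further op
    enters the pipeline `issue_interval` clocks after the previous one.
    """
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * issue_interval


athlon = pipelined_cycles(100, latency=4, issue_interval=1)
piii = pipelined_cycles(100, latency=5, issue_interval=2)
print(athlon, piii)   # 103 203
```

So for a long run of independent FMULs the latency is paid once, and the 
issue interval dominates: 103 clocks against 203 for 100 operations.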

Regards
Brian Beesley
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
