On 9 Feb 00, at 15:36, [EMAIL PROTECTED] wrote:

> >The optimization that should probably be done for Athlon is to 
> >organize the code to allow FMUL & FADD to execute in parallel (which the
> >Pentium II/III core just can't manage). This could give a speedup of the
> >order of 40%.
> 
> That would be nice if true, but I suspect it's a bit overoptimistic.
> The reason is this: the Athlon utilizes out-of-order execution, i.e.

Yes, that's why I wrote "could", as opposed to "should". How much 
benefit you get from OOE depends to an enormous extent on how the 
code is organized. If you've just retired registers containing 
temporary results which you need back to work on _right now_ then you 
could be working rather inefficiently.

You've made the point in the past that organizing HLL source code 
"properly" gives the compiler's optimizer a better chance of doing a 
decent job; in the same way, well-organized assembler code gives the 
execution scheduler in the CPU less of a chance to foul things up.
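
To illustrate what "well-organized" can mean here, the following is a 
minimal sketch in C rather than assembler, so the compiler still has 
the final say on scheduling. The function names are hypothetical and 
nothing here is taken from any real program. The first routine 
has a single dependency chain, so each add must wait for the previous 
one; the second keeps two independent accumulators, so a multiply 
feeding one chain can overlap with the add updating the other.

```c
#include <assert.h>
#include <stddef.h>

/* Single dependency chain: every add waits on the previous add. */
double dot_one_chain(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Two independent chains: work on s0 and s1 does not depend on each
   other, which leaves the scheduler free to overlap a multiply for
   one chain with an add for the other. */
double dot_two_chains(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    if (i < n)              /* odd element left over */
        s0 += a[i] * b[i];
    return s0 + s1;
}
```

Note that splitting the sum changes the order of the additions, so 
the two routines can differ in the last bits for general inputs. 
Whether the second form is actually faster depends on the CPU and on 
what the compiler does with it.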
> 
> 1a,b,c) How many floating-point registers does the Athlon have? Are these
> all 80 bits? Are they accessed via the same kind of stack-based model as
> the Pentium?

From the briefing notes I have (which are quite elderly and may not 
correspond with the consumer silicon), so far as x86-compatible FP 
operations are concerned:
a) there are 40 FPU registers but only 8 of them are named. (The 
others are available to hold temporaries etc). This register pool is 
shared with the 3D-Now instruction set.
b) Yes. (In 3D-Now mode they actually contain 128 bits)
c) The 8 named FP registers are logically organized as a stack just 
like the Intel model. (Unchanged since the 8087!)

> 2a,b,c) I believe the Athlon has two floating adders in addition to a
> floating multiplier. Can it dispatch 2 FADDs and 1 FMUL per cycle? Can it
> do 2 double-precision FADDs per cycle, or just do single-precision adds
> in parallel?

a) There are two independent 80-bit FP execution units; both can do 
FADD, but only one can do FMUL.
b) No. You can do 2 FADDs or 1 FADD + 1 FMUL per cycle.
c) I think in 3D-Now mode you can do 4 SP operations in parallel in 
each execution unit instead of one 80-bit operation.
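
As a rough check on what the constraint in (b) implies, here is a 
small sketch in C. It assumes only what is stated above (two units, 
both able to do FADD, one able to do FMUL) and ignores latency and 
data dependencies entirely, so it gives a lower bound on issue cycles 
and not a prediction. The function name is hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Lower bound on cycles needed to issue a mix of FADDs and FMULs when
   each cycle can take either 2 FADDs or 1 FADD + 1 FMUL.
   Assumes no dependency stalls and single-cycle issue. */
unsigned long min_issue_cycles(unsigned long fadds, unsigned long fmuls)
{
    assert(fadds <= ULONG_MAX - fmuls);       /* guard the sum below */
    unsigned long total = fadds + fmuls;
    unsigned long by_width = total / 2 + total % 2;  /* 2 ops per cycle */
    /* Only one unit multiplies, so FMULs alone need fmuls cycles. */
    return fmuls > by_width ? fmuls : by_width;
}
```

For example, 10 FADDs plus 10 FMULs need at least 10 cycles under 
this model, while 20 FMULs with no FADDs need at least 20, since the 
second unit cannot help with multiplies.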


Regards
Brian Beesley
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
