Hi George,
> I'm trying to optimize prime95 for the Pentium
> Pro/PII/PIII architecture. I'm fairly well
> versed in various execution units and
> latencies, but some mysteries remain.
In case you haven't run across it yet, you can download the "Intel
Architecture Optimizations Manual" from this web page:
http://developer.intel.com/design/pro/MANUALS/242816.htm
It comes as an Acrobat PDF file and includes a good deal of helpful
information for the Pro/PII/PIII, including "Chapter 5: Optimization
Techniques for Floating Point Applications", which may be of
particular assistance.
For general coding on the Pro/PII/PIII, the three most important
optimizations seem to be:
1) Helping the branch prediction algorithm to guess better. This can
involve reducing the number of branches or using new instructions such
as CMOV to eliminate some of them altogether.
2) Avoiding partial register stalls. A partial register stall occurs
when you write to an 8- or 16-bit register and then read its 32-bit
equivalent (e.g. MOV AX, 1 followed by ADD ECX, EAX).
3) Aligning data structures on 32-byte boundaries. According to the
docs, a misaligned read costs 3 cycles on a Pentium but 6 to 9 cycles
on the Pro, II and III (go figure).
The optimization guide is packed full of tips. It's about 150 pages
in total, although half of it is a reference guide.
-Don Leclair
________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm