RE: Mersenne: Pentium Pro Optimization Help Needed

1999-06-11 Thread Blosser, Jeremy

Or it could be a combination of decoding and code-alignment problems
(sub-optimal decode cycles), which cause goofy patterns in loops and such. I
suggest running it through VTune and seeing what comes up there...

There's a good doc at http://www.agner.org/assem/pentopt.htm which
explains all this stuff better than I could ever hope to.


-----Original Message-----
From: Blosser, Jeremy [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 11, 1999 10:06 AM
To: '[EMAIL PROTECTED]'
Subject: RE: Mersenne: Pentium Pro Optimization Help Needed


Umm... decoding optimization (4-1-1 rule)
For example in four_complex_cpm_fft_3:
;;1-1-1
        fld     R6                      ;; I2,I3,A2,r/i,A4,I1,R3,R1
        fmul    st(3), st               ;; B2 = I2 * r/i
                                        ;23-27
        fsubp   st(2), st               ;; A2 = A2 - I2
                                        ;24-26
;;1 (D1, D2 stall)
        fld     R8                      ;; I4,I3,A2,B2,A4,I1,R3,R1
;;2-1 (D2 stall)
        fmul    QWORD PTR [edi+24]      ;; B4 = I4 * r/i
                                        ;25-29
        fxch    st(4)                   ;; A4,I3,A2,B2,B4,I1,R3,R1
;;2-1 (D2 stall)
        fsub    R8                      ;; A4 = A4 - I4
                                        ;26-28
        fxch    st(2)                   ;; A2,I3,A4,B2,B4,I1,R3,R1

Plus all the stores are decoded in separate cycles (2 uOps each)
I'm sure someone else will correct my mistakes ;)
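
A minimal C sketch of the 4-1-1 rule may help when reading the annotations above. It rests on a simplified assumption: decoder D0 accepts an instruction of up to 4 uops, while D1 and D2 accept only single-uop instructions, and fetch-block and instruction-length limits are ignored. The function name count_decode_cycles and the uop counts passed to it are illustrative, not taken from prime95 or from any Intel tool.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of the P6 "4-1-1" decode rule.
   uops[i] is the number of uops instruction i decodes into.
   Returns the number of decode cycles the sequence needs. */
unsigned count_decode_cycles(const unsigned *uops, size_t n)
{
    unsigned cycles = 0;
    size_t i = 0;

    while (i < n) {
        /* Instructions over 4 uops go through microcode; not modeled. */
        assert(uops[i] >= 1 && uops[i] <= 4);
        i++;                            /* D0 takes the next instruction */

        /* D1 and D2 each take one following single-uop instruction;
           the group ends at the first multi-uop instruction. */
        for (int slot = 0; slot < 2 && i < n && uops[i] == 1; slot++)
            i++;

        cycles++;
    }
    return cycles;
}
```

If the eight instructions in the fragment above are taken as 1,1,1,1,2,1,2,1 uops (assuming an arithmetic instruction with a memory operand is 2 uops and everything else is 1), this model gives 4 decode cycles, which lines up with the 1-1-1, 1, 2-1, 2-1 annotations.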

I'm sure you checked cache alignments... I can't think of anything else
offhand...

Also, I noticed that no attention was paid to K6 optimization (i.e.
tossing the fxch's) in the current code... Is there any effort to improve
that, or is it not worth it?




Unsubscribe  list info -- http://www.scruz.net/~luke/signup.htm




RE: Mersenne: Pentium Pro Optimization Help Needed

1999-06-10 Thread Don Leclair


Hi George,

> I'm trying to optimize prime95 for the Pentium
> Pro/PII/PIII architecture.  I'm fairly well
> versed in various execution units and
> latencies, but some mysteries remain.

In case you haven't run across it yet, you can download the "Intel
Architecture Optimizations Manual" from this web page:

http://developer.intel.com/design/pro/MANUALS/242816.htm

It comes in the form of an Acrobat PDF file and includes a good deal
of helpful information for the Pro/PII/PIII including "Chapter 5
Optimization Techniques for Floating Point Applications" which may be
of particular assistance.

For general coding on the Pro/PII/PIII, the three most important
optimizations seem to be:

1) Helping the branch prediction algorithm to guess better.  This can
involve reducing the number of branches or using new instructions such
as CMOV to eliminate some of them altogether.

2) Avoiding partial register stalls.  Partial register stalls occur
when you write to an 8- or 16-bit register and then read from the
32-bit equivalent (e.g. MOV AX, 1;  ADD ECX, EAX).

3) Aligning data structures on 32-byte boundaries.  According to the
docs, a misaligned read on a Pentium costs 3 cycles but costs 6 to 9
on the Pro, II and III (go figure).
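
For point 3, a minimal C sketch of getting a buffer onto a 32-byte boundary is shown here. It assumes a hosted C11 environment that provides aligned_alloc; the helper names is_aligned32 and alloc_doubles_aligned32 are made up for illustration and are not part of prime95.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 32u   /* assumed: the 32-byte boundary discussed above */

/* Returns nonzero if p lies on a 32-byte boundary. */
int is_aligned32(const void *p)
{
    return ((uintptr_t)p % ALIGNMENT) == 0;
}

/* Hypothetical helper: allocates room for n doubles on a 32-byte
   boundary. The byte count is rounded up to a multiple of the
   alignment, as C11 aligned_alloc requires. Overflow of n * 8 is not
   checked. Returns NULL on failure; release the block with free(). */
double *alloc_doubles_aligned32(size_t n)
{
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    double *p;

    if (rounded == 0)
        rounded = ALIGNMENT;            /* avoid a zero-size request */

    p = aligned_alloc(ALIGNMENT, rounded);
    assert(p == NULL || is_aligned32(p));
    return p;
}
```

In assembly source the same effect usually comes from an alignment directive on the data segment; the C version is only meant to show the address check and the size rounding.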

The optimization guide is packed full of tips.  It's about 150 pages
in total, although half of it is a reference guide.

-Don Leclair






Mersenne: Pentium Pro Optimization Help Needed

1999-06-09 Thread George Woltman

Hi all,

I'm trying to optimize prime95 for the Pentium Pro/PII/PIII
architecture.  I'm fairly well versed in various execution units
and latencies, but some mysteries remain.

Are there any experts in this field - maybe even some Intel
employees - that could improve the code further?  Even one clock 
cycle in a macro that will be executed a few quintillion times is
a big help.

The new assembly macros are at ftp://entropia.com/gimps/lucas1p.mac
for you to look at.

Questions:  Why is the code faster when I throw in some
no-ops (actually fxch st(0) instructions)?  How can I force the
CPU to execute the floating point micro-ops in the optimal order?
Does reordering the fstp instructions have any effect?  Are there
other issues I should consider?

Regards,
George - who is looking forward to IA-64 where I am in control of
the opcode scheduling once again.  Not to mention lots of registers!

P.S.  The clock timings were measured using the following loop.  I can
provide more details upon request.
        mov     al, 0
        mov     ecx, 250                ; 1000 iterations
clp1:   disp    four_complex_cpm_fft_3 8, 16, 32   ;;; or some other macro
        lea     esi, [esi+64]
        add     al, 256/4
        jnc     clp1
        lea     esi, [esi-256]
        dec     ecx                     ; Check loop counter
        jnz     clp1                    ; Loop if necessary
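
The "1000 iterations" figure follows from the 8-bit carry trick: al gains 64 on each pass, so the add carries out on every fourth pass, giving four inner passes per outer pass and 250 x 4 = 1000 macro invocations. The C sketch below models only that control flow and the esi bookkeeping; it does not run the FFT macro, and the names loop_stats and model_timing_loop are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the timing loop's control flow. */
typedef struct {
    unsigned long invocations;  /* times the macro body would run */
    long esi_offset;            /* net change to esi at loop exit */
} loop_stats;

loop_stats model_timing_loop(unsigned outer_count)
{
    loop_stats s = {0, 0};
    uint8_t al = 0;                     /* mov al, 0   */
    unsigned ecx = outer_count;         /* mov ecx, 250 */

    assert(outer_count > 0);            /* dec/jnz would wrap on zero */
    do {
        int carry;
        do {
            s.invocations++;            /* disp four_complex_cpm_fft_3 ... */
            s.esi_offset += 64;         /* lea esi, [esi+64] */
            carry = (al + 64) > 0xFF;   /* add al, 256/4 sets CF on wrap */
            al = (uint8_t)(al + 64);
        } while (!carry);               /* jnc clp1 */
        s.esi_offset -= 256;            /* lea esi, [esi-256] */
    } while (--ecx != 0);               /* dec ecx / jnz clp1 */
    return s;
}
```

Calling model_timing_loop(250) reports 1000 invocations and a net esi offset of 0, so under this model the macro cycles through the same four 64-byte blocks on every outer pass.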


