Or it could be a combination of decoding/code alignment problems
(sub-optimal decode cycles) which cause goofy patterns in loops and such. I
suggest running it through VTune and seeing what comes up there...
There's a good doc at http://www.agner.org/assem/pentopt.htm which
explains all this stuff better than I could ever hope to.
-----Original Message-----
From: Blosser, Jeremy [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 11, 1999 10:06 AM
To: '[EMAIL PROTECTED]'
Subject: RE: Mersenne: Pentium Pro Optimization Help Needed
Umm... decoding optimization (4-1-1 rule)
For example in four_complex_cpm_fft_3:
;;1-1-1
fld R6 ;; I2,I3,A2,r/i,A4,I1,R3,R1
fmul st(3), st ;; B2 = I2 * r/i
;23-27
fsubp st(2), st ;; A2 = A2 - I2
;24-26
;;1 (D1, D2 stall)
fld R8 ;; I4,I3,A2,B2,A4,I1,R3,R1
;;2-1 (D2 stall)
fmul QWORD PTR [edi+24] ;; B4 = I4 * r/i
;25-29
fxch st(4) ;; A4,I3,A2,B2,B4,I1,R3,R1
;;2-1 (D2 stall)
fsub R8 ;; A4 = A4 - I4
;26-28
fxch st(2) ;; A2,I3,A4,B2,B4,I1,R3,R1
Plus all the stores are decoded in separate cycles (2 uOps)
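
The grouping annotated above can be reproduced with a rough model of the 4-1-1 rule. The sketch below is only an illustration: the function name `decode_groups` is made up, the uop counts are read off the annotations in the listing (1-1-1, 1, 2-1, 2-1), and it ignores instruction length, fetch alignment and anything over 4 uops.

```python
def decode_groups(uop_counts):
    """Greedy model of the 4-1-1 rule (an assumption-laden sketch):
    each cycle D0 accepts one instruction of up to 4 uops, while
    D1 and D2 accept only single-uop instructions, in program order."""
    groups = []
    i = 0
    while i < len(uop_counts):
        if not 1 <= uop_counts[i] <= 4:
            raise ValueError("only instructions of 1 to 4 uops are modeled")
        group = [uop_counts[i]]          # first instruction goes to D0
        i += 1
        # D1 and D2: keep taking instructions while they are single-uop
        while len(group) < 3 and i < len(uop_counts) and uop_counts[i] == 1:
            group.append(uop_counts[i])
            i += 1
        groups.append(group)             # anything else waits for next cycle
    return groups


# uop counts as annotated in the listing (not measured here)
snippet = [
    ("fld R6", 1),
    ("fmul st(3), st", 1),
    ("fsubp st(2), st", 1),
    ("fld R8", 1),
    ("fmul QWORD PTR [edi+24]", 2),
    ("fxch st(4)", 1),
    ("fsub R8", 2),
    ("fxch st(2)", 1),
]

if __name__ == "__main__":
    for cycle, group in enumerate(decode_groups([u for _, u in snippet]), 1):
        print(cycle, "-".join(str(u) for u in group))
```

Under this model the eight instructions take four decode cycles, because each 2-uop instruction has to wait for D0.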
I'm sure someone else will correct my mistakes ;)
I'm sure you checked cache alignments... I can't think of anything else
offhand...
Also, I noticed that no attention was paid to K6 optimization (i.e.
tossing the fxch's) in the current code... Any effort to improve that or is
it not worth it?
-----Original Message-----
From: George Woltman [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 09, 1999 10:19 PM
To: [EMAIL PROTECTED]
Subject: Mersenne: Pentium Pro Optimization Help Needed
Hi all,
I'm trying to optimize prime95 for the Pentium Pro/PII/PIII
architecture. I'm fairly well versed in various execution units
and latencies, but some mysteries remain.
Are there any experts in this field - maybe even some Intel
employees - that could improve the code further? Even one clock
cycle in a macro that will be executed a few quintillion times is
a big help.
The new assembly macros are at ftp://entropia.com/gimps/lucas1p.mac
for you to look at.
Questions: Why is the code faster when I throw in some
no-ops (actually fxch st(0) instructions)? How can I force the
CPU to execute the floating point micro-ops in the optimal order?
Does reordering the fstp instructions have any effect? Are there
other issues I should consider?
Regards
George - who is looking forward to IA-64 where I am in control of
the opcode scheduling once again. Not to mention lots of registers!
P.S. The clock timings were measured using the following loop. I can
provide more details upon request.
mov al, 0
mov ecx, 250 ; 1000 iterations
clp1: disp four_complex_cpm_fft_3 8, 16, 32 ;;; or some other macro
lea esi, [esi+64]
add al, 256/4
jnc clp1
lea esi, [esi-256]
dec ecx ; Check loop counter
jnz clp1 ; Loop if necessary
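
For anyone checking the "1000 iterations" comment: the loop relies on the 8-bit add to al setting the carry flag on every fourth pass. The following is a small sketch that mimics the control flow in Python. The function name and the `esi_offset` bookkeeping are invented for illustration, and it models only the counters and the carry flag, not timing.

```python
def count_macro_executions(outer=250, step=256 // 4):
    """Mimic the timing loop's control flow and count how often the
    macro body would run. Returns (executions, net esi offset)."""
    al = 0                       # mov al, 0
    ecx = outer                  # mov ecx, 250
    executions = 0
    esi_offset = 0               # hypothetical: esi relative to its start
    while True:
        executions += 1          # clp1: the macro under test
        esi_offset += 64         # lea esi, [esi+64]
        total = al + step        # add al, 256/4
        carry = total > 0xFF     # CF is set when the 8-bit add overflows
        al = total & 0xFF
        if not carry:            # jnc clp1
            continue
        esi_offset -= 256        # lea esi, [esi-256]
        ecx -= 1                 # dec ecx
        if ecx == 0:             # jnz clp1 falls through when ecx hits 0
            break
    return executions, esi_offset


if __name__ == "__main__":
    print(count_macro_executions())
```

With the values from the loop, the carry fires on every fourth add (64, 128, 192, then wrap to 0), so the macro runs 4 x 250 times and esi ends up back where it started.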
Unsubscribe list info -- http://www.scruz.net/~luke/signup.htm