Mersenne Digest Tuesday, August 24 1999 Volume 01 : Number 618 ---------------------------------------------------------------------- Date: Sat, 21 Aug 1999 18:59:22 EDT From: [EMAIL PROTECTED] Subject: Mersenne: Re: DS20 timings Simon Burge writes (Hi, Simon!): >It looks like the DEC C compiler has the same problem - adding "-assume >accuracy_sensitive" to the compiler command line fixed the problem where >MacLucasUnix thought it could do 33219281 with a 1M FFT Yes, MacLucasUnix uses the same fast round strategy, so would suffer the same fate under aggressive optimization unless care is taken. >Have you got 2.6x and 2.6a mixed up? From what I understand, David used 2.6x for the double check, and 2.6a is currently on your ftp site. I hope not - otherwise, what was I working on for most of the past week? :) Actually, you're right - I created 2.6b and executables on another server (the one which has my F90 license), and in my haste to transfer the final files to my webserver and finish writing my posting I neglected to transfer the 2.6b source to the webserver (the executables were there). It's there now. The 2.6a source has been moved to the /pub/archived directory, for those of you who want to see what changes I made to the source to squeeze more speed out of the code. >These results are for a copy of Mlucas_2.6a.f90 I compiled with: > > f90 -o lm -tune ev6 -O5 lucas_mayer_V2.5b.f90 Many thanks for the timings! Looks like Mlucas is no more than 20% slower than MacLucasUnix at any runlength, and catches up with it at 1024K. Now get ready for a 15-20% further speed boost... The good news is, v2.6b should be about 10% faster (esp. at large FFT lengths) than your timings of 2.6a. Note that for 2.6b I've precompiled 3 separate executables for the three main Alpha generations (ev4 = 21064, ev5 = 21164, ev6 = 21264). 
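The "fast round strategy" mentioned above is, in its common form, the add-a-magic-constant trick; the sketch below is illustrative only (not Mlucas's actual source). It also shows why aggressive optimization is dangerous here: a compiler allowed to reassociate floating-point arithmetic may cancel the add/subtract pair outright, which is what options like -assume accuracy_sensitive prevent.

```python
# Illustrative sketch of the classic "fast round" trick (not Mlucas's
# actual code). Adding and then subtracting 1.5 * 2**52 leaves an IEEE-754
# double rounded to the nearest integer, because at that magnitude the
# spacing between representable doubles is exactly 1.
MAGIC = 1.5 * 2.0 ** 52  # 6755399441055744.0

def fast_round(x):
    # Valid for |x| well below 2**51. A compiler that reassociates
    # floating-point math may simplify (x + MAGIC) - MAGIC to just x,
    # silently breaking the rounding.
    return (x + MAGIC) - MAGIC
```

In Fortran or C the same trick only survives if the compiler is told to preserve the exact evaluation order.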
If you have time, try compiling the 2.6b source locally with various options and see if you can beat the ones I used for the ev6 executable (but please see my note about the timings below!) >The Mlucas_2.6a.exe from your FTP site gives the same results but is >slightly slower (around 5%) on the DS20: That 5% is likely due to the fact that the 2.6a .exe was compiled for a generic platform, i.e. sans ev4/5/6-specific tunings. > % time ./Mlucas_2.6a.exe < foo Note that on platforms which support the real*16 data type, you can't just do a single timing run and get an accurate timing. That's because if real*16 (actually, any extended-precision floating data type with at least 18 decimal digits of precision - I coded it that way to access the x86 80-bit register type, with compilers that can make use of it) is available, the program auto-detects it and uses it for high-precision FFT sincos data initializations. This may take up to the equivalent of 40-80 iteration times. To subtract this initialization component, you need to do 100- and 200-iteration timing runs, take the difference of the two timings and divide by 100. (The README file describes this.) I look forward to the revised (and hopefully improved) ev6 timings! Brian Beesley writes (Hi, Brian!): >> Hmm, those upper limits seem a bit low. Does MacLucasUnix tell you when >> the exponent is too large? > >The way I understand MacLucasUNIX _should_ work is that it guesses >that an FFT size of the next power of 2 above exponent/32 might work. >This is usually hopelessly optimistic. However, it monitors the round-off >errors for the first "few" (hundreds) of iterations; if the round-off >errors are too large then it restarts with double the FFT size. See Simon's note about the accuracy_sensitive option being needed for this to work properly. 
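The two-run timing recipe amounts to the following arithmetic (the run times here are invented for illustration; only the method is from the README description):

```python
# Subtracting the 100-iteration run from the 200-iteration run cancels the
# one-time sincos setup cost, which is identical in both runs.
t100 = 14.0  # hypothetical wall-clock seconds: setup + 100 iterations
t200 = 24.0  # hypothetical wall-clock seconds: setup + 200 iterations

per_iter = (t200 - t100) / 100.0  # true per-iteration time: 0.1 sec
setup = t100 - 100 * per_iter     # implied setup cost: ~4 sec
```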
>> size: 128K 256K 320K 384K 448K 512K 640K 768K 896K 1024K >> pmax: 2.62M 5.20M 6.46M 7.71M 8.96M 10.2M 12.6M 15.1M 17.5M 20M >> >> On strictly real*8-type hardware, the upper limits are about 1% less. >> The only cost of real*16 inits is a greater initialization time. > >Does this not make assumptions about the accuracy of the trig >functions in the library, also does it take into account the effect >of any "magic numbers" used in the DWT to achieve remaindering modulo >2^p-1 ? Sure it makes some assumptions, but not unreasonable ones. I know this because the SPEC98-submitted version of Mlucas has been tried on a huge variety of platforms and compiled with many different compilers, and the "1% rule" has proved a good one. Even with the most aggressive compile options and specifying less-accurate trig functions that are faster (a silly thing to do with Mlucas, since one only computes them once, so why not spend a few seconds more and get better accuracy?), one could still always get a pmax at least 98% of the ones listed above. The DWT weight factors (your "magic numbers") obey similar rules, i.e. there's nothing special about them (at least to the F90 compiler - perhaps the C compiler uses a different library? That would be dumb, but not shocking) that should cause them to be much more inaccurate than the trig functions, assuming the intrinsic function library was well-written (that may not be a good assumption on all platforms, unfortunately.) >> Now, if we could squeeze gains out by assembly coding anywhere near those >> Jason Papadopoulos obtained on the SPARC via the ASM route, that would >> be impressive, indeed. > >I don't know what the F90 compilers do - but gcc can produce ASM from >C source files - if I had loads of time, I'd start from that & a >profile listing of where the program spends most of its time, & see >how much I could improve those bits of the code by massaging the ASM >output by the compiler. 
Pretty much what I had in mind for starters, except I'd rather spend my time working on basic algorithmic improvements and code functionality (e.g. adding p-1 factoring.) Any ASM experts out there with some free time? Cheers, Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sat, 21 Aug 1999 21:18:52 EDT From: [EMAIL PROTECTED] Subject: Mersenne: ONLINE.EXE malfunction? Hrm. [STL clicks on ONLINE.EXE] The ONLINE.EXE file is linked to missing export WININET.DLL:InternetGetConnectedState. ["Yeah, yeah." STL clicks OK.] C:\TEMP\ONLINE.EXE A device attached to the system is not functioning. ["Auuugh! You hunk of junk computer!" STL pounds his fist on the desk for emphasis.] Anyone know what caused this error message? I'm not emailing this directly to Mr. Kurowski because it's not working right, and he's probably flooded with successful results. :-S <<- type of Internet connection (dialup modem, LAN, DSL, ISDN, etc.)>> Dialup modem, US Robotics something or other. <<- output of program test run(s)>> A malfunction. :-O <<- if the connection was really open or not>> It was open. <<- ISP connection method & if applicable, version (direct, Windows dialup networking [DUN], AOL 4.0, Compuserve 3.0, etc.)>> I dial up to AOL, version 3.0 for Windows 95, revision 131.62. << - web browsers & versions installed (Netscape 3, IE4.0, IE5, etc.)>> IE3 and Netscape 3, but AOL uses the former. << - Operating system (Win95, Win98, NT 4.0 workstation or server)>> Win95 OSR2. Is this problem software-based? The mention of a DLL makes me think so... Oh well, S.T.L. 
_________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 05:33:57 +0200 (CEST) From: Henrik Olsen <[EMAIL PROTECTED]> Subject: Re: Mersenne: ONLINE.EXE malfunction? On Sat, 21 Aug 1999 [EMAIL PROTECTED] wrote: > Anyone know what caused this error message? I'm not emailing this directly > to Mr. Kurowski because it's not working right, and he's probably flooded > with successful results. :-S He wrote the code, he's the one entitled to be told of bugs. How do you expect him to be able to write code that works on every machine if you won't tell him when it doesn't work? - -- Henrik Olsen, Dawn Solutions I/S URL=http://www.iaeste.dk/~henrik/ Thomas Covenant: I am the savior of The Land. Linden Avery: Can I help? Thomas Covenant: Over my dead body. (dies) (Linden Avery saves The Land.) The Second Chronicles of Thomas Covenant, Book-A-Minute version _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 15:06:32 +1000 From: Simon Burge <[EMAIL PROTECTED]> Subject: Mersenne: Re: DS20 timings [EMAIL PROTECTED] wrote: > Simon Burge writes (Hi, Simon!): Howdy! > The 2.6a source has been moved to the /pub/archived directory, for those > of you who want to see what changes I made to the source to squeeze more > speed out of the code. I'm curious - there seem to be definite patterns in the sections of code (especially when comparing the different length radix subroutines). Is any of this code machine generated, or is it all hand generated? In the past I attempted to get a C version of your code working but didn't quite get there, and then Gord Palameta did a version last year. 
I was thinking of taking a stab at trying 2.6b again... > I look forward to the revised (and hopefully improved) ev6 timings! Here's some more times with a few explanatory notes below:

Platform/per-iteration time (sec)

                500MHz 21264    500MHz 21264    500MHz 21264
                64kB I-cache    64kB I-cache    64kB I-cache
                64kB D-cache    64kB D-cache    64kB D-cache
                4MB L2          4MB L2          4MB L2
                Mlucas2.6a[1]   Mlucas2.6b[2]   Mlucas2.6b[3]   [4]
FFT length:     -------------   -------------   -------------   -------
  128K          0.043           0.040           0.041           0.001
  160K          0.051           0.049           0.053           0.003
  192K          0.062           0.062           0.066           0.004
  224K          0.081           0.081           0.081           0.001
  256K          0.10            0.089           0.093           0.003
  320K          0.12            0.112           0.118           0.005
  384K          0.16            0.14            0.15            0.009
  448K          0.19            0.18            0.18            0.003
  512K          0.21            0.19            0.20            0.004
  640K          0.28            0.25            0.26            0.015
  768K          0.38            0.34            0.36            0.011
  896K          0.40            0.38            0.38            0.006
 1024K          0.46            0.42

[1] - This is the average for the first N iters INCLUDING SETUP TIME. N ranges from 8000 for a 128K FFT to 500 for a 1024K FFT.
[2] - This is the average for the time difference between a 100 iter run and a 200 iter run as described in the README file and completely ignored by me the first time around :-) This is using my binary compiled with: f90 -arch ev6 -tune ev6 -fast -O5 -assume accuracy_sensitive
[3] - As for [2], but using Ernst's Mlucas_2.6b.exe.ev6 binary. In both [2] and [3], the residues matched.
[4] - The speed increase in secs/iter between my binary and Ernst's binary.

Ernst - what compiler options did you use? Simon. _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 15:12:26 +1000 From: Simon Burge <[EMAIL PROTECTED]> Subject: Re: Mersenne: LL Test Program verification data "Brian J. Beesley" wrote: > I've posted on my ftp server a file which may be of interest to > anyone developing or verifying the operation of LL testing programs. 
> > The URL of the file is > ftp://lettuce.edsc.ulst.ac.uk/gimps/PrimeQA/QADATA.TXT > > The format is > exponent,# iterations,residual,program id,max error > > The residual listed is the hexadecimal representation of the low > order 64 bits of the complete residual. The program id is always > "lucdwt"; the max error can be ignored (basically it's there as a > self-check to make sure that a sensible FFT size was used to generate > the data). > > The range of exponents is selected exponents between 8,000 and > 80,000,000. The low 64 bits of the residual after 400 iterations is > listed for every exponent in the file. (After one iteration the > residual should be 000000000000000E for every exponent). For > exponents less than 4,000,000, the residual after 1,000 iterations is > also listed. For exponents greater than 20,000,000, the residual > after 100 iterations is also listed. > > The results were generated on a 533 MHz Alpha 21164LX system; the > program used is a version of lucdwt (included in Richard Crandall's > giantint package), modified to produce output in a suitable format. > > I hope that this data assists anyone developing or modifying LL > testing software to ensure that their code is operating correctly. I've got some patches to Will's mers package to print out results that can be compared with this. 
For example, here's the first few lines of QADATA.TXT:

64511,100,EBA1B280B7C00FF7,lucdwt,0.000107
64511,200,96F428A92412ADA8,lucdwt,0.000122
64511,300,38498B59F3A614B7,lucdwt,0.000122
64511,400,3389BC878321980A,lucdwt,0.000122
64513,100,D24F0966C2E9B3C4,lucdwt,0.000092
64513,200,2376508C726DFB47,lucdwt,0.000092
64513,300,D4BA71A0879202D5,lucdwt,0.000092
64513,400,FF4069B4F7F0BDA4,lucdwt,0.000095

and the start of a test run of MacLucasUNIX running on an UltraSPARC:

M( 64511 )C, 0xeba1b280b7c00ff7, iter = 100, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64511 )C, 0x96f428a92412ada8, iter = 200, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64511 )C, 0x38498b59f3a614b7, iter = 300, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64511 )C, 0x3389bc878321980a, iter = 400, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64513 )C, 0xd24f0966c2e9b3c4, iter = 100, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64513 )C, 0x2376508c726dfb47, iter = 200, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64513 )C, 0xd4ba71a0879202d5, iter = 300, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64513 )C, 0xff4069b4f7f0bda4, iter = 400, n = 4096, MacLucasUNIX v6.20 Sweeney

The tests are still running - when it is finished I'll write a little perl script that compares the two files and prints out any error messages. Simon. _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 02:24:50 EDT From: [EMAIL PROTECTED] Subject: Mersenne: Re: LL Test Program verification data Brian Beesley writes: >I've posted on my ftp server a file which may be of interest to >anyone developing or verifying the operation of LL testing programs. Mlucas users please note: my code treats the initial residue (4) as "iteration one," rather than the more-common zero. (I guess that makes it obvious I'm a Fortran guy, eh? :) Thus, to compare to, e.g. 
Brian's 400-iteration residue for a given exponent, you'd need to specify 401 iterations using the Mlucas timing test feature. (I'd actually prefer Brian use 10-iteration residues, since then Mlucas would literally "go to eleven," but I realize ten iterations is too few for a decent QA test.) Also, Mlucas requires the exponent to be prime (actually a strong probable pseudoprime), whereas most of the exponents in Brian's QA file are composite. Brian, any chance you could flag the prime exponents in the file? (you could write a small code to do this automatically.) Cheers, Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 02:24:49 EDT From: [EMAIL PROTECTED] Subject: Mersenne: Re: DS20 timings Simon Burge writes: >I'm curious - there seems to be definite patterns in the sections of >code (especially when comparing the different length radix subroutines). >Is any of this code machine generated, or is it all hand generated? All hunt-and-peck hand-generated. Of course many sections of code are very similar, so much was created via cut-paste-modify. That's what I meant about the out-of-place transform strategy being nice from a coding and debugging (and now, speed) perspective. The best place to read about the speedups and view the evolution of the code is the program header. >In the past I attempted to get a C version of your code working but didn't >quite get there, and then Gord Palameta did a version last year. I was >thinking of taking a stab at trying 2.6b again... I agree, it would be nice to have a C version, but as long as there are executables for the major platforms, it's not crucial. I'll have a Linux binary hopefully by middle of next week. 
Since we have Alpha Unix and SGI Irix binaries, the major voids are SPARC and the PowerPC (and its AltiVec-equipped successors). Hopefully when Alex Kruppa returns to a quasi-normal routine (and passes his TU München exams), one of the three f90 compilers he said he'd try on his Ultra will prove to be a decent one. Of course, PPC and SPARCers can (and should) use MacLucasUnix in the meantime. There's also Jason Papadopoulos' potential LL code, but even with lots of encouragement it'll likely take him months to modify his Fermat number code appropriately, and even longer to add non-power-of-2 runlengths. As anyone who has coded such algorithms knows, it's highly nontrivial, especially when you need the code to not just run, but to run fast. (I've been working on Mlucas for nearly three years - of course much of year 1 was spent just learning the basics of FFTs, the LL test, and the DWT.) >Here's some more times with a few explanatory notes below: Thanks! Those look very good - no more than a few percent slower than MacLucasUnix at any power-of-2 runlength, and quite a bit faster at most of the intermediate lengths and at 1024K. >[4] - The speed increase in secs/iter between my binary and Ernst's > binary. Ernst - what compiler options did you use? I didn't use -arch ev6, used -O4 rather than -O5 (-O5 runs slower on my 21064, where I compile) and didn't use -fast (also a tad slower on my ev4). I'll compile using your options (at least for the ev6 executable) sometime in the next few days. The difference is slight, but I know every bit counts - e.g. for my Fermat-DWT code, every 1% speedup saved nearly 2 days of runtime on F24 (on a 250 MHz MIPS R10000). 
Cheers and happy hunting, Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 21:45:31 EDT From: [EMAIL PROTECTED] Subject: Mersenne: K7 vs. x86 There have been some recent posts about the AMD K7 (Athlon), and whether properly tuning the Prime95 code for the K7 could result in better performance than on a similar-clock-rate Pentium. Of course George W. can comment definitively, but in theory the K7 should be capable of delivering rather better performance. Here's my reasoning: 1) The K7 has essentially the same ultra-high-performance bus as the Alpha 21264, which I believe is 128 bits wide and runs at 200MHz, thus is capable of feeding the hungry processor with significantly more data; 2) The K7 has 3 functional units in its FPU, compared to just two for every other high-end microprocessor I am aware of. All high-end processors have a floating adder and multiplier. I'm not sure what the third unit on the K7 does, but I suspect it's either a second floating adder, or perhaps capable of doing a fused multiply-add. In either case, the, um, added add capability could be exploited to speed an FFT. That's because a typical higher-radix FFT implementation is dominated by adds rather than multiplies. For example, a typical complex radix-16 FFT pass in my Mlucas code takes 16 complex data (32 8-byte floats), and including multiplies by "twiddle" factors (FFT sincos data) does 168 FADDs and 88 FMULs on them- that's nearly twice as many adds as multiplies, i.e. the floating multiplier is idle nearly half the time. If one could do two adds per cycle, one could remedy that imbalance and (neglecting the fact that memory traffic is also responsible for many processor stalls) in theory nearly double the speed. 
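Ernst's radix-16 numbers make the imbalance, and the potential gain from a second adder, easy to quantify with a toy bound (ignoring memory traffic and dependencies, as he notes):

```python
# Toy throughput model for one radix-16 FFT pass: 168 FADDs and 88 FMULs.
# A lower bound on cycles is ops / units for the busiest functional unit.
fadds, fmuls = 168, 88

one_adder = max(fadds / 1, fmuls / 1)   # adder-limited: 168 cycles minimum
two_adders = max(fadds / 2, fmuls / 1)  # multiplier now limits: 88 cycles

speedup = one_adder / two_adders  # about 1.9x - "nearly double"
```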
I've always thought the cheapest way to improve performance of a transform-based code (since floating hardware is very expensive) would be to design a floating adder that could take two data x and y and return both the sum and difference, x+y and x-y, in one clock. This would be a much cheaper way to speed an FFT than adding an entire third unit to the FPU, since much of the work in this pair of operations is shared (for instance, the exponent compare and mantissa shift are the same for the sum and difference, i.e. need only be done once.) One would think, e.g. a floating adder and a fused add/multiply unit (in lieu of the pure multiplier) might be capable of giving a similar performance boost, but in practice it appears difficult or impossible to schedule the above operations so as to be doable in <= 100 clocks with such a combination. Of course people don't just run FFTs, which is why fused multiply-add is preferred by hardware designers over fused add-subtract. - -Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 21:30:21 -0600 From: "Aaron Blosser" <[EMAIL PROTECTED]> Subject: RE: Mersenne: K7 vs. x86 > 2) The K7 has 3 functional units in its FPU, compared to just two for > every other high-end microprocessor I am aware of. All high-end processors > have a floating adder and multiplier. I'm not sure what the third unit on > the K7 does, but I suspect it's either a second floating adder, or perhaps > capable of doing a fused multiply-add. In either case, the, um, added add > capability could be exploited to speed an FFT. That's because a typical > higher-radix FFT implementation is dominated by adds rather than > multiplies. I had discussed the Athlon briefly with George, and his understanding was that Athlon could do an fadd and an fmul in the 2 FPU units at the same time. 
This, according to him, is not that useful unless you could really work hard to make sure both engines are constantly fed with work. I've confirmed this architecture on the Athlon (visit http://www.amd.com/products/cpg/athlon/pdf/fpu_wp.pdf for a nice overview of the Athlon FPU). Sure enough, there are 3 FPU units. One that does fadd (and 3dnow), one doing fmul (and 3dnow) and one doing fstore (loads/stores/etc). I wonder if those benchmarks, like spec_fp, run in such a way that they do mixes of fadds and fmuls, keeping all 3 pipes constantly busy. If so (and I bet it does), no wonder the Athlon kicks butt on the FP benchmarks (~137% faster for a PIII-550 vs. Athlon 550 on specfp_base95). But as we can see from the real world benchmarks, even though the Athlon still does better, clock for clock, than a PIII, the difference isn't as great (only ~107% faster on Winbench 99 FPU Winmark). The benchmark was done with a PIII Xeon with 512KB L2 cache. I wonder if having a Xeon with a 2MB cache (or even 1MB) would have made this any closer since the Winbench test is a lot larger than the specfp tests, and a larger cache would likely have more impact. Hmmm... So, the question I put out is, can the algorithm be tweaked in any way so that fadds and fmuls can be done together, or is it, as George initially suspects, worthless? Having the pipe just for doing fstore is probably where we'd see most of the performance benefits, and according to the AMD supplied chart, the execution time of fmul is less (4=Athlon, 5=PIII), however fadd takes longer on the Athlon (4=Athlon, 3=PIII). If you wanted to do extended double fdivs, the Athlon would definitely kick butt (24=Athlon, 37=PIII), and you see the same thing with the fsqrt (extended) as another example. 
If you were pipelining fmuls, the Athlon could spit them out in 1 clock cycle (after the 4 cycle latency), compared to 2 cycles (after the 5 cycle latency) on the PIII, so it would be REAL important to get lots of yummy pipelined fmuls to the Athlon to really let it strut its stuff. Aaron _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Mon, 23 Aug 1999 00:15:14 EDT From: [EMAIL PROTECTED] Subject: Mersenne: How fast can we square? (Was: K7 vs. x86) What I neglected to state explicitly in my last posting is that the lower limit on the number of adds in an FFT also sets a limit on the achievable performance. Let's consider a Mersenne-mod DWT-based squaring with a (real) vector length of 2^20, or just over a million, performed using 64-bit floats on a CPU which can execute one floating add per clock cycle (i.e. most processors). When one includes the FFT twiddle multiplies, a typical number is that we need 3 adds per float per power of 2 in the vector length, i.e. 60 per float to effect a length-2^20 transform. Going to higher radices and highly optimized transforms shaves a little bit off this figure, but mainly has the effect of reducing the number of floating multiplies and allowing one to do lots of operations while data are in registers, i.e. of keeping the processor well fed. We thus estimate 60 FADDs per datum for a forward FFT, and an equal number for the inverse FFT. Throw in some adds for the pointwise square and the carry step, and 135-150 adds per float per iteration sounds reasonable. We thus need to do around 150 million adds per iteration. On a 450 MHz CPU, even if one does the theoretical maximum of one add per clock, each iteration will need a third of a second. 
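Ernst's estimate can be written out as arithmetic (145 adds per float is just a midpoint of his 135-150 range):

```python
# Back-of-envelope operation count for one LL iteration at FFT length 2^20,
# following the text: ~3 adds per float per power of 2 of vector length.
levels = 20
n = 2 ** levels                     # 1,048,576 floats

adds_fwd = 3 * levels               # ~60 FADDs per float, forward FFT
adds_per_float = 2 * adds_fwd + 25  # + inverse FFT, pointwise square, carry

total_adds = n * adds_per_float     # ~152 million adds per iteration
secs = total_adds / 450e6           # ~0.34 sec at one add per 450 MHz clock
```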
That math applies to the IA-64, so we can expect good performance from the code George is presumably writing for that CPU, but not miracles - George tells me his Prime95 code needs about 2/3 of a second to do a length-2^20 squaring on a 450MHz PII, so we can expect at most a factor of 2 speedup on a similar-speed Merced. Looking at my Mlucas code, on a 500MHz Alpha 21264 (also at most one add per clock) it needs 0.4 seconds to do the above squaring, which is not much longer than the estimated minimum time of 0.3 seconds. This tells me that a huge ASM coding effort targeted at the 21264 is not justified, since one can't hope to squeeze much more speed out of the code that way. On the other hand, the code runs 3x slower on an equal-speed 21164, so trying to boost the performance on that architecture by improving the source code and doing some assembly coding should be worth some effort. - -Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Mon, 23 Aug 1999 00:31:55 -0400 (EDT) From: Jason Stratos Papadopoulos <[EMAIL PROTECTED]> Subject: Re: Mersenne: K7 vs. x86 On Sun, 22 Aug 1999 [EMAIL PROTECTED] wrote: > For example, a typical complex radix-16 FFT pass in my Mlucas code > takes 16 complex data (32 8-byte floats), and including multiplies by > "twiddle" factors (FFT sincos data) does 168 FADDs and 88 FMULs on them- > that's nearly twice as many adds as multiplies, i.e. the floating multiplier > is idle nearly half the time. If one could do two adds per cycle, one could > remedy that imbalance and (neglecting the fact that memory traffic is also > responsible for many processor stalls) in theory nearly double the speed. I've wondered for a while if the DFT can be re-factorized to have *more* multiplies and fewer total ops. 
From the doodling I've done, though, it seems that the ratio can never be better than 3:2 adds to muls. There was a paper in (I believe) Appl. Math and Comp. from about two years ago that reformulated radix-2,3,4 and 5 FFT butterflies to use multiply-adds wherever possible. The savings can be impressive: a radix-2 FFT butterfly has 4 multiplies and 6 adds, but this boils down to 6 multiply-adds. Can't the RS6000 do two multiply-adds per clock? Or, if you just want to cheapen a complex multiply, turn the standard form xr,xi * yr,yi -> xr*yr-xi*yi, xr*yi+xi*yr into xr,xi * yr,yi -> xr*(yr - xi/xr * yi), xr*(yi + xi/xr * yr) and precompute xr and xi/xr. Presto! 2 multiplies and 2 multiply-adds. > I've always thought the cheapest way to improve performance of a transform- > based code (since floating hardware is very expensive) would be to design > a floating adder that could take two data x and y and return both the sum > and difference, x+y and x-y, in one clock. This would be a much cheaper way to > speed an FFT than adding an entire third unit to the FPU, since much of this > pair of operations is the same (for instance, the exponent compare and > mantissa > shift are the same for the sum and difference, i.e. need only be done once.) Analog Devices' SHARC DSP chips have a fused add-subtract (and a fused multiply-add); a 50MHz SHARC can do an FFT taking only twice as long as a 300MHz PPC 604e (sorry, single precision only). jasonp _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Mon, 23 Aug 1999 11:31:37 -0500 From: "Willmore, David" <[EMAIL PROTECTED]> Subject: RE: Mersenne: K7 vs. 
x86 > From: [EMAIL PROTECTED] [SMTP:[EMAIL PROTECTED]] > 1) The K7 has essentially the same ultra-high-performance bus as the > Alpha 21264, which I believe is 128 bits wide and runs at 200MHz, thus > is capable of feeding the hungry processor with significantly more data; > Sorry, system data bus from Athlon to core logic is SDATA[63:0]. It's the signal levels and most of the overall logical design that the K7 takes from the EV6 (21264). Now, AMD has promised server versions of the Athlon and hinted that they might need a different slot connector. This would fit with a wider data bus. Note, the 21264 cannot plug into the 'slot A' connector without losing half of its data bus width. Isn't there a 21264PC or some such? I believe those were half width data bus types of chips. One of them might work in a slot A. Keep in mind that the Athlon bus as it is in slot A is already a match for current generation SDRAM. Of course, DDR-SDRAM will be out soon as will higher clock rate SDRAM. They could always block/interleave the memory, too. Cheers, David _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Mon, 23 Aug 1999 20:28:59 +0100 From: "Brian J. Beesley" <[EMAIL PROTECTED]> Subject: RE: Mersenne: K7 vs. x86 On 22 Aug 99, at 21:30, Aaron Blosser wrote: > But as we can see from the real world benchmarks, even though the Athlon > still does better, clock for clock, than a PIII, the difference isn't as > great (only ~107% faster on Winbench 99 FPU Winmark). The benchmark was > done with a PIII Xeon with 512KB L2 cache. This is _still_ remarkable, since the "consumer" Athlons starting to trickle onto the market have 64-bit 100 MHz FSB and 512KB L2 cache, like PII / PIII / Xeon, but run their L2 cache at only 1/3 clock speed (c.f. 
full clock speed for Xeon & 1/2 clock speed for PII/PIII) The high performance Athlons with 128 bit FSB @ 200 MHz & larger, relatively faster L2 cache should be really impressive. Maybe starting to approach what Alpha has been doing for a while ;-) (The critical difference here is that Athlon does run native IA32 code!) > So, the question I put out is, can the algorithm be tweaked in any way so > that fadds and fmuls can be done together, or is it, as George initially > suspects, worthless? Hmm. Is it not possible to use two "threads" (NOT in the OS sense - I mean working on two streams of data, interleaved) & arrange them so that the FMULs for stream 1 happen when stream 2 is doing FADDs on independent data, & vice versa? I think you should be able to keep the pipes pretty full ... though whether you can do a _lot_ better than a decent compiler depends on how smart the instruction prefetch / decode / scheduling in the processor core is. You probably _won't_ be able to achieve 100% utilization of all the pipelines all the time - - I've a sneaky suspicion that George's Pentium code drives the memory bus quite hard already; feeding a faster, more efficient processor from the same bus bandwidth is going to make it more likely that bus saturation, rather than processor throughput capability, will limit the ultimate performance of the system as a whole. > > If you were pipelining fmuls, the Athlon could spit them out in 1 clock > cycle (after the 4 cycle latency), compared to 2 cycles (after the 5 cycle > latency) on the PIII, so it would be REAL important to get lots of yummy > pipelined fmuls to the Athlon to really let it strut its stuff. Am I missing something here? I thought that the _throughput_ was 1 FMUL per clock, but there was a 4 clock period between the instruction entering the execution unit (meaning that it has already had to be prefetched, decoded and the operands made available) and the result of the operation becoming available. 
So, provided there is no delay in fetching instructions, there is capacity in the decoders and there is no stall due to operands being unavailable, you _should_ get a throughput of 1 FMUL per clock (assuming that you aren't also scheduling other instructions which block the multiplier execution unit, or use its pipeline).

Regards
Brian Beesley
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Mon, 23 Aug 1999 16:14:02 -0500
From: "Willmore, David" <[EMAIL PROTECTED]>
Subject: RE: Mersenne: K7 vs. x86

> From: Brian J. Beesley [SMTP:[EMAIL PROTECTED]]
> This is _still_ remarkable, since the "consumer" Athlons starting to
> trickle onto the market have 64-bit 100 MHz FSB and 512KB L2 cache,
> like PII / PIII / Xeon, but run their L2 cache at only 1/3 clock
> speed (c.f. full clock speed for Xeon & 1/2 clock speed for PII/PIII)

Well, they do run the point-to-point bus clock at 100MHz, but they send/receive data on each clock transition, so it's 200 M-transfers/sec. And, I thought the production Athlons were using the 1/2-speed cache -- it was only the pre-production (evaluation) processors which were using the 1/3 speed. Of course, I could easily be wrong.

> The high performance Athlons with 128 bit FSB @ 200 MHz & larger,
> relatively faster L2 cache should be really impressive. Maybe
> starting to approach what Alpha has been doing for a while ;-) (The
> critical difference here is that Athlon does run native IA32 code!)

Yes, 400 M-transfers/second at 16 bytes each is a nice 6.4 GB/s. :) Run the 8M L2 at 1:1 with the processor at 800MHz and get twice that. :) I can't find a line in the document stating the width of the L2 bus, but I would be surprised if it's < 128 bits. 256 for a server version would be nice. 12.8 GB/s to 25.6 GB/s, geez...
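David's bus numbers can be sanity-checked with a little arithmetic. A quick sketch (not from the thread itself): peak bandwidth is just transfers per second times bytes per transfer, where a double-data-rate bus moves data on both clock edges.

```python
# Back-of-the-envelope peak bus bandwidth, per the figures quoted above.
# bandwidth (GB/s) = clock * transfers-per-clock * bus width in bytes

def bandwidth_gb_s(clock_mhz, transfers_per_clock, width_bits):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    transfers_per_sec = clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_sec * (width_bits / 8) / 1e9

# Slot A Athlon: 100 MHz clock, data on both edges, 64-bit data bus
print(bandwidth_gb_s(100, 2, 64))    # 1.6 GB/s (200 M-transfers/s x 8 bytes)

# Hypothetical server part: 200 MHz clock, both edges, 128-bit bus
print(bandwidth_gb_s(200, 2, 128))   # 6.4 GB/s (400 M-transfers/s x 16 bytes)
```

Both results match the figures in the message: 200 M-transfers/s on the consumer bus, and 6.4 GB/s for the rumored 128-bit / 200 MHz configuration.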
> > If you were pipelining fmuls, the Athlon could spit them out in 1 clock
> > cycle (after the 4 cycle latency), compared to 2 cycles (after the 5
> > cycle latency) on the PIII, so it would be REAL important to get lots of
> > yummy pipelined fmuls to the Athlon to really let it strut its stuff.
>
> Am I missing something here? I thought that the _throughput_ was 1
> FMUL per clock, but there was a 4 clock period between the
> instruction entering the execution unit (meaning that it has already
> had to be prefetched, decoded and the operands made available) and
> the result of the operation becoming available. So, provided there is
> no delay in fetching instructions, there is capacity in the decoders
> and there is no stall due to operands being unavailable, you _should_
> get a throughput of 1 FMUL per clock (assuming that you aren't also
> scheduling other instructions which block the multiplier execution
> unit, or use its pipeline).

I believe what he's saying is that there is a bit of 'granularity' to the pipe in the PII. It sounds like an FMUL can only enter the pipe every other cycle -- to emerge five cycles later. If so, that's what used to be called 'superpipelined'. Hmmm, no, that would be backwards. Maybe it's 'subpipelined'. Think of it as a 2.5-cycle-long pipe running at half core speed.

This can result from a not-fully-pipelined stage in the middle of the pipe. Say single precision goes: stageA, stageB, stageC, but double precision goes: stageA, stageB, stageB, stageC, stageC. That way, stage B is used for two successive cycles by the same operand. This will force stage A to stall waiting for the following stage to clear. This isn't all that unlikely when the stages are normally sized for single precision.
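The latency-versus-throughput distinction being argued here can be made concrete with a toy model (my sketch, not tied to any real core): for n back-to-back independent operations, the last result appears at roughly latency + (n - 1) * issue_interval cycles, so for long streams the issue interval, not the latency, dominates.

```python
# Toy pipeline model: n independent ops through one execution unit.
# latency        = cycles from issue until the result is available
# issue_interval = cycles between successive issues (1 = fully pipelined)

def cycles(n_ops, latency, issue_interval):
    """Cycle on which the last of n_ops back-to-back results appears."""
    return latency + (n_ops - 1) * issue_interval

# Fully pipelined FMUL, 4-cycle latency (the Athlon figures quoted above):
print(cycles(100, 4, 1))   # 103 cycles -> ~1 result per clock long-term

# 'Subpipelined' unit that accepts an op only every other clock
# (the PIII double-precision model above, 5-cycle latency):
print(cycles(100, 5, 2))   # 203 cycles -> ~1 result per 2 clocks
```

Either way, latency only bites when each result feeds the next operation; with independent data it is hidden by the pipeline, which is exactly why interleaving two independent streams, as Brian suggested earlier, can keep the pipes full.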
Cheers,
David
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Tue, 24 Aug 1999 00:15:37 -0700
From: Luke Welsh <[EMAIL PROTECTED]>
Subject: Mersenne: Ramanujan on a bad day

via a correspondent in Brasil...

http://www.austms.org.au/Gazette/Nov97/brown.html
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

End of Mersenne Digest V1 #618
******************************
