Mersenne: Re: DS-20 timings

EWMAYER Fri, 20 Aug 1999 18:11:11 -0700
Dear All: I'm catching up on lots of postings, so forgive me if this
is long-winded.

Simon Burge writes:

>Compaq have a DS20 Alpha with 2 500MHz 21264 CPUs on the internet
>for people to try out.

This is a dual-CPU version of the same kind of machine (a.k.a. ev6
or 21264 - the naming profusion keeps increasing with each new
generation, apparently) that David Willmore used to verify M#38 using a
beta of my Mlucas v2.6 code (formerly lucas_mayer - Michael Taylor,
who used to maintain the GIMPS source code page, picked that name,
whereas I prefer just Mlucas).
We got timings competitive with Prime95 on the fastest available
Pentia, and that was using HLL compiled code with no architecture-
specific tunings. The 21264 is a sweet piece of hardware, indeed.

>Ernst - since nigel is no more, where can I get the latest f90
>code? I've got 2.5b, and it's giving me some errors:

I'm not sure why it's error-exiting, but since you compiled locally
I suspect overly aggressive compile options. In particular, if you
use -fast, you must also use -assume accuracy_sensitive, to keep
the compiler from eliminating the (x+rnd)-rnd operations used to
effect a fast NINT in the carry phase. Also, you might try both
-O4 and -O5: the latter sometimes gives slower executables, in a
platform-dependent way, and should be used with caution.
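
For readers unfamiliar with the (x+rnd)-rnd trick, here is a minimal
sketch in Python rather than F90. It assumes IEEE 754 double precision
in the default round-to-nearest-even mode; the constant 1.5*2^52 and
the name fast_nint are illustrative choices, not taken from the Mlucas
source. A compiler that treats floating-point addition as associative
may simplify the expression to plain x, which is the failure the
-assume accuracy_sensitive option guards against.

```python
# Assumed constant: 1.5 * 2**52. Adding it moves any |x| < 2**51 into
# [2**52, 2**53), where adjacent doubles are exactly 1 apart, so the
# addition itself rounds x to an integer.
RND = 1.5 * 2.0**52

def fast_nint(x):
    """Round x to the nearest integer using one add and one subtract."""
    # The subtraction is exact, leaving only the rounded integer part.
    return (x + RND) - RND

if __name__ == "__main__":
    for x in (3.7, -1.2, 2.5, 123456.49):
        print(x, fast_nint(x))
```

Exact halves go to the even neighbor here (2.5 gives 2.0), whereas the
Fortran NINT intrinsic rounds halves away from zero.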

In any event, you can get the new improved version 2.6b via

ftp://209.133.33.182/pub/mayer/Mlucas_2.6b.f90.gz

(David Willmore is the only person who got the short-lived v2.6a;
David, 2.6b is about 10% faster, so you may want to grab it now.)

If you don't have an F90 compiler or don't want to compile locally,
binary executables for Alpha Unix (formerly OSF/1) and SGI are also
available in the /bin directory of the above site. Please see the
README file for more info.

I am waiting for a Linux executable, hopefully by the middle of next
week. Alex Kruppa has several different F90 compilers on his SPARC
which he said he'd try out on the code, but as his father passed
away a few days ago, he surely has more pressing concerns.

One can only hope there's now/soon a better F90 compiler for SPARC
than their dismal V1 effort - that won't even do 64-bit loads and
stores, even when one specifies such in the options!

Aaron Blosser writes:

>ps - Is anyone else besides me happy that Compaq is ditching
>that boring old "computer beige" in favor of the "opal" (basically
>white) color for their servers?  I think they look snazzy.

My DEC 21164 server is "Top Gun blue" - is that faster?

Brian Beesley writes (about MacLucasUnix):

>I find, running MLU on a Alpha 21164-533, 128K FFT works up to about 
>exponent 2.35 million, & pro rata. MLU on a Sparc seems to be able to 
>run a bit higher, somewhere around 2.45 million seems to be OK for a 
>128K FFT.

Hmm, those upper limits seem a bit low. Does MacLucasUnix tell you when
the exponent is too large? Some related postings in the last digest
seem to show people using exponents much too large for a given FFT size
but getting no error messages - that would be bad. For comparison,
Mlucas, on machines that support real*16 sincos inits (Alpha and SGI),
can go up to the following p's (I omit 160, 192 and 224K for brevity):

size: 128K   256K   320K   384K   448K   512K   640K   768K   896K  1024K
pmax: 2.62M  5.20M  6.46M  7.71M  8.96M  10.2M  12.6M  15.1M  17.5M  20M

On strictly real*8-type hardware, the upper limits are about 1% less.
The only cost of real*16 inits is a greater initialization time.
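
One way to read these limits is as average bits of the exponent
carried per FFT element, i.e. pmax divided by the FFT length. The
sketch below does that arithmetic on the figures above, assuming that
"K" means 1024 elements and "M" means 10^6; the function name is
hypothetical.

```python
# pmax limits as quoted above: FFT length in K -> pmax in millions.
PMAX = {128: 2.62, 256: 5.20, 320: 6.46, 384: 7.71, 448: 8.96,
        512: 10.2, 640: 12.6, 768: 15.1, 896: 17.5, 1024: 20.0}

def bits_per_word(size_k, p_millions):
    """Average bits per FFT element for exponent p at the given length."""
    return p_millions * 1e6 / (size_k * 1024)

if __name__ == "__main__":
    for size_k in sorted(PMAX):
        print(f"{size_k:5d}K  {bits_per_word(size_k, PMAX[size_k]):.2f}")
```

On the same measure, the 2.35 million figure quoted for MacLucasUnix
at 128K works out to roughly 17.9 bits per element.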

OK, now down to the nuts and bolts of per-iteration timings: I'll list
only my own timings of Mlucas 2.6b here - since I'll be maintaining the
GIMPS timings page, I hope people will keep sending/posting timings for
various platforms. In particular I'd like MacLucasUnix timings for SGIs
as similar as possible to the ones below.

SUMMARY: in general it appears that MacLucasUnix is somewhat faster than
Mlucas at a given power-of-2 length. Most of the difference appears to be
due to the fact that MacLucasUnix uses an in-place transform strategy
like Prime95, whereas Mlucas uses an out-of-place transform, thus needing
about double the memory and spilling over into RAM for FFT lengths where
MacLucasUnix still just fits into the L2 cache.

On the other hand, I found that using an out-of-place transform scheme
made the code much easier to debug, and also eased the adding of
non-power-of-2 FFT lengths, which MacLucasUnix lacks. Thus Mlucas should
be faster in the lower half of each power-of-2 interval, the two codes
should be roughly equal in the third quarter of the interval, and
MacLucasUnix should be faster in the uppermost quarter.
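
To illustrate why the intermediate lengths matter, here is a rough
sketch that picks the smallest FFT length able to handle a given
exponent, with and without the non-power-of-2 sizes. It uses the
Mlucas pmax limits quoted earlier in this message; applying those same
limits to a power-of-2-only code is an assumption made purely for
illustration, and the function name is hypothetical.

```python
# Mlucas pmax limits from this message: FFT length in K -> pmax in millions.
PMAX = {128: 2.62, 256: 5.20, 320: 6.46, 384: 7.71, 448: 8.96,
        512: 10.2, 640: 12.6, 768: 15.1, 896: 17.5, 1024: 20.0}

def smallest_length(p_millions, pow2_only=False):
    """Return the smallest listed FFT length (in K) that can handle p."""
    for size_k in sorted(PMAX):
        # size & (size - 1) is nonzero exactly when size is not a power of 2.
        if pow2_only and size_k & (size_k - 1):
            continue
        if p_millions <= PMAX[size_k]:
            return size_k
    return None  # exponent exceeds every listed limit

if __name__ == "__main__":
    p = 6.0
    print(smallest_length(p), smallest_length(p, pow2_only=True))
```

For an exponent near 6.0 million this selects 320K when all lengths
are allowed and 512K when only powers of 2 are available.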

                       Platform/per-iteration time (sec)
            200MHz 21064   400MHz 21164   195MHz R10000   250MHz R10000
            cache sizes    8kB D-cache    32kB D-cache    32kB D-cache
            unknown        96kB mixed I/D                 
                           512kB L2       4MB L2          1MB L2
FFT length: ------------   ------------   -------------   -------------
 128K       0.32           0.12           0.096           0.095
 160K       0.37           0.17           0.14            0.14
 192K       0.48           0.22           0.17            0.17
 224K       0.58           0.26           0.21            0.20
 256K       0.63           0.29           0.25            0.23
 320K       0.87           0.39           0.33            0.29
 384K       1.06           0.49           0.40            0.35
 448K       1.29           0.58           0.49            0.42
 512K       1.39           0.65           0.56            0.47
 640K       1.88           0.84           0.70            0.60
 768K       2.35           1.15           0.96            0.80
 896K       2.73           1.22           1.04            0.86
1024K       2.96           1.36           1.17            0.96

David Willmore's timings of the beta of Mlucas 2.6 indicate that the
code runs about 3 times faster on a 500MHz 21264 than on a 400MHz 21164,
so dividing the numbers in the 21164 column by 3 should yield a decent
estimate of 21264 timings until I can get some actual timing data.
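
As a rough sketch of that estimate, the snippet below divides the
21164 column by 3 and converts a per-iteration time into days for a
full test. The factor of 3 is the approximate ratio quoted above, not
a measurement; the day count relies on the fact that a Lucas-Lehmer
test of M(p) takes p-2 iterations. Function names are hypothetical.

```python
# 400MHz 21164 per-iteration times (sec) from the table above,
# keyed by FFT length in K.
T21164 = {128: 0.12, 160: 0.17, 192: 0.22, 224: 0.26, 256: 0.29,
          320: 0.39, 384: 0.49, 448: 0.58, 512: 0.65, 640: 0.84,
          768: 1.15, 896: 1.22, 1024: 1.36}

def est_21264(size_k, speedup=3.0):
    """Estimated 500MHz 21264 per-iteration time, assuming a ~3x speedup."""
    return T21164[size_k] / speedup

def days_for_test(p, sec_per_iter):
    """Days for a full Lucas-Lehmer test of M(p), which needs p-2 iterations."""
    return (p - 2) * sec_per_iter / 86400.0

if __name__ == "__main__":
    for size_k in sorted(T21164):
        print(f"{size_k:5d}K  {est_21264(size_k):.3f} sec/iter")
    print(f"{days_for_test(6000000, est_21264(320)):.1f} days")
```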

Now, if assembly coding could squeeze out gains anywhere near those
Jason Papadopoulos obtained on the SPARC via the ASM route, that would
be impressive indeed.

Cheers,
Ernst
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers