As I noted in my previous message, a beta of Mlucas v2.7 is now available.
ftp://209.133.33.168/pub/mayer/README has details. Source code and binaries
for selected platforms (Alpha/Unix, SGI Irix) are available. Alex Kruppa
mentioned that he may have access to the SPARC F90 v2 compiler (hopefully
better than v1), in which case we may also have SPARC binaries soon.
I'll post a Linux binary once any near-term bugfixes have been taken care of.
(I don't want to bug my contacts at Compaq (compaqts?) more than necessary.)

What's new? Here are the major items:

1) The code now uses an in-place FFT and gets all its FFT sincos and DWT
   weights data via small-table multiplies rather than by reading large
   arrays. Overall the memory needed is 1/4 that of v2.6 - just a bit
   over 8 MB at FFT length 1024K, for example. This greatly improves cache
   utilization and leads to nice performance gains - 20-40% faster on Alpha
   (21064 and 21164 - haven't had a chance to try the 21264 yet), 10-25%
   faster on MIPS.

2) Filename conventions have been changed to be more similar to Prime95.

3) The code now reads just the first entry in the worktodo.ini file at
   the start of each new run, allowing users to add/delete/modify other
   entries in the file at will.

4) The code now saves each run's statistics (what used to get written to
   the status file and then deleted at the end of the run) in its own
   file. This will make it easier to compare timings of future versions
   and (in case of results later shown to be bad) pinpoint where problems
   occurred.

5) The code can test exponents up to 78.3 million.
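
As a rough check on the memory figure in item 1, here is a minimal sketch
that computes the size of a single in-place data array. It assumes 8-byte
double-precision words (my assumption; the word size is not stated above),
and the function name is purely illustrative, not part of Mlucas.

```python
def fft_array_mib(fft_length_k, bytes_per_word=8):
    """Size in MiB of one array holding fft_length_k * 1024 words.

    bytes_per_word=8 assumes double-precision reals (an assumption).
    """
    words = fft_length_k * 1024
    return words * bytes_per_word / (1024 * 1024)


# FFT length 1024K, as in the example in item 1.
print(fft_array_mib(1024))
```

Under that assumption the array alone comes to 8 MiB at 1024K, which is
consistent with "just a bit over 8 MB" once the small tables are added.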

Note that v2.7 savefiles are incompatible with v2.6 - I didn't want to add
the complicated logic that would be needed to allow for backwards 
compatibility.

OK, now for the stuff users really want to see - here are timings on Alpha
(21064, 21164) and MIPS R10000, the two platforms I have easy access to
(you can compare these to the v2.6 timings I posted on 21 August):

                                Platform/per-iteration time (sec)
            200MHz21064 400MHz21164 195MHz R10K 250MHz R10K
        cache sizes 8KB L1  32kB L1 32kB L1
        unknown 96KB mixed
                512KB L2    4MB L2  1MB L2       
FFT length: ----------  ----------  ----------  -------------
  64K       .095        .035        .041        .035
  80K       .12         .045        .054        .047
  96K       .16         .057        .069        .062
 112K       .19         .069        .082        .074
 128K       .21         .078        .100        .090
 160K       .27         .098        .118        .115
 192K       .32         .115        .143        .144
 224K       .39         .140        .170        .170
 256K       .48         .177        .221        .210
 320K       .65         .241        .261        .248
 384K       1.06        .316        .345        .317
 448K       1.29        .399        .388        .354
 512K       1.39        .545        .525        .451
 640K       1.88        .620        .649        .543
 768K       2.35        .756        .814        .659
 896K       2.73        .890        .932        .771
1024K       2.96        1.20*       1.16        .937
1280K       3.20        1.32        1.40        1.13
1536K       4.15        1.86*       1.90*       1.54*
1792K       4.99        2.13        2.04        1.68
2048K       5.45        2.73        2.57        2.22
2560K       6.93        3.16        3.25        2.61
3072K       8.33        4.02        3.92        3.16
3584K       9.96        4.53        4.58        3.69
4096K       11.42       5.62        6.14        7.26*

Thus, things are looking pretty good for runlengths of current interest
(64-1024K). I've only begun playing with optimization of the new in-place
scheme - this should help the really big runlengths a lot.
The only obviously anomalous timings are marked with asterisks and are:

a) 400MHz 21164 at 1024K (reason unknown);
b) 250MHz R10K  at 4096K (weird, but unlikely to affect your current work :)
c) 21164 and both R10Ks at 1536K - until I get around to writing a set of
   radix-12 pass routines, this FFT length needs one more pass through the
   data than its neighbors. Strangely, this doesn't hurt on the 21064.
   (Not that you'd ever want to run a full LL test with 1536K FFT on a 21064.)
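
To put that last remark in numbers, here is a small sketch that turns a
per-iteration time into an estimated full-test duration. It relies only on
the fact that a Lucas-Lehmer test of M(p) takes p-2 squarings. The exponent
of 30 million is a hypothetical value I picked for illustration, not a
figure from the table, and the function name is likewise made up.

```python
def ll_test_days(exponent, sec_per_iter):
    """Estimated wall-clock days for a Lucas-Lehmer test of M(exponent)."""
    iterations = exponent - 2  # an LL test of M(p) needs p-2 squarings
    return iterations * sec_per_iter / 86400.0


# 4.15 sec/iteration is the 200MHz 21064 timing at 1536K from the table.
# The exponent 30,000,000 is hypothetical, chosen only for illustration.
print(round(ll_test_days(30_000_000, 4.15)))  # about 1441 days
```

At that illustrative exponent the test would run for close to four years.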

        *       *       *       *

In the coming weeks I'll flesh out the above table, e.g. by adding timings
for more platforms as well as comparisons with Prime95 v19 and MacLucasUNIX.
That will be the basis for a timings webpage. I can also add timings for
other codes, but to keep things from getting out of hand, I plan to use
at least a few reasonable criteria a candidate code should meet to get
included on the timings page. This may seem heavy-handed, but we need to
allow non-Intel clients to quickly determine what works best on their
hardware, rather than just telling them to pick one of the dozen-odd codes
from the "other available source code" page, which lists no performance
benchmarks. Here is what I propose - your comments are welcome:

1) Must allow exponents up to 20 million (this will keep shifting as GIMPS
   work progresses).

2) Must exhibit a relative performance index (RPI) of at least 33% for at
   least half the non-factored exponents in the range

    {lower limit of current double-checking} <= p <= {limit set in (1)},

   on at least one reasonably popular platform (say, at least 10000 such
   CPUs have been sold). The RPI for an exponent p is defined as follows:

         (time for Prime95 to test M(p) on x86) *(clock rate of x86)
RPI(p) = ------------------------------------------------------------- x 100%
         (time for code X to test M(p) on CPU Y)*(clock rate of CPU Y)

For example, Mlucas 2.7 at 384K takes .316 sec per iteration on the 400MHz
21164. George gives a time of .211 sec on his 400MHz PII for the same FFT
length. (The accuracy is similar enough to Prime95 to allow us to just
consider FFT length - if code X is significantly less accurate or allows
just power-of-2 FFT lengths, we may have to compare different FFT lengths
in the above formula.) Since the two CPU clock rates are the same, Mlucas
has an RPI of (.211/.316)x100%, or about 67%, meaning that at that
runlength, it performs about 67% as well on the 21164 as Prime95 on the PII.

For the same FFT length, on the 250 MHz R10K, we have nearly the same
per-iteration time as on the 400MHz 21164, but the clock rate is lower:

      .211 * 400
RPI = ---------- x 100% = 106%,
      .317 * 250

meaning that, clock for clock, Mlucas on the MIPS performs slightly better
than Prime95 on the PII.
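
For anyone who wants to plug in their own timings, the two worked examples
above can be reproduced with a short sketch like the following. The function
and argument names are my own, chosen for illustration; only the formula and
the timings come from this message.

```python
def rpi(ref_time, ref_mhz, time_x, mhz_y):
    """Relative performance index, in percent.

    ref_time, ref_mhz: per-iteration time (sec) and clock rate of the
                       Prime95/x86 reference.
    time_x, mhz_y:     per-iteration time (sec) of code X and clock rate
                       of CPU Y.
    """
    return (ref_time * ref_mhz) / (time_x * mhz_y) * 100.0


# Reference: Prime95 on a 400MHz PII at 384K, .211 sec per iteration.
print(round(rpi(0.211, 400, 0.316, 400), 1))  # Mlucas on the 400MHz 21164
print(round(rpi(0.211, 400, 0.317, 250), 1))  # Mlucas on the 250MHz R10K
```

Both results clear the 33% threshold proposed in criterion 2.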

3) Source code (except possibly for things like validation keys) freely
   available.

Cheers,
-Ernst

p.s.: As this message is rather long, if you reply to the list, *please* don't
append the whole thing.
