As I noted in my previous message, a beta of Mlucas v2.7 is now available; ftp://209.133.33.168/pub/mayer/README has details. There are source code and binaries for selected platforms (Alpha/Unix, SGI Irix). Alex Kruppa mentioned that he may have access to the SPARC F90 v2 compiler (hopefully better than v1), in which case we may also have SPARC binaries soon. I'll post a Linux binary once any near-term bugfixes have been taken care of. (I don't want to bug my contacts at Compaq (compaqts?) more than necessary.)

What's new? Here are the major items:

1) The code now uses an in-place FFT, and gets all its FFT sincos and DWT weights data via small-table multiplies rather than by reading large arrays. Overall the memory needed is 1/4 that of v2.6 - just a bit over 8 MB at FFT length 1024K, for example. This greatly improves cache utilization and leads to nice performance gains - 20-40% faster on Alpha (21064 and 21164 - haven't had a chance to try the 21264 yet), 10-25% faster on MIPS.

2) Filename conventions have been changed to be more similar to Prime95.

3) The code now reads just the first entry in the worktodo.ini file at the start of each new run, allowing users to add/delete/modify other entries in the file at will.

4) The code now saves each run's statistics (what used to get written to the status file and then deleted at the end of the run) in its own file. This will make it easier to compare timings of future versions and (in case of results later shown to be bad) pinpoint where problems occurred.

5) The code can test exponents up to 78.3 million.

Note that v2.7 savefiles are incompatible with v2.6 - I didn't want to add the complicated logic that would be needed to allow for backwards compatibility.
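To illustrate the "small-table multiplies" idea in item 1, here is a minimal Python sketch. It is not Mlucas code, and the post does not describe the actual table layout; the two-table split below (one table of about sqrt(n) fine steps, one of about sqrt(n) coarse steps) and the names build_twiddle_tables and twiddle are assumptions made for illustration. The point is only that any of the n roots of unity can be rebuilt from one complex multiply of two small-table entries, so roughly 2*sqrt(n) values are stored instead of n.

```python
import cmath
import math


def build_twiddle_tables(n):
    """Build two small tables whose pairwise products give every n-th root of unity.

    Hypothetical layout (an assumption, not taken from Mlucas):
      lo[a] = exp(2*pi*i * a / n)      for a in [0, m)
      hi[b] = exp(2*pi*i * b * m / n)  for b in [0, ceil(n / m))
    with m = ceil(sqrt(n)).
    """
    m = math.isqrt(n - 1) + 1  # ceil(sqrt(n)) for n >= 1
    lo = [cmath.exp(2j * math.pi * a / n) for a in range(m)]
    hi = [cmath.exp(2j * math.pi * (b * m) / n) for b in range((n + m - 1) // m)]
    return lo, hi, m


def twiddle(k, lo, hi, m):
    """Return exp(2*pi*i * k / n) using one complex multiply of two table entries."""
    return hi[k // m] * lo[k % m]


if __name__ == "__main__":
    n = 1024
    lo, hi, m = build_twiddle_tables(n)
    worst = max(abs(twiddle(k, lo, hi, m) - cmath.exp(2j * math.pi * k / n))
                for k in range(n))
    print(f"n = {n}: {len(lo) + len(hi)} stored values, max error {worst:.2e}")
```

The same trick applies in principle to any table whose entries are products of a coarse and a fine factor; whether Mlucas handles the DWT weights in exactly this way is not stated in the post.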
OK, now for the stuff users really want to see - here are timings on Alpha 21064, 21164 and MIPS R10000, the two platforms I have easy access to. (You can compare these to the v2.6 timings I posted on 21 August.)

Platform/per-iteration time (sec)

              200MHz 21064  400MHz 21164  195MHz R10K   250MHz R10K
cache sizes:  8KB L1        32kB L1       32kB L1       unknown
                            96KB mixed
              512KB L2      4MB L2        1MB L2
FFT length:   ------------  ------------  -----------   -----------
  64K           .095          .035          .041          .035
  80K           .12           .045          .054          .047
  96K           .16           .057          .069          .062
 112K           .19           .069          .082          .074
 128K           .21           .078          .100          .090
 160K           .27           .098          .118          .115
 192K           .32           .115          .143          .144
 224K           .39           .140          .170          .170
 256K           .48           .177          .221          .210
 320K           .65           .241          .261          .248
 384K          1.06           .316          .345          .317
 448K          1.29           .399          .388          .354
 512K          1.39           .545          .525          .451
 640K          1.88           .620          .649          .543
 768K          2.35           .756          .814          .659
 896K          2.73           .890          .932          .771
1024K          2.96          1.20*         1.16           .937
1280K          3.20          1.32          1.40          1.13
1536K          4.15          1.86*         1.90*         1.54*
1792K          4.99          2.13          2.04          1.68
2048K          5.45          2.73          2.57          2.22
2560K          6.93          3.16          3.25          2.61
3072K          8.33          4.02          3.92          3.16
3584K          9.96          4.53          4.58          3.69
4096K         11.42          5.62          6.14          7.26*

Thus, things are looking pretty good for runlengths of current interest (64-1024K). I've only begun playing with optimization of the new in-place scheme - this should help the really big runlengths a lot. The only obviously anomalous timings are marked with asterisks and are:

a) 400MHz 21164 at 1024K (reason unknown);

b) 250MHz R10K at 4096K (weird, but unlikely to affect your current work :)

c) 21164 and both R10Ks at 1536K - until I get around to writing a set of radix-12 pass routines, this FFT length needs one more pass through the data than its neighbors. Strangely, this doesn't hurt on the 21064. (Not that you'd ever want to run a full LL test with 1536K FFT on a 21064.)

* * * *

In the coming weeks I'll flesh out the above table, e.g. by adding timings for more platforms as well as comparisons with Prime95 v19 and MacLucasUNIX. That will be the basis for a timings webpage.
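To put the per-iteration timings in perspective: a Lucas-Lehmer test of M(p) takes p-2 squarings, so the wall-clock time for a full test is roughly (p-2) times the per-iteration time. The Python sketch below does just that arithmetic. The function name ll_test_days is made up for this example, and the exponent used is purely illustrative; the post does not say which exponents each FFT length can handle, so no such pairing is implied.

```python
SECONDS_PER_DAY = 86400.0


def ll_test_days(p, sec_per_iter):
    """Estimate days for a full Lucas-Lehmer test of M(p).

    An LL test needs p - 2 modular squarings; this ignores startup,
    savefile writes and any other overhead (a simplifying assumption).
    """
    return (p - 2) * sec_per_iter / SECONDS_PER_DAY


if __name__ == "__main__":
    # Illustrative only: a hypothetical exponent of 5,000,000 timed at
    # .177 sec/iteration (the 256K entry for the 400MHz 21164 above).
    days = ll_test_days(5_000_000, 0.177)
    print(f"about {days:.1f} days")
```

With those illustrative inputs the estimate comes out at a little over 10 days.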
I can also add timings for other codes, but to keep things from getting out of hand, I plan to use at least a few reasonable criteria a candidate code should meet to get included on the timings page. This may seem heavyhanded, but we need to allow non-Intel clients to quickly determine what works best on their hardware, rather than just telling them to pick one of the dozen-odd codes from the "other available source code" page, which lists no performance benchmarks. Here is what I propose - your comments are welcome:

1) Must allow exponents up to 20 million (this will keep shifting as GIMPS work progresses).

2) Must exhibit a relative performance index (RPI) of at least 33% for at least half the non-factored exponents in the range {lower limit of current double-checking} <= p <= {limit set in (1)}, on at least one reasonably popular platform (say, at least 10000 such CPUs have been sold). The RPI for an exponent p is defined as follows:

          (time for Prime95 to test M(p) on x86) * (clock rate of x86)
RPI(p) = --------------------------------------------------------------- x 100%
          (time for code X to test M(p) on CPU Y) * (clock rate of CPU Y)

For example, Mlucas 2.7 at 384K takes .316 sec per iteration on the 400MHz 21164. (The accuracy is similar enough to Prime95 to allow us to just consider FFT length - if code X is significantly less accurate or allows just power-of-2 FFT lengths, we may have to compare different FFT lengths in the above formula.) George gives a time of .211 sec on his 400MHz PII for the same FFT length. Since the two CPU clock rates are the same, Mlucas has an RPI of (.211/.316) x 100%, or about 67%, meaning that at that runlength it performs about 67% as well on the 21164 as Prime95 does on the PII. For the same FFT length, on the 250MHz R10K, we have nearly the same per-iteration time as on the 400MHz 21164, but the clock rate is lower:

       .211 * 400
RPI = ------------ x 100% = 106%,
       .317 * 250

meaning that, clock for clock, Mlucas on the MIPS performs slightly better than Prime95 running on a PII.
3) Source code (except possibly for things like validation keys) freely available.

Cheers,
-Ernst

p.s.: As this message is rather long, if you reply to the list, *please* don't append the whole thing.
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
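The RPI definition in criterion 2, and its two worked examples, can be checked with a few lines of Python. This is only a sketch of the arithmetic as stated in the post; the function names rpi and meets_criterion_2 and the argument names are made up for illustration, and the 33% threshold is the proposed value, which may change after discussion.

```python
def rpi(ref_time, ref_mhz, x_time, x_mhz):
    """Relative performance index, in percent.

    ref_time, ref_mhz: Prime95 time and clock rate on the x86 reference.
    x_time, x_mhz:     time and clock rate for code X on CPU Y.
    Times may be per-iteration, provided both use the same FFT length.
    """
    return 100.0 * (ref_time * ref_mhz) / (x_time * x_mhz)


def meets_criterion_2(rpi_percent, threshold=33.0):
    """True if a single RPI value clears the proposed 33% bar."""
    return rpi_percent >= threshold


if __name__ == "__main__":
    # 384K FFT: Prime95 at .211 sec on a 400MHz PII (figure quoted in the post).
    alpha = rpi(0.211, 400, 0.316, 400)  # Mlucas 2.7 on the 400MHz 21164
    mips = rpi(0.211, 400, 0.317, 250)   # Mlucas 2.7 on the 250MHz R10K
    print(f"21164: {alpha:.0f}%  R10K: {mips:.0f}%")  # prints "21164: 67%  R10K: 106%"
```

Note that the full criterion asks for this bar to be met on at least half the non-factored exponents in the stated range, which the single-value check above does not attempt to model.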
