Mersenne: Mlucas and MacLucasUNIX on Alpha

EWMAYER Wed, 6 Oct 1999 19:40:28 -0700
Dear all: Here is the first installment of my head-to-head performance
comparison for Mlucas and MacLucasUNIX, on all three major generations of the
Alpha architecture. I'll send similar results for MIPS and SPARC in the next
few days.

If the alignments look weird, try using a true-type font.

-Ernst

--------------------

                  GIMPS SOURCE CODE PERFORMANCE CHART

The first (leftmost) column gives the number of 64-bit floats in the array
being transformed. The second column gives per-iteration timings of Prime95
v19 running on a 400MHz Pentium II, as reported by George Woltman
(www.mersenne.org/status.htm). The remaining columns list timings of other
fast LL testing codes running on various platforms.

In parentheses to the right of each time in seconds, I also list the
relative performance index (RPI, in %) with respect to Prime95 v19 running on
a 400MHz Pentium II. The RPI is the ratio of the speed of code X running on
hardware Y to the speed of Prime95 running on a 400MHz PII, for an exponent p,
adjusted for any difference in the clock rate of the two CPUs being
compared. (For codes that allow a similar variety of FFT lengths and have
similar accuracy, we can use FFT length in place of exponent.) Since speed is
inversely related to per-iteration time, the RPI is defined as

          (time for Prime95 to test M (p) on 400MHz PII) * 400 MHz
RPI (p) = ------------------------------------------------------------- x 100%
         (time for code X to test M (p) on CPU Y)* (clock rate of CPU Y)

Example 1: On a 400MHz Alpha 21164, Mlucas 2.7y at 384K takes .316 sec.
           Prime95 needs .211 sec on a 400MHz PII for the same FFT length.
           Since the two CPU clock rates are the same, Mlucas has an RPI of
           (.211/.316)x100% or about 67%, meaning that on the 21164 at that
           runlength, it performs about two-thirds as well as Prime95 on PII.

Example 2: For Mlucas at the same FFT length on the 250 MHz R10K, we have
           a per-iteration time of .292 sec, similar to the 400MHz 21164, but
           the clock rate is lower, hence
 
                               .211 * 400
                         RPI = ---------- x 100% = 116%,
                               .292 * 250

           meaning that on the MIPS, Mlucas performs somewhat better than
           Prime95 running on a PII with the same clock speed.
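
For anyone who wants to recompute the RPI figures, here is a minimal sketch
in Python of the formula above. The function name rpi and its argument names
are my own choices for illustration; the timings are the ones quoted in
Examples 1 and 2.

```python
def rpi(t_ref, t_x, clock_x_mhz, clock_ref_mhz=400.0):
    """Relative performance index, in percent.

    t_ref         -- per-iteration time of Prime95 on the reference PII (sec)
    t_x           -- per-iteration time of code X on CPU Y (sec)
    clock_x_mhz   -- clock rate of CPU Y (MHz)
    clock_ref_mhz -- clock rate of the reference PII (400 MHz in the chart)
    """
    return 100.0 * (t_ref * clock_ref_mhz) / (t_x * clock_x_mhz)


# Example 1: Mlucas 2.7y at 384K on a 400MHz Alpha 21164
example1 = rpi(0.211, 0.316, 400)

# Example 2: Mlucas at 384K on a 250MHz R10K
example2 = rpi(0.211, 0.292, 250)

print(f"Example 1: {example1:.0f}%")   # prints "Example 1: 67%"
print(f"Example 2: {example2:.0f}%")   # prints "Example 2: 116%"
```

The clock-rate factor is what lets a slower-clocked CPU score above 100%, as
in Example 2.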

Also note that for the above examples, the accuracy of the code is similar
enough to Prime95 (pmax about 1-2% less at a given FFT length) to allow us
to just consider FFT length. If code X is significantly less accurate or
allows just power-of-2 FFT lengths, we may have to compare different FFT
lengths in the above formula. This is the case for codes like MacLucasUNIX,
which allows only powers of 2 and jumps to the next power-of-2 runlength much
earlier than Prime95 and Mlucas.

Abbreviations: MLU625 = MacLucasUNIX v6.25
               MLF = MacLucasFFTW
               P95 = Prime95 v19
               n/a = Length not available; must use next-higher power of 2.

           Program, platform, cache sizes / per-iteration time in seconds

        P95   Mlucas2.7y MLU625    Mlucas2.7y MLU625    Mlucas2.7y  MLU625
        PII   Alpha      Alpha     Alpha      Alpha     Alpha       Alpha
        400   21064/200  21064/200 21164/400  21164/400 21264/500   21264/500
    L1: 8KB   ?          ?         8KB L1     8KB L1    64KB L1     64KB L1
    L2: 512K  ?          ?         512KB L2   512KB L2  4MB L2      4MB L2
length  ----  ---------- --------- ---------- --------- ----------- ---------
  96K   .045  .127 (70%)   n/a     .057 (79%)   n/a     .025 (138%)   n/a    
 112K   .055  .155 (71%)   n/a     .070 (80%)   n/a     .031 (142%)   n/a    
 128K   .060  .172 (70%) .312(38%) .077 (78%) .098(61%) .034 (141%) .036(133%)
 160K   .083  .223 (73%)   n/a     .099 (84%)   n/a     .044 (148%)   n/a    
 192K   .098  .272 (72%)   n/a     .120 (82%)   n/a     .052 (148%)   n/a    
 224K   .119  .345 (68%)   n/a     .146 (77%)   n/a     .064 (149%)   n/a    
 256K   .132  .370 (64%) .679(39%) .161 (82%) .220(60%) .069 (153%) .078(126%)
 320K   .173  .544 (63%)   n/a     .251 (69%)   n/a     .090 (150%)   n/a    
 384K   .211  .695 (60%)   n/a     .316 (67%)   n/a     .107 (153%)   n/a    
 448K   .252  .880 (57%)   n/a     .417 (60%)   n/a     .132 (153%)   n/a    
 512K   .281  1.03 (55%) 1.45(39%) .472 (60%) .459(61%) .146 (152%) .178(126%)
 640K   .372  1.32 (56%)   n/a     .648 (57%)   n/a     .207 (138%)   n/a    
 768K   .453  1.60 (57%)   n/a     .782 (58%)   n/a     .257 (133%)   n/a    
 896K   .536  1.93 (56%)   n/a     .920 (58%)   n/a     .326 (128%)   n/a    
1024K   .600  2.14 (56%) 3.01(40%) .990 (61%) 1.07(56%) .363 (127%) .461(104%)
1280K   .776  3.00 (52%)   n/a     1.35 (57%)   n/a     .480 (122%)   n/a    
1536K   .934  3.66 (51%)   n/a     1.82 (51%)   n/a     .656 (108%)   n/a    
1792K   1.11  4.46 (50%)   n/a     2.15 (52%)   n/a     .789 (110%)   n/a    
2048K   1.23  4.85 (51%) 6.50(38%) 2.36 (52%) 2.84(43%) .880 (111%) 1.36(72%)
2560K   1.64  6.42 (51%)   n/a     3.20 (51%)   n/a     1.23 (110%)   n/a    
3072K   1.99  7.73 (51%)   n/a     3.89 (51%)   n/a     1.48 (103%)   n/a    
3584K   2.38  9.23 (52%)   n/a     4.57 (52%)   n/a     1.79 (105%)   n/a    
4096K   2.60  10.2 (51%) 14.0(37%) 5.02 (52%) 7.42(35%) 2.01 (103%) 3.70(56%)

TIMINGS SUMMARY: The only place MacLucasUNIX outperforms Mlucas is on the
21164 at FFT length 512K, where MLU seems to benefit from a fortuitous cache
alignment - at lengths greater than this, things deteriorate rapidly.

ACCURACY SUMMARY: Here are the FFT length/exponent breakpoints for the
three fastest codes. Prime95 is best, since it is able to take advantage of
the x86 80-bit floating-point register format. Mlucas is close behind, with
a pmax just 1-2% lower than Prime95 at each runlength. MacLucasUNIX is the
worst of the lot, even when compiled (on Alpha Unix) using
-assume accuracy_sensitive to prevent the compiler from overaggressive
reordering of floating-point operations. Note that if you're compiling using
-fast you MUST also use the above -assume flag; otherwise Mlucas won't run
and MacLucasUNIX won't be able to do round-off checking, i.e. it has no way
of telling whether the FFT length it is using is appropriate for the number
under test.

        Maximum exponent (millions)
        Prime95  Mlucas2.7  MLU625
length  -------  ---------  ------
  96K   1.990    1.983      n/a
 112K   2.323    2.310      n/a
 128K   2.656    2.610      ~2.38
 160K   3.290    3.260      n/a
 192K   3.935    3.910      n/a
 224K   4.598    4.550      n/a
 256K   5.250    5.160      ~4.98
 320K   6.515    6.420      n/a
 384K   7.730    7.700      n/a
 448K   9.020    8.950      n/a
 512K   10.32    10.20      ~9.3
 640K   12.83    12.70      n/a
 768K   15.27    15.20      n/a
 896K   17.85    17.60      n/a
1024K   20.40    20.10      ~18.8
1280K   25.33    25.00      n/a
1536K   30.10    29.80      n/a
1792K   35.10    34.70      n/a
2048K   40.25    39.40      ~37
2560K   49.90    49.10      n/a
3072K   59.40    59.10      n/a
3584K   69.00    68.50      n/a
4096K   79.30    78.00      < 75
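
To show how these breakpoints get used, here is a rough sketch in Python that
picks the smallest FFT length whose maximum exponent covers a given p. The
function name and list names are hypothetical, chosen only for this
illustration. The numbers are copied from the Mlucas2.7 and MLU625 columns
above, with two caveats: the MLU625 values are approximate, and the 4096K
entry is left out because it is given only as an upper bound.

```python
from bisect import bisect_left

# FFT lengths (in K) and maximum exponents (in millions) for Mlucas 2.7
MLUCAS_LENGTHS_K = [96, 112, 128, 160, 192, 224, 256, 320, 384, 448, 512,
                    640, 768, 896, 1024, 1280, 1536, 1792, 2048, 2560,
                    3072, 3584, 4096]
MLUCAS_PMAX = [1.983, 2.310, 2.610, 3.260, 3.910, 4.550, 5.160, 6.420,
               7.700, 8.950, 10.20, 12.70, 15.20, 17.60, 20.10, 25.00,
               29.80, 34.70, 39.40, 49.10, 59.10, 68.50, 78.00]

# MacLucasUNIX v6.25: powers of 2 only, approximate breakpoints
MLU_LENGTHS_K = [128, 256, 512, 1024, 2048]
MLU_PMAX = [2.38, 4.98, 9.3, 18.8, 37.0]


def smallest_fft_length(p_millions, lengths_k, pmax_millions):
    """Return the smallest length (in K) with pmax >= p, or None if
    the exponent is beyond the last listed breakpoint."""
    i = bisect_left(pmax_millions, p_millions)
    return lengths_k[i] if i < len(lengths_k) else None


p = 2.5  # exponent in millions
mlucas_len = smallest_fft_length(p, MLUCAS_LENGTHS_K, MLUCAS_PMAX)
mlu_len = smallest_fft_length(p, MLU_LENGTHS_K, MLU_PMAX)
print(mlucas_len, mlu_len)
```

For an exponent near 2.5 million this selects 128K for Mlucas but 256K for
MLU625, so a fair per-exponent comparison on the 21264 would pit .034 sec
against .078 sec, not the two 128K timings.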

MEMORY USAGE: Prime95 and Mlucas need little storage besides the LL residue
itself, i.e. (runlength x 8 bytes) + perhaps 10% extra for FFT sincos and
DWT weights tables and bit-reversal index arrays. MacLucasUNIX, on the other
hand, is a memory hog - at 4096K it needs a whopping 244MB, compared to just
33MB for Prime95 and Mlucas.
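
As a back-of-the-envelope check on the storage estimate, here is a small
Python sketch. It assumes that "K" means 1024 array elements and that each
element is one 8-byte float; the 10% overhead is just the rough figure
mentioned above, not a measured value, and the function names are mine.

```python
def residue_bytes(length_k):
    """Bytes for the LL residue array: length_k * 1024 floats of 8 bytes."""
    return length_k * 1024 * 8


def estimated_total_bytes(length_k, overhead=0.10):
    """Residue array plus a rough allowance for sincos/weights/index tables."""
    return residue_bytes(length_k) * (1.0 + overhead)


n = 4096
print(f"residue array at {n}K: {residue_bytes(n) / 1e6:.1f} million bytes")
print(f"with ~10% for tables:  {estimated_total_bytes(n) / 1e6:.1f} million bytes")
```

The residue array alone comes to roughly 33.6 million bytes at 4096K, which
is in line with the 33MB figure quoted for Prime95 and Mlucas.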



_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers