Dear all: Here is the first installment of my head-to-head performance
comparison of Mlucas and MacLucasUNIX on all three major generations of the
Alpha architecture. I'll send similar results for MIPS and SPARC in the next
few days.
If the alignments look weird, try using a fixed-width (monospaced) font.
-Ernst
--------------------
GIMPS SOURCE CODE PERFORMANCE CHART
The first (leftmost) column gives the number of 64-bit floats in the array
being transformed. The second column gives per-iteration timings of Prime95
v19 running on a 400MHz Pentium II, as reported by George Woltman
(www.mersenne.org/status.htm). The remaining columns list timings of other
fast LL testing codes running on various platforms.
In parentheses to the right of each time in seconds, I also list the
relative performance index (RPI, in %) with respect to Prime95 v19 running on
a 400MHz Pentium II. The RPI is the ratio of the speed of code X running on
hardware Y, to the speed of Prime95 running on a 400MHz PII, for an exponent
p, and adjusted for any difference in the clock rate of the two CPUs being
compared. (For codes that allow a similar variety of FFT lengths and have
similar accuracy, we can use FFT length in place of exponent.) Since speed is
inversely related to per-iteration time, the RPI is defined as

              (time for Prime95 to test M(p) on 400MHz PII) * 400 MHz
RPI(p) = ----------------------------------------------------------------- x 100%
          (time for code X to test M(p) on CPU Y) * (clock rate of CPU Y)

Example 1: On a 400MHz Alpha 21164, Mlucas 2.7y at 384K takes .316 sec.
           Prime95 needs .211 sec on a 400MHz PII for the same FFT length.
           Since the two CPU clock rates are the same, Mlucas has an RPI of
           (.211/.316) x 100% or about 67%, meaning that on the 21164 at that
           runlength, it performs about two-thirds as well as Prime95 on PII.
Example 2: For Mlucas at the same FFT length on the 250 MHz R10K, we have
           a per-iteration time of .292 sec, similar to the 400MHz 21164, but
           the clock rate is lower, hence

                  .211 * 400
           RPI = ------------ x 100% = 116%,
                  .292 * 250

           meaning that on the MIPS, Mlucas performs somewhat better than
           Prime95 running on a PII with the same clock speed.
Also note that for the above examples, the accuracy of the code is similar
enough to Prime95 (pmax about 1-2% less at a given FFT length) to allow us
to just compare at equal FFT length. If code X is significantly less accurate
or allows just power-of-2 FFT lengths, we may have to compare different FFT
lengths in the above formula. This is the case for MacLucasUNIX, which allows
only powers of 2 and jumps to the next power-of-2 runlength much earlier than
Prime95 and Mlucas.
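
For anyone who wants to recompute RPI figures, here is a minimal Python
sketch of the definition above. The function and parameter names are my own
choices for illustration; only the formula and the input numbers come from
the two worked examples.

```python
def rpi(t_ref, t_x, clock_x_mhz, clock_ref_mhz=400.0):
    """Relative performance index, in percent.

    t_ref         -- per-iteration time of Prime95 on the 400MHz PII (sec)
    t_x           -- per-iteration time of code X on CPU Y (sec)
    clock_x_mhz   -- clock rate of CPU Y (MHz)
    clock_ref_mhz -- clock rate of the reference PII (400 MHz in this chart)
    """
    return 100.0 * (t_ref * clock_ref_mhz) / (t_x * clock_x_mhz)


# Example 1: Mlucas 2.7y at 384K on a 400MHz Alpha 21164.
example1 = rpi(0.211, 0.316, 400)

# Example 2: Mlucas at 384K on a 250MHz R10K.
example2 = rpi(0.211, 0.292, 250)

print(f"Example 1: {example1:.0f}%")  # prints "Example 1: 67%"
print(f"Example 2: {example2:.0f}%")  # prints "Example 2: 116%"
```

When the two codes have to run at different FFT lengths for the same
exponent, pass the time each code needs at its own FFT length.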
Abbreviations: MLU625 = MacLucasUNIX v6.25
               MLF    = MacLucasFFTW
               P95    = Prime95 v19
               n/a    = Length not available; must use next-higher power of 2.
Program, platform, cache sizes / per-iteration time in seconds (RPI in %)

        P95    Mlucas2.7y   MLU625       Mlucas2.7y   MLU625       Mlucas2.7y   MLU625
        PII    Alpha        Alpha        Alpha        Alpha        Alpha        Alpha
        400    21064/200    21064/200    21164/400    21164/400    21264/500    21264/500
L1:     8KB    ?            ?            8KB          8KB          64KB         64KB
L2:     512KB  ?            ?            512KB        512KB        4MB          4MB
length  ----   ----------   ----------   ----------   ----------   -----------  -----------
96K     .045   .127 (70%)   n/a          .057 (79%)   n/a          .025 (138%)  n/a
112K    .055   .155 (71%)   n/a          .070 (80%)   n/a          .031 (142%)  n/a
128K    .060   .172 (70%)   .312 (38%)   .077 (78%)   .098 (61%)   .034 (141%)  .036 (133%)
160K    .083   .223 (73%)   n/a          .099 (84%)   n/a          .044 (148%)  n/a
192K    .098   .272 (72%)   n/a          .120 (82%)   n/a          .052 (148%)  n/a
224K    .119   .345 (68%)   n/a          .146 (77%)   n/a          .064 (149%)  n/a
256K    .132   .370 (64%)   .679 (39%)   .161 (82%)   .220 (60%)   .069 (153%)  .078 (126%)
320K    .173   .544 (63%)   n/a          .251 (69%)   n/a          .090 (150%)  n/a
384K    .211   .695 (60%)   n/a          .316 (67%)   n/a          .107 (153%)  n/a
448K    .252   .880 (57%)   n/a          .417 (60%)   n/a          .132 (153%)  n/a
512K    .281   1.03 (55%)   1.45 (39%)   .472 (60%)   .459 (61%)   .146 (152%)  .178 (126%)
640K    .372   1.32 (56%)   n/a          .648 (57%)   n/a          .207 (138%)  n/a
768K    .453   1.60 (57%)   n/a          .782 (58%)   n/a          .257 (133%)  n/a
896K    .536   1.93 (56%)   n/a          .920 (58%)   n/a          .326 (128%)  n/a
1024K   .600   2.14 (56%)   3.01 (40%)   .990 (61%)   1.07 (56%)   .363 (127%)  .461 (104%)
1280K   .776   3.00 (52%)   n/a          1.35 (57%)   n/a          .480 (122%)  n/a
1536K   .934   3.66 (51%)   n/a          1.82 (51%)   n/a          .656 (108%)  n/a
1792K   1.11   4.46 (50%)   n/a          2.15 (52%)   n/a          .789 (110%)  n/a
2048K   1.23   4.85 (51%)   6.50 (38%)   2.36 (52%)   2.84 (43%)   .880 (111%)  1.36 (72%)
2560K   1.64   6.42 (51%)   n/a          3.20 (51%)   n/a          1.23 (110%)  n/a
3072K   1.99   7.73 (51%)   n/a          3.89 (51%)   n/a          1.48 (103%)  n/a
3584K   2.38   9.23 (52%)   n/a          4.57 (52%)   n/a          1.79 (105%)  n/a
4096K   2.60   10.2 (51%)   14.0 (37%)   5.02 (52%)   7.42 (35%)   2.01 (103%)  3.70 (56%)

TIMINGS SUMMARY: The only place MacLucasUNIX outperforms Mlucas is on the
21164 at FFT length 512K, where MLU seems to benefit from a fortuitous cache
alignment - at lengths greater than this, things deteriorate rapidly.
ACCURACY SUMMARY: Here are the FFT length/exponent breakpoints for the
three fastest codes. Prime95 is best, since it is able to take advantage of
the x86 80-bit floating-point register format. Mlucas is close behind, with a
pmax just 1-2% lower than Prime95 at each runlength. MacLucasUNIX is the worst
of the lot, even when compiled (on Alpha Unix) with -assume accuracy_sensitive
to prevent the compiler from overaggressive reordering of floating-point
operations. Note that if you're compiling using -fast you MUST also use the
above -assume flag; otherwise Mlucas won't run, and MacLucasUNIX won't be able
to do round-off checking, i.e. it has no way of telling whether the FFT length
it is using is appropriate for the number under test.

        Maximum exponent (millions)
        Prime95  Mlucas2.7  MLU625
length  -------  ---------  ------
96K     1.990    1.983      n/a
112K    2.323    2.310      n/a
128K    2.656    2.610      ~2.38
160K    3.290    3.260      n/a
192K    3.935    3.910      n/a
224K    4.598    4.550      n/a
256K    5.250    5.160      ~4.98
320K    6.515    6.420      n/a
384K    7.730    7.700      n/a
448K    9.020    8.950      n/a
512K    10.32    10.20      ~9.3
640K    12.83    12.70      n/a
768K    15.27    15.20      n/a
896K    17.85    17.60      n/a
1024K   20.40    20.10      ~18.8
1280K   25.33    25.00      n/a
1536K   30.10    29.80      n/a
1792K   35.10    34.70      n/a
2048K   40.25    39.40      ~37
2560K   49.90    49.10      n/a
3072K   59.40    59.10      n/a
3584K   69.00    68.50      n/a
4096K   79.30    78.00      < 75

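
To see how these breakpoints decide which FFT length each code must use for
a given exponent, here is a small Python sketch. The dictionary holds only
the 256K-1024K rows of the table above, the MLU625 entries are the
approximate (~) figures, and the names PMAX and smallest_fft_length are
hypothetical ones chosen for this illustration.

```python
# Maximum exponent (in millions) per FFT length (in K), copied from the
# 256K-1024K rows of the table. MLU625 values are approximate.
PMAX = {
    "Prime95": {256: 5.250, 320: 6.515, 384: 7.730, 448: 9.020, 512: 10.32,
                640: 12.83, 768: 15.27, 896: 17.85, 1024: 20.40},
    "Mlucas":  {256: 5.160, 320: 6.420, 384: 7.700, 448: 8.950, 512: 10.20,
                640: 12.70, 768: 15.20, 896: 17.60, 1024: 20.10},
    "MLU625":  {256: 4.98, 512: 9.3, 1024: 18.8},
}


def smallest_fft_length(code, p_millions):
    """Return the smallest tabulated FFT length (in K) whose maximum
    exponent is at least p_millions, or None if none is large enough."""
    for length in sorted(PMAX[code]):
        if p_millions <= PMAX[code][length]:
            return length
    return None


# An exponent near 9.5 million fits in 512K for Prime95 and Mlucas,
# but MLU625 has to move up to 1024K.
for code in ("Prime95", "Mlucas", "MLU625"):
    print(code, smallest_fft_length(code, 9.5))
```

For that exponent on the 21164/400, the timing chart gives .472 sec for
Mlucas at 512K against 1.07 sec for MLU625 at 1024K.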
MEMORY USAGE: Prime95 and Mlucas need little storage besides the LL residue
itself, i.e. (runlength x 8 bytes) + perhaps 10% extra for FFT sincos and
DWT weights tables and bit-reversal index arrays. MacLucasUNIX, on the other
hand, is a memory hog - at 4096K it needs a whopping 244MB, compared to just
33MB for Prime95 and Mlucas.
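
The memory estimate above is easy to reproduce. The Python sketch below
assumes that "K" means 1024 and that 1 MB means 2**20 bytes, and treats the
"perhaps 10%" overhead as an adjustable parameter; the function names are
hypothetical.

```python
def residue_bytes(runlength_k):
    """Size of the LL residue array: runlength (in K) 64-bit floats."""
    return runlength_k * 1024 * 8


def estimated_mb(runlength_k, overhead=0.10):
    """Residue plus a fractional overhead for sincos/weights/index tables,
    in MB (assumed here to be 2**20 bytes)."""
    return residue_bytes(runlength_k) * (1.0 + overhead) / 2**20


bare = estimated_mb(4096, overhead=0.0)   # residue only: 32 MB
padded = estimated_mb(4096)               # with the rough 10% extra
print(f"{bare:.1f} MB bare, {padded:.1f} MB with 10% overhead")
```

The quoted 33MB falls between the bare residue and the 10% estimate, and
MacLucasUNIX's 244MB is more than seven times the bare residue size.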