Tony Gott <[EMAIL PROTECTED]> wrote:

> How does the new version 2.8c compare with Prime95 beta-version 21.4 as far
> as benchmarks is concerned?

I can't speak to Glucas' performance on the Itanium
until Guillermo sends me the precise numbers, but I
suspect it will be somewhat faster even than the Prime95
code for the P4, since the Itanium has better floating-
point functionality: whereas the P4 (and most of the
high-end RISC platforms) can do at most one floating add
and one floating multiply per cycle, the Itanium can
do any two of the following per cycle:

floating add
floating mul
fused floating mul/add.

That means that under ideal circumstances (equal number
of fadd and fmul, no data dependencies to worry about)
the Itanium could do twice as much floating-point
arithmetic per cycle as the P4 or (say) the Alpha 21264.
FFT code is of course far from ideal in these terms, as
it tends to be dominated by floating adds, but simply
the ability to do two of these per cycle could speed
things up by 50% or more over a CPU which can do just
one add per cycle.
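To make the issue-width argument concrete, here is a toy
cycle-count model (a sketch under simplified assumptions -
it ignores latencies, dependencies and memory traffic, and
the operation counts are just the textbook figures for one
radix-4 complex butterfly, not from any actual client):

```python
import math

# Toy issue-width model: counts only floating-point issue
# slots, ignoring latency, scheduling and memory effects.

def cycles_p4_style(adds, muls):
    # One FP-add pipe plus one FP-mul pipe: the dominant
    # operation type is the bottleneck.
    return max(adds, muls)

def cycles_itanium_style(adds, muls):
    # Two FP slots per cycle, each able to take an add, a
    # mul, or a fused mul/add, so any mix of ops pairs up.
    return math.ceil((adds + muls) / 2)

# Textbook real-operation counts for one radix-4 complex
# butterfly: 22 adds, 12 multiplies (add-dominated, as FFT
# code tends to be).
adds, muls = 22, 12
print(cycles_p4_style(adds, muls))       # 22 cycles
print(cycles_itanium_style(adds, muls))  # 17 cycles, ~1.3x

# A purely add-bound stretch of code shows the 2x ceiling:
print(cycles_p4_style(22, 0) / cycles_itanium_style(22, 0))
```

Under this crude model the add-heavy butterfly already gains
about 30%, and an add-only stretch of code gains the full
factor of 2, consistent with the "50% or more" estimate.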

As for the Alpha (which is now apparently the 
second-fastest RISC platform for Mersenne testing),
my timings of the soon-to-be-released Mlucas 2.7c and 
Guillermo's timings of Mlucas 2.8 indicate that on the
21264 the per-cycle performance is similar to Prime95
on the P4; on the older 21164, the performance is close
to Prime95 on the P3. In a multiprocessor setup one
will get significantly better performance than for
Prime95 running on a multi-CPU box, since the RISCs
have much better bus bandwidth - I see no slowdown even
when running an LL test on each CPU of a quad-processor
Alpha; Prime95 would slow down by 20-30% under similar
circumstances.

> It would be just fantastic if the full power of the Altivec in Apple G4
> could be utilised as it has been in the RC5 distributed project. Is this
> ever likely?

Indeed it would be wonderful, but don't get your hopes up
too high - as Jason Papadopoulos mentioned, the AltiVec
is designed for multimedia performance, which leads to
rather different functionality than one needs for
large-integer work. I still hold out hope that a mixed
floating-integer approach (doing a floating FFT side-by-
side with a modular integer transform, then combining
the results to allow for more significant bits per
convolution output digit) may someday allow both the
PowerPC core and AltiVec unit to work together to speed
LL testing, but it would require some very intricate
coding even if it did prove feasible. And of course,
it remains to be seen whether one can achieve any gains
this way even on processors which have truly awesome
integer capabilities, such as the Alpha 21264 and the
Itanium - besides the fact that the modular transform
needs about twice as many operations as the floating
one (one needs to do a lot of double-width multiplies
and modding of intermediate results), there are highly
nontrivial algorithmic issues still to be solved. As
an example of the latter, consider the case where one
does a floating FFT side-by-side with a modular 
transform over the gaussian (complex) integers modulo
some small Mersenne prime (technically one is operating
over the Galois field GF(q^2), where q is a Mersenne
prime.) These fields have a lot of very nice properties,
such as the fact that a modular transform looks very
much like the floating FFT, and arithmetic modulo a
Mersenne prime can be done very efficiently via simple
masks, shifts and adds. For power-of-two transform
lengths I've actually implemented such a scheme and
got it working - using GF(M61^2) for the modular math
I was able to get roughly 50 bits per input digit,
compared to around 19 for pure floating-point code.
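For readers curious what "simple masks, shifts and adds"
means in practice, here is a minimal sketch of reduction
modulo M61 = 2^61 - 1 (illustrative Python, not the actual
Mlucas code; the trick is that 2^61 = 1 mod M61, so the
high bits simply fold back in with an add):

```python
M61 = (1 << 61) - 1   # the Mersenne prime 2^61 - 1

def mod_m61(x):
    # Since 2^61 = 1 (mod M61), the bits above position 61
    # can be shifted down and added back in; a couple of
    # passes suffice even for a full 122-bit product.
    while x >> 61:
        x = (x & M61) + (x >> 61)
    return 0 if x == M61 else x

# A 61-bit by 61-bit modular multiply then needs no
# division at all:
def mul_m61(a, b):
    return mod_m61(a * b)
```

The same fold works for any Mersenne modulus, which is why
these fields are so attractive for the modular half of a
mixed transform.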
But the compiler I was using did a poor job interleaving
floating and integer operations, and I suspect that this
may be a problem generally, i.e. that most compilers
are optimized for code dominated by either floats or
ints, but not both. But let's assume we find a
compiler that is really good at keeping both the FPU
and integer units of our target CPU busy. Next problem:
for non-power-of-two runlengths, the outputs of the
forward integer transform will occur in a fundamentally
different ordering than for the FFT, i.e. this wonderful
idea of the modular transform mirroring the floating FFT
is out the window when it comes to the dyadic-squaring
phase between the forward and inverse transforms.
Assuming that this can be overcome, next problem: the
normalization and carry propagation stage is extremely
difficult to implement efficiently for such a scheme,
since one can't simply round each floating output;
rather, one first has to use the corresponding modular
output to correct the bottom log2(q)/2 bits of the
floating output, then do the normalization, then
regenerate the normalized floating and modular data from that,
and so forth. It won't do to spend 30% of one's time
in the carry phase if that wipes out any gains one gets
from the mixed-transform approach. So there's lots to
be done yet even to evaluate whether such a scheme is
feasible (in the sense that it can be made to run fast)
on *any* platform, to say nothing of one whose integer
capabilities were devised with 8-, 16- and 32-bit
multimedia applications in mind. I remain optimistic,
but I've seen first-hand the coding and algorithmic
work that is required, and it's no small amount.
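As a toy illustration of that correction step (purely
hypothetical code, nothing like the form it would take in a
real client): if the floating convolution output
approximates the exact integer output v to within M61/2,
and the modular transform supplies v mod M61, the two can
be combined to recover v exactly:

```python
M61 = (1 << 61) - 1

def recover_exact(f_approx, residue):
    # f_approx: floating approximation of the true integer
    #           output v, with absolute error < M61/2.
    # residue:  v mod M61, from the modular transform.
    # Writing v = k*M61 + residue, the quotient k can be
    # read off from the float despite its rounding error.
    k = round((f_approx - residue) / M61)
    return k * M61 + residue

# e.g. a ~64-bit exact value whose float image has lost
# its low bits:
v = 3**40
assert float(v) != v                      # rounding occurred
assert recover_exact(float(v), v % M61) == v
```

The point of the sketch is only that the correction is
cheap per digit; the hard part the text describes is doing
this inside the normalization/carry loop without the extra
passes eating the gains.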

If it's any consolation, the pure floating-point codes
non-x86 clients can use have improved by more than a
factor of 2 in speed in the past 2 years, so it's not
like the conventional FFT-based approach has stalled
out in terms of efficiency.

Cheers,
-Ernst

_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
