Tony Gott <[EMAIL PROTECTED]> wrote:

> How does the new version 2.8c compare with Prime95 beta-version 21.4 as far
> as benchmarks is concerned?
I can't speak to Glucas' performance on the Itanium until Guillermo sends me the precise numbers, but I suspect it will be somewhat faster even than the Prime95 code for the P4, since the Itanium has better floating-point functionality: whereas the P4 (and most of the high-end RISC platforms) can do at most one floating add and one floating multiply per cycle, the Itanium can do any two of the following per cycle:

    floating add
    floating mul
    fused floating mul/add

That means that under ideal circumstances (an equal number of fadds and fmuls, no data dependencies to worry about) the Itanium could do twice as much floating-point arithmetic per cycle as the P4 or (say) the Alpha 21264. FFT code is of course far from ideal in these terms, as it tends to be dominated by floating adds, but simply the ability to do two of these per cycle could speed things up by 50% or more over a CPU which can do just one add per cycle.

As for the Alpha (which is now apparently the second-fastest RISC platform for Mersenne testing), my timings of the soon-to-be-released Mlucas 2.7c and Guillermo's timings of Glucas 2.8 indicate that on the 21264 the per-cycle performance is similar to that of Prime95 on the P4; on the older 21164, the performance is close to that of Prime95 on the P3. In a multiprocessor setup one will get significantly better performance than for Prime95 running on a multi-CPU setup, since the RISCs have much better bus bandwidth - I see no slowdown even when running an LL test on each CPU of a quad-processor Alpha; Prime95 would slow down by 20-30% under similar circumstances.

> It would be just fantastic if the full power of the Altivec in Apple G4
> could be utilised as it has been in the RC5 distributed project. Is this
> ever likely?

Indeed it would be wonderful, but don't get your hopes up too high - as Jason Papadopoulos mentioned, the AltiVec is designed for multimedia performance, which leads to rather different functionality than one needs for large-integer work.
I still hold out hope that a mixed floating-integer approach (doing a floating FFT side-by-side with a modular integer transform, then combining the results to allow for more significant bits per convolution output digit) may someday allow both the PowerPC core and the AltiVec unit to work together to speed LL testing, but it would require some very intricate coding even if it did prove feasible. And of course, it remains to be seen whether one can achieve any gains this way even on processors which have truly awesome integer capabilities, such as the Alpha 21264 and the Itanium - besides the fact that the modular transform needs about twice as many operations as the floating one (one needs to do a lot of double-width multiplies and modding of intermediate results), there are highly nontrivial algorithmic issues still to be solved.

As an example of the latter, consider the case where one does a floating FFT side-by-side with a modular transform over the gaussian (complex) integers modulo some small Mersenne prime (technically one is operating over the Galois field GF(q^2), where q is a Mersenne prime). These fields have a lot of very nice properties, such as the fact that a modular transform looks very much like the floating FFT, and arithmetic modulo a Mersenne prime can be done very efficiently via simple masks, shifts and adds. For power-of-two transform lengths I've actually implemented such a scheme and got it working - using GF(M61^2) for the modular math I was able to get roughly 50 bits per input digit, compared to around 19 for pure floating-point code. But the compiler I was using did a poor job of interleaving floating and integer operations, and I suspect that this may be a problem generally, i.e. that most compilers are optimized for code dominated by either floats or ints, but not both. But let's assume we find a compiler that is really good at keeping both the FPU and the integer units of our target CPU busy.
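To make the "masks, shifts and adds" remark concrete, here is a hedged C sketch (invented names, not code from Mlucas or Glucas) of arithmetic modulo M61 = 2^61 - 1 and of a multiply in GF(M61^2). Since 2^61 == 1 (mod M61), reduction is just folding the high bits back onto the low ones; and a GF(M61^2) multiply has exactly the shape of a complex floating-point multiply, which is why the modular transform can mirror the floating FFT:

```c
#include <stdint.h>

/* Assumes a gcc/clang-style compiler with unsigned __int128. */
#define M61 (((uint64_t)1 << 61) - 1)

/* Fold a 64-bit value into [0, M61): 2^61 wraps around to 1. */
static uint64_t reduce61(uint64_t x)
{
    x = (x & M61) + (x >> 61);
    return x >= M61 ? x - M61 : x;
}

/* Inputs to the following are assumed already reduced into [0, M61). */
static uint64_t add_mod61(uint64_t a, uint64_t b) { return reduce61(a + b); }

static uint64_t sub_mod61(uint64_t a, uint64_t b)
{
    return a >= b ? a - b : a + M61 - b;
}

/* Full 61x61-bit multiply: split the 122-bit product at bit 61, fold. */
static uint64_t mul_mod61(uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    uint64_t lo = (uint64_t)p & M61;    /* p mod 2^61 */
    uint64_t hi = (uint64_t)(p >> 61);  /* p div 2^61, itself < 2^61 */
    return reduce61(lo + reduce61(hi));
}

/* One multiply over the gaussian integers mod M61, i.e. in GF(M61^2):
   (a0 + i*a1)*(b0 + i*b1) - structurally a complex multiply. */
static void gmul61(uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1,
                   uint64_t *c0, uint64_t *c1)
{
    *c0 = sub_mod61(mul_mod61(a0, b0), mul_mod61(a1, b1));
    *c1 = add_mod61(mul_mod61(a0, b1), mul_mod61(a1, b0));
}
```

(Note GF(q^2) exists as the gaussian integers mod q because M61 == 3 mod 4, so x^2 + 1 is irreducible mod M61.)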
Next problem: for non-power-of-two runlengths, the outputs of the forward integer transform will occur in a fundamentally different ordering than those of the FFT, i.e. this wonderful idea of the modular transform mirroring the floating FFT is out the window when it comes to the dyadic-squaring phase between the forward and inverse transforms. Assuming that this can be overcome, next problem: the normalization and carry-propagation stage is extremely difficult to implement efficiently for such a scheme, since one can't simply round each floating output; rather, one first has to use the corresponding modular output to correct the bottom log2(q)/2 bits of the floating output, then do the normalization, then get the normalized modular data back from that, and so forth. It won't do to spend 30% of one's time in the carry phase if that wipes out any gains one gets from the mixed-transform approach.

So there's lots to be done yet even to evaluate whether such a scheme is feasible (in the sense that it can be made to run fast) on *any* platform, to say nothing of one whose integer capabilities were devised with 8, 16 and 32-bit multimedia applications in mind. I remain optimistic, but I've seen first-hand the coding and algorithmic work that is required, and it's no small amount. If it's any consolation, the pure floating-point codes non-x86 clients can use have improved by more than a factor of 2 in speed in the past 2 years, so it's not as if the conventional FFT-based approach has stalled out in terms of efficiency.

Cheers,
-Ernst
_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
