Jason Papadopoulos wrote:

> Moving to IA64 will be a much bigger challenge than simply rewriting
> half a meg of assembly language.

That's why I'm hoping the SGI open source compilers for IA-64 are decent
(see http://oss.sgi.com/projects/Pro64/ );
at least in the near-term one could then simply compile one of the HLLLL
(High-level-language Lucas-Lehmer; how's that for a tortured acronym? :)
codes on the IA-64 and hopefully get performance at least as good as
on the Alpha 21264. For instance Mlucas is already geared toward register-
rich architectures (hey, I'm a greedy guy), and I've also written some
extra radix-32 FFT-pass routines that yield marginal benefit on the 21264
(fewer passes through the data appear to be offset by slowdowns due to
register pressure), but which should incur little or no register
pressure on the IA-64. Guillermo Ballester Valor's recently announced
Ylucas code is quite similar, so we can try out both the SGI Fortran-90
and C compilers for IA-64 under Linux.
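
To put a number on "fewer passes through the data": a rough sketch in C, assuming a power-of-two FFT length and a single fixed radix for every pass (a simplification; real codes mix radices, and fft_passes is a made-up helper name, not anything from Mlucas or Ylucas).

```c
/* Number of passes over the data for an FFT of length 2^log2_n
 * when every pass uses radix 2^log2_radix.
 * Assumption: one fixed radix throughout, so this is just
 * ceil(log2_n / log2_radix). */
int fft_passes(int log2_n, int log2_radix)
{
    return (log2_n + log2_radix - 1) / log2_radix;
}
```

For a length-2^20 transform this gives 7 passes at radix 8, 5 at radix 16 and 4 at radix 32, so the saving from radix 32 is one pass out of five, which is easy to lose again to register spills on a 32-register machine.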

> Can you rearrange a real-valued FFT
> to use multiply-adds as much as possible? It could cut the operation count
> in half if you do, but to my knowledge no one has yet done so.

Note that the 21264 has no multiply/add instruction and can do at most
one FADD and one FMUL per cycle, but even so, Mlucas on the 21264 gets
nearly double the performance of Prime95 on the PII. Thus one could
legitimately hope for performance that is better still on the IA-64, even
without major restructuring of the source code to take advantage of the
MADD instruction (or whatever it's called on the IA-64; MADD is the MIPS
mnemonic). Assume the complex twiddles phase at the start of each FFT-pass
data block processing runs no faster with MADD than without (and this is
pessimistic). Since the rest of a typical higher-radix FFT pass is dominated
by adds, being able to do 2 of these per cycle (even without throwing in
an extra multiply) should still produce a significant speedup.
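
To make the multiply-add question concrete, here is a minimal sketch in C of a single radix-2 butterfly with a twiddle multiply, written once the plain way and once so that every operation has the shape a*b + c. It is only an illustration: the names cplx, madd, butterfly_plain and butterfly_madd are made up, this is not how Mlucas or Prime95 structure their passes, and whether the compiler turns madd into one instruction depends on the compiler and the target.

```c
typedef struct { double re, im; } cplx;

/* Stand-in for a hardware multiply-add: a*b + c.
 * Assumption: the compiler maps this onto a single instruction
 * where the hardware has one; nothing here forces it to. */
double madd(double a, double b, double c) { return a * b + c; }

/* Plain form of x = a + w*b, y = a - w*b:
 * 4 multiplies and 6 adds/subtracts, 10 operations in all. */
void butterfly_plain(cplx a, cplx b, cplx w, cplx *x, cplx *y)
{
    double tr = w.re * b.re - w.im * b.im;  /* Re(w*b) */
    double ti = w.re * b.im + w.im * b.re;  /* Im(w*b) */
    x->re = a.re + tr;  x->im = a.im + ti;
    y->re = a.re - tr;  y->im = a.im - ti;
}

/* Same result expressed as 6 multiply-adds.
 * x folds the twiddle product directly into the add with a;
 * y uses the identity a - w*b = 2*a - (a + w*b) = 2*a - x. */
void butterfly_madd(cplx a, cplx b, cplx w, cplx *x, cplx *y)
{
    x->re = madd(w.re, b.re, madd(-w.im, b.im, a.re));
    x->im = madd(w.re, b.im, madd( w.im, b.re, a.im));
    y->re = madd(2.0, a.re, -x->re);
    y->im = madd(2.0, a.im, -x->im);
}
```

Counting each madd as one operation (and ignoring the sign flips, which multiply-add hardware typically absorbs into negated forms of the instruction), the second version needs 6 operations where the first needs 10. The catch is that the two outputs are no longer independent, which lengthens the dependency chain.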

I think we can gain a good idea of the potential performance from looking
at the IBM Power3 processor, which similarly can do any combination of 2
FADD/FMUL/MADD per cycle - a couple of months ago Nick Geovanis compiled
Mlucas (with no hardware-specific tweaks whatsoever) on such a machine
and got 40-50% better per-cycle performance than on the Alpha 21264
(although the latter still beats the Power3 in terms of runtime due
to its much higher clock rate). That translates to roughly 2.5-2.7 times
the per-cycle performance of Prime95 on the PII, which ain't bad at
all. Again, the real question is how good the SGI compiler will be,
but those folks have a great track record when it comes to optimizing
compilers.
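
The 2.5-2.7 figure is just two ratios chained together, which can be checked with a few lines of C. This is a sketch under one loud assumption: "nearly double" for Mlucas on the 21264 versus Prime95 on the PII is taken to mean a per-cycle factor of about 1.8, a value back-solved from the quoted range rather than a measurement. chain_ratio is a made-up helper name.

```c
/* Chain two performance ratios: (A relative to B) * (B relative to C)
 * gives A relative to C. */
double chain_ratio(double a_over_b, double b_over_c)
{
    return a_over_b * b_over_c;
}

/* ASSUMED value: "nearly double", read as a per-cycle factor of 1.8
 * for Mlucas/21264 relative to Prime95/PII. */
#define ALPHA_OVER_PII 1.8

/* Power3 relative to PII, for the 40% and 50% ends of the range. */
double power3_over_pii_low(void)  { return chain_ratio(1.4, ALPHA_OVER_PII); }
double power3_over_pii_high(void) { return chain_ratio(1.5, ALPHA_OVER_PII); }
```

With the assumed 1.8, the low end works out to 1.4 * 1.8 = 2.52 and the high end to 1.5 * 1.8 = 2.7, matching the range quoted above.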

Does anyone know when the first IA-64 systems will start shipping for
the commercial market? Does anyone here know of a way of getting time
on a beta system (an actual system, not a simulator) before then?

Cheers,
-Ernst

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
