On Mon, 12 Mar 2001 [EMAIL PROTECTED] wrote:

> version, where the near-term payoff was much surer. Now
> that I've squeezed out about as much as I think can
> reasonably be done from the floating-point version, I've
> been thinking about the modular stuff again. Your
> experience with all-modular convolution on the 21264
> (within a factor of 2 of the speed of a floating-point
> FFT, I believe you said) gives me hope that a pure-integer
> transform over a field with nice arithmetic properties (such as
> the complex integers GF(q^2) as you mentioned, and
> choosing q a Mersenne prime makes modular arithmetic a
> lot nicer) can compete with floating-point transform on
> hardware with good integer support (multiply is the key
> in most instances.) What happened to me, as you noted,
> was a compiler which was awful at combining floating and
> integer code.

Well, here's how the timings stack up at present. The Mlucas column
gives optimal times for v2.7b; the ICL column gives the best times I can
manage for an all-integer Mersenne-mod squaring. Both runs were on an
otherwise idle DS10 Alphaserver with a 466MHz 21264 processor, 2MB of L2
cache and 128MB RAM on an 83MHz FSB.

        Mlucas  ICL    (times in sec)
-----------------------
256k    .0737   .161
512k    .1721   .339
1024k   .3786   .714

The integer convolution uses a 61-bit prime of no special form. Since the
last version of the integer code I posted, I've tried a few tricks that
don't help much, and before I got buried at work I was working on
high-performance CRT code that may or may not shave off a little time.
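
For comparison's sake, here is a minimal C sketch (my own illustration, not
the actual ICL code, which would be in assembly) of what a single modmul
with a general 61-bit prime looks like using gcc's 128-bit integer
extension. The point is that the reduction step needs a full division (or
a Barrett/Montgomery substitute), which is exactly what a prime of special
form avoids:

```c
#include <stdint.h>

typedef uint64_t u64;
typedef __uint128_t u128;   /* gcc extension; Compaq C lacks it */

/* Modular multiply for a general prime p of up to ~63 bits.
   The 128-bit product is cheap; the % is the expensive part. */
static u64 mulmod_p(u64 a, u64 b, u64 p)
{
    return (u64)(((u128)a * b) % p);
}
```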

I also considered an M61 convolution, and have some figures that may
help. For a schedule of instructions that includes loads and stores, the
21264 is supposedly capable of a sustained 3.25 integer operations per
clock. I've actually found that when you mix together multiplies, loads
and stores the best you can do is about 2.7 integer instructions per
clock. The FPU can issue 2 operations per clock, but when you throw in
FPU loads (which use the integer units) the maximum issue rate is 2.9
register writes per clock. With FPU stores thrown in the maximum
sustainable issue rate goes down to 2.3 FPU instructions per clock. 

I have vector-M61-complex-multiply code on paper that can theoretically
complete a complex multiply in 11-12 clocks (you need a 3-stage
pipeline), and as I remember it issues 2.6 instructions per clock. It
should be possible to reduce the integer issue rate slightly and fill all
the remaining gaps with FPU instructions, and have the two computations
finish in about 1.5x the integer-only time. ICL uses vector multiply code
that takes 6 clocks/modmul, so there is no time savings on the integer
side. However, a split-radix or radix-8 FGT will cut the number of complex
multiplies by 20-30%, and you can expect to see similar speedups with the
convolution itself.
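
To illustrate why M61 is attractive here: the reduction mod 2^61-1 needs
only shifts and adds, no division. Below is a minimal C sketch (again my
own illustration under the i^2 = -1 convention, which is valid since
M61 = 3 mod 4; function names are mine) of a 3-multiply complex modmul
over GF(M61^2):

```c
#include <stdint.h>

typedef uint64_t u64;
typedef __uint128_t u128;           /* gcc extension */

#define M61 ((1ULL << 61) - 1)      /* the Mersenne prime 2^61 - 1 */

/* Reduce the 128-bit product by folding the high bits back in:
   2^61 = 1 (mod M61), so (hi*2^61 + lo) = hi + lo (mod M61). */
static u64 mulmod61(u64 a, u64 b)
{
    u128 t = (u128)a * b;                       /* t < (2^61 - 1)^2  */
    u64 r = ((u64)t & M61) + (u64)(t >> 61);    /* r < 2^62          */
    r = (r & M61) + (r >> 61);                  /* r <= 2^61         */
    return r >= M61 ? r - M61 : r;
}

static u64 addmod61(u64 a, u64 b) { u64 r = a + b; return r >= M61 ? r - M61 : r; }
static u64 submod61(u64 a, u64 b) { return a >= b ? a - b : a + M61 - b; }

/* (ar + ai*i)*(br + bi*i) over GF(M61^2) with i^2 = -1,
   using the 3-multiply Karatsuba arrangement. */
static void cmulmod61(u64 ar, u64 ai, u64 br, u64 bi, u64 *cr, u64 *ci)
{
    u64 t1 = mulmod61(ar, br);
    u64 t2 = mulmod61(ai, bi);
    u64 t3 = mulmod61(addmod61(ar, ai), addmod61(br, bi));
    *cr = submod61(t1, t2);                 /* ar*br - ai*bi */
    *ci = submod61(submod61(t3, t1), t2);   /* ar*bi + ai*br */
}
```

Three multiplies plus a handful of adds per complex product is what makes
the 11-12 clock figure above plausible on paper.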

I suspect that any butterfly loop or computation that involves complex
multiplies is going to need *major* assembly-language work to get optimal
performance. Heck, Compaq C for Linux doesn't use 128-bit long longs, and
non-bleeding-edge gcc does not perform 21264 optimizations.

Hope this helps,
jasonp

_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers