>
> From: Nick Craig-Wood <[EMAIL PROTECTED]>
> Date: Mon, 26 Oct 1998 23:03:55 +0000 (GMT)
> Subject: Re: Mersenne: AMD K7 will
>
> Foghorn Leghorn wrote:
>
> > > The Playstation doesn't do floating point. It would be a very slow
> > > platform for running GIMPS. :)
> >
> > Question: does Prime95 use floating point because the algorithm requires
> > it, or because something about the Intel architecture makes it a good
> > choice?
>
> The algorithm doesn't require it but on most processors it is quickest to do
> a floating point discrete weighted transform.
>
> That said I've written (with quite a bit of help from Peter LM) a DWT for
> StrongARM which is runs without use of floating point (StrongARM doesn't have
> an FPU).
>
> The individual computations are a bit more expensive because of the need to
> work modulo a 64 bit prime so this makes the StrongARM routine about 1/3 of
> the speed of an equivalently clocked Pentium.
>
> An Integer DWT has a few advantages though, namely because it doesn't have
> round off errors it can pack a few more bits in and use slightly smaller FFT
> sizes.
>
<snip>
This is very interesting. I did some experiments using q = 2^64 - 2^32 +
1 to see how a multiply mod q would stack up against a complex floating
point multiply, and I found that on a 266Mhz Pentium II the
multiplication mod q was about 15% faster. Unfortunately, I have lost
both the code and the results (my laptop was stolen), but they are
easily reconstructed. I didn't follow up on it because I couldn't see
how to integrate the DWT with an integer FFT, but if you have found a
way, then maybe it is in fact possible to implement an integer version
of LL which is competitive with prime95.
The particular form of q is what makes reduction mod q efficient. When
you use a 32x32 multiply to implement a 64-bit multiply, the result is
of the form a*2^96 + b*2^64 + c*2^32 + d, then since 2^96 = 1 mod q and
2^64 = 2^32 - 1 mod q, the result can be reduced mod q to (b+c)*2^32 +
(a+d-b), which requires only 32-bit addition and careful attention to
the carry bit. The 64-bit multiply itself requires 3 32-bit multiplies,
and by comparison a complex floating point multiply requires 3 floating
point multiplies. The floating point multiply on the Pentium is
(curiously) faster than the 32-bit integer multiply, but the total time
involved in multiplication is a relatively small percentage of the total
time required, since both multiplies are quite fast.
It is of course fortuitous that q is prime.
Could you share the knowledge of how to adapt the DWT to an integer FFT?
Regards,
Bill