Olivier Langlois comments (in response to others):
> First,
> let me thank you all for your helpful explanations.
>
> > > Could someone explain the reasons behind the 2 conditions (q/N < 32 and
> > > maximum convolution error < 0.5 for every Lucas-Lehmer iteration)?
> > > (BTW, sorry for my ignorance, but what does maximum convolution error
> > > mean?)
> >
> > The fastest way to multiply numbers is to use FFTs, and the fastest way to
> > do FFTs is with floating point math.
>
> Is this claim 100% true ?
> Has anyone tried to perform FFTs with MMX and fixed-point variables?
> The only reason I can see that MMX wouldn't be a viable solution is that
> Crandall's algorithm requires 'double' (or 'float', I don't remember which one
> is currently used) precision.
>
> Any comments ?
Floating point arithmetic is faster than integer arithmetic
on many machines today, if one does comparable numbers of adds/subtracts
and multiplies, as in matrix multiplication and FFT algorithms.
The FFT of order n uses a primitive n-th root of unity.
The Crandall trick needs an n-th root of 2. Both of these
are available if we use complex numbers with floating point math.
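To illustrate the quantity asked about above, here is a sketch (plain complex FFT, not Crandall's weighted transform) of squaring a number via floating-point FFTs. The "maximum convolution error" is the largest distance of any raw output coefficient from the nearest integer; if it ever reached 0.5, some coefficient could round to the wrong value and the result would be garbage.

```python
import numpy as np

def fft_square(digits, base):
    """Square a number given by its base-`base` digits (least significant
    first) using a plain floating-point FFT.  Returns the digits of the
    square and the maximum convolution error: the largest distance of any
    raw coefficient from the nearest integer."""
    n = len(digits)
    # Zero-pad to length 2n so the cyclic convolution equals the full product.
    a = np.fft.fft(np.asarray(digits, dtype=float), 2 * n)
    raw = np.fft.ifft(a * a).real          # convolution of the digit vector with itself
    coeffs = np.rint(raw).astype(np.int64)
    max_err = float(np.max(np.abs(raw - coeffs)))
    # Propagate carries to get proper base-`base` digits again.
    out, carry = [], 0
    for c in coeffs:
        c += carry
        out.append(int(c % base))
        carry = int(c // base)
    while carry:
        out.append(carry % base)
        carry //= base
    return out, max_err
```

For a toy input like the digits of 123 in base 10, the error is far below 0.5; at the multimillion-digit sizes used in Lucas-Lehmer tests, keeping that error under 0.5 is exactly what limits how many bits can be packed into each coefficient.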
Computer manufacturers are paying more attention to
fast integer multiplication now that RSA and other cryptographic
algorithms are gaining prominence. On 64-bit machines with fast
integer arithmetic, modular arithmetic is a good alternative choice,
if the modulus is chosen so the requisite roots of unity and of 2 exist.
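A toy illustration of that modulus condition (not the 63-bit prime an actual code would use): with p = 257 and transform length n = 8, n divides p - 1, so a primitive n-th root of unity exists mod p, and 2 happens to be an n-th power residue mod 257, so the n-th root of 2 needed for the Crandall trick exists too. A brute-force search, sensible only for tiny p, finds both:

```python
def find_ntt_roots(p, n):
    """Brute-force search (only for tiny p) for the two roots a modular
    transform needs: a primitive n-th root of unity mod p, and an n-th
    root of 2 mod p.  Assumes n is a power of two."""
    assert (p - 1) % n == 0, "need n | p - 1 for an n-th root of unity"
    omega = next(w for w in range(2, p)
                 if pow(w, n, p) == 1 and pow(w, n // 2, p) != 1)
    root2 = next(x for x in range(2, p) if pow(x, n, p) == 2)
    return omega, root2
```

For an arbitrary prime p with n | p - 1, the root of unity always exists but the n-th root of 2 need not, which is why the modulus must be chosen with both requirements in mind.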
Ernst Mayer and I are (slowly) writing code
which will use _both_ floating point and modular arithmetic.
If the floating point arithmetic provides 40 bits of precision
in an output coefficient (using 53-bit mantissa) and
a 63-bit prime yields 62 bits of precision in these coefficients,
then we can accurately recover output coefficients up to 102 bits,
and won't need as long a convolution. We expect the code to
be very fast on the new DEC Alpha 21264, a superscalar machine
which can start one integer multiply, one floating add/subtract,
and one floating multiply every clock cycle.
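The way the two channels might combine can be sketched as follows (my reading of the scheme above, not their actual code): the modular channel gives each output coefficient exactly mod the 63-bit prime p, while the floating-point channel gives an approximation whose absolute error stays well under p/2, so the float determines which multiple of p to add back.

```python
def recover_coefficient(approx, residue, p):
    """Recover an exact convolution coefficient of up to ~102 bits from a
    floating-point approximation `approx` and its exact residue mod a
    ~63-bit prime `p`.  Correct whenever |approx - true value| < p / 2."""
    k = round((approx - residue) / p)   # which multiple of p to add back
    return residue + k * p
```

For example, with p = 2^63 - 25 (a 63-bit prime) and a 100-bit coefficient, a float channel accurate to ~40 bits leaves an absolute error around 2^60, comfortably below p/2 ≈ 2^62, so the rounding in `recover_coefficient` lands on the correct multiple.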