The usual concept of an integer FFT is to choose a prime q such that q-1
is divisible by a reasonably large power of 2, say N = 2^n, find a
primitive (integer) N-th root of 1 mod q, say w, then use w and
arithmetic mod q to calculate the FFT. If it also happens that there is
an integer N-th root of 2 mod q, then the FFT can be converted into a
DWT suitable for implementing multiplication mod 2^p - 1 for some large
prime p. For example, with q = 2^64 - 2^32 + 1, we have both roots for N
up to 2^26, and for that matter we can also use an FFT/DWT of order N =
5*2^n up to 5*2^26. However, this requires calculating products of the
form a*b mod q where a and b are 64-bit integers, which, while
straightforward, will probably take more time than the corresponding
complex floating-point multiply in the ordinary FFT/DWT. It is also hard
to find appropriate primes q for which this works.

Richard Crandall has proposed implementing an FFT in the Galois field
GF(q^2), where q is itself a Mersenne prime, e.g. q = 2^61 - 1. He calls
this the fast Galois transform (FGT). The idea is that for any prime q =
3 mod 4, the field of Gaussian integers a + b*i mod q is isomorphic to
GF(q^2), thus we can simply replace complex real values with complex
integers mod q. The multiplicative order of GF(q^2) is q^2-1, thus for q
= 2^61 - 1, q^2 - 1 will be divisible by 2^62, thus we can find a
complex integer w such that w^2^62 = 1. Also, since the order of 2 in
GF(q^2) is 61, there will also be a value r such that r^2^62 = 2. We can
use r and w to implement a complex integer DWT mod q, which requires
code to add, subtract and multiply mod q. However, we still have the
problem that the calculation of a*b mod q is likely to take more time
than we would really like.

I have a suggestion. The FGT requires only a prime q = 3 mod 4, for
which q+1 is divisible by a reasonably large power of 2. Let's consider
primes q = k*2^n - 1 which are slightly less than 2^32. The idea here is
that this will require only 32-bit integer arithmetic. If we use two
such primes, q1 and q2, and we calculate separately the two sets of
convolution sums z1[0..N-1] mod q1 and z2[0..N-1] mod q2, then we can
use the Chinese Remainder Theorem to calculate the convolution sums
z[0..N-1] mod q1*q2. This is less difficult than it might seem, since we
can precalculate values u1 and u2 such that u1 = 1 mod q1, u1 = 0 mod
q2, u2 = 0 mod q1, u2 = 1 mod q2. Then if we have, from the two FGT
calculations, values z1[j] and z2[j] such that z[j] = z1[j] mod q1 and
z[j] = z2[j] mod q2, we have z[j] = u1*z1[j] + u2*z2[j] mod q1*q2, which
is easily verified to be correct. This calculation can be further
simplified because of the fact that u1 + u2 = 1 mod q1*q2 (in fact, u1 +
u2 = q1*q2 + 1). Thus, (u1+u2)*(z1[j]+z2[j]) = z1[j]+z2[j] mod q1*q2,
and z[j] = u1*z1[j] + u2*z2[j] = (z1[j] + z2[j] + (u1-u2)*(z1[j]-z2[j]))
/ 2 mod q1*q2. This requires only a single multiply
(u1-u2)*(z1[j]-z2[j]) mod q1*q2, together with some shifts and adds mod
q1*q2.

The rationale for this is that the efficacy of an integer FFT depends on
the size of the modulus. If the modulus is slightly less than 2^32, then
for N in the range used by GIMPS we may be able to use only 6-bit
coefficients, whereas with a modulus slightly less than 2^64, in this
case q1*q2, we will be able to use 22-bit coefficients, thereby
increasing the range of validity by a factor of nearly 4, in exchange
for using two FGTs instead of one and the Chinese Remainder step at the
end of each iteration.

It is worth noting also, in the particular case of the Pentium, that
there is an efficient method of calculating a*b mod q, for q < 2^32,
using the floating-point unit. This is nice because FMUL is faster than
MUL on the Pentium, and because we can interleave integer adds and
subtracts mod q with floating-point multiplies mod q.

Here is a concrete example for N = 3*2^18:

                      N-th root of 1      N-th root of 2
  q1 4293525503  2908044543+2957159114*i    960683048
  q2 4292083711  4225761219+3412039801*i   1768092337

With these values, we can construct an FGT for both q1 and q2. To obtain
the convolution sum z[j] from z1[j] and z2[j], we have

  u1 = 11727005047594504564
  u2 =  6701165826594877070
  q1*q2 = 18428170874189381633
  u1-u2 =  5025839220999627494

We then multiply (u1-u2)*(z1[j]-z2[j]), which is the product of a signed
63-bit number by a signed 32-bit number. It is best to save the signs of
u1-u2 and z1[j]-z2[j] and use their absolute values, since a 64x32
unsigned multiply can be implemented with two 32x32 unsigned multiplies
to produce a 96-bit unsigned result. Reducing this mod q1*q2 is a bit
messy but doable; essentially, we want to use the FPU to calculate an
estimate of floor(product/q1*q2), then reduce product to product -
floor(product/q1*q2)*(q1*q2), then if necessary iterate until the result
is reduced mod q1*q2. Then, depending on the saved signs, we either add
or subtract this product from z1[j]+z2[j] mod q1*q2 and halve it, again
mod q1*q2, to obtain z[j].

Note that, as is the case for the complex floating-point DWT, the N-th
root of 2 for the FGT is a real number. Thus, scaling by powers of the
N-th root of 2 requires multiplying a real integer by a complex integer,
which is more efficient than a full complex multiplication.


Regards,

Bill
**********************************************************************
This email and any files transmitted with it are confidential and 
intended solely for the use of the individual or entity to whom 
they are addressed. 
This footnote also confirms that this email message has been swept by 
MIMEsweeper for the presence of computer viruses.
**********************************************************************

Reply via email to