Bill Daly wrote:

> The usual concept of an integer FFT is to choose a prime q such that q-1
> is divisible by a reasonably large power of 2, say N = 2^n, find a
> primitive (integer) N-th root of 1 mod q, say w, then use w and
> arithmetic mod q to calculate the FFT. If it also happens that there is
> an integer N-th root of 2 mod q, then the FFT can be converted into a
> DWT suitable for implementing multiplication mod 2^p - 1 for some large
> prime p. For example, with q = 2^64 - 2^32 + 1, we have both roots for N
> up to 2^26, and for that matter we can also use an FFT/DWT of order N =
> 5*2^n up to 5*2^26. However, this requires calculating products of the
> form a*b mod q where a and b are 64-bit integers, which, while
> straightforward, will probably take more time than the corresponding
> complex floating-point multiply in the ordinary FFT/DWT. 

That depends on the CPU.  On StrongARM it is much quicker to do a 64*64
integer multiply because it doesn't have an FPU...

> It is also hard to find appropriate primes q for which this works.

Agreed.

> Richard Crandall has proposed implementing an FFT in the Galois field
> GF(q^2), where q is itself a Mersenne prime, e.g. q = 2^61 - 1. He calls
> this the fast Galois transform (FGT). The idea is that for any prime q =
> 3 mod 4, the field of Gaussian integers a + b*i mod q is isomorphic to
> GF(q^2), thus we can simply replace complex real values with complex
> integers mod q. The multiplicative order of GF(q^2) is q^2-1, thus for q
> = 2^61 - 1, q^2 - 1 will be divisible by 2^62, thus we can find a
> complex integer w such that w^2^62 = 1. Also, since the order of 2 in
> GF(q^2) is 61, there will also be a value r such that r^2^62 = 2. We can
> use r and w to implement a complex integer DWT mod q, which requires
> code to add, subtract and multiply mod q. 

That is a very interesting idea.  Do you have any further references?

> However, we still have the problem that the calculation of a*b mod q is
> likely to take more time than we would really like.
> 
> I have a suggestion. The FGT requires only a prime q = 3 mod 4, for
> which q+1 is divisible by a reasonably large power of 2. Let's consider
> primes q = k*2^n - 1 which are slightly less than 2^32. The idea here is
> that this will require only 32-bit integer arithmetic. If we use two
> such primes, q1 and q2, and we calculate separately the two sets of
> convolution sums z1[0..N-1] mod q1 and z2[0..N-1] mod q2, then we can
> use the Chinese Remainder Theorem to calculate the convolution sums
> z[0..N-1] mod q1*q2. This is less difficult than it might seem, since we
> can precalculate values u1 and u2 such that u1 = 1 mod q1, u1 = 0 mod
> q2, u2 = 0 mod q1, u2 = 1 mod q2. Then if we have, from the two FGT
> calculations, values z1[j] and z2[j] such that z[j] = z1[j] mod q1 and
> z[j] = z2[j] mod q2, we have z[j] = u1*z1[j] + u2*z2[j] mod q1*q2, which
> is easily verified to be correct. This calculation can be further
> simplified because of the fact that u1 + u2 = 1 mod q1*q2 (in fact, u1 +
> u2 = q1*q2 + 1). Thus, (u1+u2)*(z1[j]+z2[j]) = z1[j]+z2[j] mod q1*q2,
> and z[j] = u1*z1[j] + u2*z2[j] = (z1[j] + z2[j] + (u1-u2)*(z1[j]-z2[j]))
> / 2 mod q1*q2. This requires only a single multiply
> (u1-u2)*(z1[j]-z2[j]) mod q1*q2, together with some shifts and adds mod
> q1*q2.
> 
> The rationale for this is that the efficacy of an integer FFT depends on
> the size of the modulus. If the modulus is slightly less than 2^32, then
> for N in the range used by GIMPS we may be able to use only 6-bit
> coefficients, whereas with a modulus slightly less than 2^64, in this
> case q1*q2, we will be able to use 22-bit coefficients, thereby
> increasing the range of validity by a factor of nearly 4, in exchange
> for using two FGTs instead of one and the Chinese Remainder step at the
> end of each iteration.

I looked into doing something similar for the StrongARM.  Not with a FGT but
with 2 or 3  ordinary integer 32 bit FFTs followed by a CRT step.  I came to
the conclusion that it wasn't worth it.

The main cost of the FFT isn't the time to multiply it is the time to get
things in and out of memory.  I wouldn't have thought a 32 bit FFT would be
much more than twice the speed of a 64 bit FFT and I would expect any
advantage would get chewed up by the CRT.  That said you can combine the CRT
with other bits of the DWT I expect so maybe it wouldn't be that bad.  On the
Pentium it may be a different story entirely.

One particular problem with StrongARM is not having a divide instruction it
makes the modulo arithmetic rather tricky for arbitrary q and there aren't
that many good q's below 2^32.

I like very much the idea of using the FGT with 2^61-1 though!
-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/

Reply via email to