> From: "Olivier Langlois" <[EMAIL PROTECTED]>
> Date: Wed, 6 Jan 1999 22:31:24 -0500
> Subject: Mersenne: Questions on Crandall algorithm
> 
> Hi,
> 
> I'm trying to understand Crandall's algorithm, but I'm not sure I get the
> idea behind it. So let me explain what I've understood, and I would
> appreciate it if you could tell me whether I got it right.
> <snip>

Let's consider a simple example. Suppose that we have two values

  X = x0 + x1*2^4 + x2*2^8 + x3*2^12
  Y = y0 + y1*2^4 + y2*2^8 + y3*2^12

and we multiply them to obtain

  Z = X*Y = z0 + z1*2^4 + z2*2^8 + z3*2^12 + z4*2^16 + z5*2^20 + z6*2^24

Then we have

  z0 = x0*y0
  z1 = x0*y1 + x1*y0
  z2 = x0*y2 + x1*y1 + x2*y0
  z3 = x0*y3 + x1*y2 + x2*y1 + x3*y0
  z4 = x1*y3 + x2*y2 + x3*y1
  z5 = x2*y3 + x3*y2
  z6 = x3*y3

Sums of this form are called convolution sums, and the FFT essentially
provides a very efficient way to calculate the values of convolution
sums.
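To make this concrete, here is a minimal Python sketch (a textbook
radix-2 FFT, not Crandall's optimized code) that recovers the z0..z6
above: zero-pad both coefficient lists, transform, multiply pointwise,
inverse-transform, and round back to integers.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k], out[k + n // 2] = even[k] + w * odd[k], even[k] - w * odd[k]
    return out

def convolve(x, y):
    """Linear convolution of x and y via FFT, zero-padded to avoid wraparound."""
    n = 1
    while n < len(x) + len(y) - 1:
        n *= 2
    X = fft([complex(v) for v in x] + [0j] * (n - len(x)))
    Y = fft([complex(v) for v in y] + [0j] * (n - len(y)))
    Z = fft([a * b for a, b in zip(X, Y)], invert=True)
    # Round back to integers; the imaginary parts are pure round-off error.
    return [round(z.real / n) for z in Z[:len(x) + len(y) - 1]]

# x0..x3 = 1,2,3,4 and y0..y3 = 5,6,7,8 (each coefficient < 2^4)
print(convolve([1, 2, 3, 4], [5, 6, 7, 8]))  # → [5, 16, 34, 60, 61, 52, 32]
```

The zero padding to length 8 is what makes the cyclic wraparound terms
vanish, so the result is the ordinary (acyclic) convolution.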

When we implement the FFT using complex floating-point arithmetic, the
values obtained are only approximations to the correct convolution sums,
because of the inevitable round-off error which builds up during the
calculation. If, however, we know that the coefficients in X and Y are
integers, then we expect the calculated convolution sums to be close to
integers, and if they are sufficiently close, we can simply round the
calculated sums to the nearest integer to recover the exact values. This
will work provided that the maximum error in any calculated convolution
sum is less than 0.5. On the Pentium, for example, double-precision
floating-point values carry 53 bits of precision, so we might expect the
calculation of z3 to accumulate a relative error of at most 3 bits;
thus, if the integer part of z3 is less than 2^50, we should be able to
round it to the correct integer. This will be the case if the
coefficients in X and Y are less than 2^24, which is much more than
adequate here, since in fact the coefficients are limited to less than
2^4.

Now, let's modify the original setup:

  X = x0 + x1*2^5 + x2*2^9 + x3*2^13
  Y = y0 + y1*2^5 + y2*2^9 + y3*2^13

and we multiply them to obtain

  Z = X*Y = z0 + z1*2^5 + z2*2^9 + z3*2^13 + z4*2^17 + z5*2^22 + z6*2^26

Then we have

  z0 = x0*y0
  z1 = x0*y1 + x1*y0
  z2 = x0*y2 + 2*x1*y1 + x2*y0
  z3 = x0*y3 + 2*x1*y2 + 2*x2*y1 + x3*y0
  z4 = 2*x1*y3 + 2*x2*y2 + 2*x3*y1
  z5 = x2*y3 + x3*y2
  z6 = x3*y3

These values are very similar to convolution sums, except that certain
products have an extra factor of 2. Note that whereas in the original
setup, the exponents in X and Y were in arithmetic progression, in the
modified setup, they are nearly in arithmetic progression. We can put
them into arithmetic progression by replacing the powers of 2 in X, Y
and Z as follows:

  2^5 ->  (2^(3/4)) * (2^(17/4))
  2^9 ->  (2^(2/4)) * (2^(34/4))
  2^13 -> (2^(1/4)) * (2^(51/4))
  2^17 ->         1 * (2^(68/4))
  2^22 -> (2^(3/4)) * (2^(85/4))
  2^26 -> (2^(2/4)) * (2^(102/4))

so that for each term, e.g. x1*2^5, we write instead x1*2^5 =
x1*(2^(3/4))*(2^(17/4)) = x1'*2^(17/4), where x1' = x1*2^(3/4).
Multipliers like 2^(3/4) here are Crandall's two-to-the-phi values. We
then have

  X = x0' + x1'*2^(17/4) + x2'*2^(34/4) + x3'*2^(51/4)
  Y = y0' + y1'*2^(17/4) + y2'*2^(34/4) + y3'*2^(51/4)

and we multiply them to obtain

  Z = X*Y = z0' + z1'*2^(17/4) + z2'*2^(34/4) + z3'*2^(51/4) +
z4'*2^(68/4) + z5'*2^(85/4) + z6'*2^(102/4)

Then we have

  z0' = x0'*y0'
  z1' = x0'*y1' + x1'*y0'
  z2' = x0'*y2' + x1'*y1' + x2'*y0'
  z3' = x0'*y3' + x1'*y2' + x2'*y1' + x3'*y0'
  z4' = x1'*y3' + x2'*y2' + x3'*y1'
  z5' = x2'*y3' + x3'*y2'
  z6' = x3'*y3'

Since these are true convolution sums, we can use the FFT to compute
them. When we write them out with the original coefficients and the
two-to-the-phi multipliers, we get

  z0 = z0' = x0'*y0' = x0*y0
  z1*2^(3/4) = z1' = x0'*y1' + x1'*y0' = x0*y1*2^(3/4) + x1*y0*2^(3/4)
  z2*2^(2/4) = z2' = x0'*y2' + x1'*y1' + x2'*y0' = x0*y2*2^(2/4) +
x1*y1*2^(6/4) + x2*y0*2^(2/4)
  z3*2^(1/4) = z3' = x0'*y3' + x1'*y2' + x2'*y1' + x3'*y0' =
x0*y3*2^(1/4) + x1*y2*2^(5/4) + x2*y1*2^(5/4) + x3*y0*2^(1/4)
  z4 = z4' = x1'*y3' + x2'*y2' + x3'*y1' = x1*y3*2^(4/4) + x2*y2*2^(4/4)
+ x3*y1*2^(4/4)
  z5*2^(3/4) = z5' = x2'*y3' + x3'*y2' = x2*y3*2^(3/4) + x3*y2*2^(3/4)
  z6*2^(2/4) = z6' = x3'*y3' = x3*y3*2^(2/4)

thus, when we back out the two-to-the-phi multipliers, e.g. by
multiplying z1' by 2^(-3/4), we get

  z0 = x0*y0
  z1 = x0*y1 + x1*y0
  z2 = x0*y2 + 2*x1*y1 + x2*y0
  z3 = x0*y3 + 2*x1*y2 + 2*x2*y1 + x3*y0
  z4 = 2*x1*y3 + 2*x2*y2 + 2*x3*y1
  z5 = x2*y3 + x3*y2
  z6 = x3*y3

which are precisely the coefficients we are looking for. Note that when
the X and Y coefficients are integers, so are the Z coefficients, thus
again we can round off to the nearest integer to obtain the exact
result.
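Here is the same weight trick in a short Python sketch (names are mine,
not Crandall's): apply the two-to-the-phi weights, take a plain linear
convolution of the weighted coefficients (the part an FFT would compute;
done directly here for clarity), and back the weights out to recover the
near-convolution sums with their factors of 2.

```python
import math

q, n = 17, 4                      # 17 bits spread over 4 coefficients

def s(i):
    """Fractional exponent s(i) = ceil(i*q/n) - i*q/n (the phi values)."""
    return math.ceil(i * q / n) - i * q / n

x, y = [1, 2, 3, 4], [5, 6, 7, 8]

# Apply the two-to-the-phi weights: xi' = xi * 2^s(i)
xw = [x[i] * 2 ** s(i) for i in range(n)]
yw = [y[i] * 2 ** s(i) for i in range(n)]

# Plain linear convolution of the weighted coefficients.
zw = [0.0] * (2 * n - 1)
for i in range(n):
    for j in range(n):
        zw[i + j] += xw[i] * yw[j]

# Back out the weights: zk = zk' * 2^(-s(k)); s is periodic with period n.
z = [round(zw[k] * 2 ** -s(k)) for k in range(2 * n - 1)]
print(z)          # → [5, 16, 46, 92, 122, 52, 32]: note the factors of 2
```

With these numbers, z2 = x0*y2 + 2*x1*y1 + x2*y0 = 7 + 24 + 15 = 46, and
so on: exactly the near-convolution sums listed above.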

The upshot is that we can use the FFT not only to compute true
convolution sums, but also the near-convolution sums which arise when
the exponents in X and Y are nearly in arithmetic progression.

Note also that when we are performing the LL-test, we want the result
mod 2^q-1, which has the serendipitous effect of reducing the complexity
of the calculation, in the following way. In the above example, if we
want the result mod 2^17-1, we can write

  Z = z0 + z1*2^5 + z2*2^9 + z3*2^13 + z4*2^17 + z5*2^22 + z6*2^26
    = (z0 + z4) + (z1 + z5)*2^5 + (z2 + z6)*2^9 + z3*2^13

since 2^17 = 1, 2^22 = 2^5, etc. mod 2^17-1. This gives

  z0 + z4 = x0*y0 + 2*x1*y3 + 2*x2*y2 + 2*x3*y1
  z1 + z5 = x0*y1 + x1*y0 + x2*y3 + x3*y2
  z2 + z6 = x0*y2 + 2*x1*y1 + x2*y0 + x3*y3
  z3      = x0*y3 + 2*x1*y2 + 2*x2*y1 + x3*y0

and similarly with the coefficients with two-to-the-phi multipliers

  z0' + z4' = x0'*y0' + x1'*y3' + x2'*y2' + x3'*y1'
  z1' + z5' = x0'*y1' + x1'*y0' + x2'*y3' + x3'*y2'
  z2' + z6' = x0'*y2' + x1'*y1' + x2'*y0' + x3'*y3'
  z3'       = x0'*y3' + x1'*y2' + x2'*y1' + x3'*y0'

Sums of this form are called cyclic convolution sums, and they are in
fact precisely what the FFT calculates. Indeed, in order to adapt the
FFT to ordinary multiplication, it is necessary to pad the coefficient
array with 0's so as to eliminate the unwanted wraparound terms which
arise in the cyclic convolution sums. Working mod 2^q-1, no padding is
needed, so we can eliminate half the data which would otherwise be
required!
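A quick Python sketch of this (using the earlier base-2^4 example, where
the exponents are in exact arithmetic progression, so no weights are
needed; here q = 16): the FFT of length 4, with no padding, directly
yields the cyclic sums, and the wraparound is exactly reduction mod
2^16-1.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k], out[k + n // 2] = even[k] + w * odd[k], even[k] - w * odd[k]
    return out

def cyclic_convolve(x, y):
    """Length-n cyclic convolution via FFT -- no zero padding, so
    products whose exponent would exceed n-1 wrap around to the start."""
    n = len(x)
    X = fft([complex(v) for v in x])
    Y = fft([complex(v) for v in y])
    Z = fft([a * b for a, b in zip(X, Y)], invert=True)
    return [round(z.real / n) for z in Z]

x, y = [1, 2, 3, 4], [5, 6, 7, 8]   # digits in base 2^4
c = cyclic_convolve(x, y)
print(c)                  # → [66, 68, 66, 60] = [z0+z4, z1+z5, z2+z6, z3]
X = sum(x[i] * 16 ** i for i in range(4))
Y = sum(y[i] * 16 ** i for i in range(4))
assert sum(c[k] * 16 ** k for k in range(4)) % (2**16 - 1) == (X * Y) % (2**16 - 1)
```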

We can generalize this. Suppose that the Mersenne exponent is q, and the
FFT order is N. We are going to square a polynomial with N coefficients,
and to take advantage of the reduction mod 2^q-1, we will want the N-th
convolution sum zN to be multiplied by 2^q, which means that the base we
want to use will be 2^(q/N). Let r(i) = ceil(i*q/N), s(i) = r(i) -
i*q/N. Write X in the form

  X = x0 + x1*2^r(1) + ... + xi*2^r(i) + ... + x_(N-1)*2^r(N-1)
    = x0 + x1*2^s(1)*2^(q/N) + ... + xi*2^s(i)*2^(i*q/N) + ... +
x_(N-1)*2^s(N-1)*2^((N-1)*q/N)
    = x0' + x1'*2^(q/N) + ... + xi'*2^(i*q/N) + ... +
x_(N-1)'*2^((N-1)*q/N)

where xi' = xi*2^s(i). We can use the FFT to square this mod 2^q-1,
giving

  Z = X^2 = z0' + z1'*2^(q/N) + ... + zi'*2^(i*q/N) + ... + zN'*2^q +
... + z_(2*N-2)'*2^((2*N-2)*q/N)
    = (z0' + zN') + (z1' + z_(N+1)')*2^(q/N) + ... + (z_(N-2)' +
z_(2*N-2)')*2^((N-2)*q/N) + z_(N-1)'*2^((N-1)*q/N)
    = (z0 + zN) + (z1 + z_(N+1))*2^s(1)*2^(q/N) + ... +
z_(N-1)*2^s(N-1)*2^((N-1)*q/N)
    = (z0 + zN) + (z1 + z_(N+1))*2^r(1) + ... + z_(N-1)*2^r(N-1)

where (zi + z_(i+N)) = (zi' + z_(i+N)')*2^(-s(i)). Since (zi + z_(i+N))
is presumed to be an integer, we can round the calculated value to the
nearest integer to make the result exact.
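Putting the pieces together, here is a small Python sketch of the whole
weighted squaring mod 2^q-1 (function names are mine; this omits the
carry propagation a real implementation would do afterwards), checked on
the q = 17, N = 4 example:

```python
import cmath
import math

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k], out[k + n // 2] = even[k] + w * odd[k], even[k] - w * odd[k]
    return out

def square_mod_mersenne(x, q, n):
    """Square X = sum xi*2^r(i) mod 2^q-1 via the weighted cyclic convolution.

    Returns the n sums (zk + z_(k+n)); r(i) = ceil(i*q/n), s(i) = r(i) - i*q/n.
    """
    s = [math.ceil(i * q / n) - i * q / n for i in range(n)]
    xw = [x[i] * 2 ** s[i] for i in range(n)]      # apply the weights
    X = fft([complex(v) for v in xw])
    Z = fft([v * v for v in X], invert=True)       # cyclic self-convolution
    # The weights are periodic mod n, so entry k unweights by 2^(-s(k)).
    return [round((Z[k].real / n) * 2 ** -s[k]) for k in range(n)]

q, n = 17, 4
x = [1, 2, 3, 4]                    # X = 1 + 2*2^5 + 3*2^9 + 4*2^13
z = square_mod_mersenne(x, q, n)
print(z)                            # → [51, 28, 30, 32]
X = sum(x[i] * 2 ** math.ceil(i * q / n) for i in range(n))
assert sum(z[k] * 2 ** math.ceil(k * q / n) for k in range(n)) % (2**q - 1) \
    == (X * X) % (2**q - 1)
```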

Note that the number of bits required for the coefficient xi is
r(i+1)-r(i), which will sometimes be equal to floor(q/N) and sometimes
to ceil(q/N), depending on the value of i. We thus have xi <
2^ceil(q/N), xi*xj < 2^(2*ceil(q/N)), and since there are N products of
this form in each cyclic convolution sum, zk < N*2^(2*ceil(q/N)). Thus,
the number of integer bits in zk is at most log2(N)+2*ceil(q/N).
Assuming 53 bits of precision, of which the low-order (say) 13 bits are
unreliable, we have 40 integer bits available for an accurate
calculation; thus with (say) N = 2^20, we can handle exponents q up to
q ~= 10*N. (This is a pessimistic estimate; in practice, we can do
better.)
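As a sanity check of that arithmetic (a sketch only; the 13 unreliable
bits are the same illustrative guess as above):

```python
import math

def max_exponent(n, precision_bits=53, unreliable_bits=13):
    """Largest q with log2(n) + 2*ceil(q/n) <= the reliable integer bits."""
    usable = precision_bits - unreliable_bits     # 40 integer bits
    max_ceil = int((usable - math.log2(n)) // 2)  # bound on ceil(q/n)
    return max_ceil * n

print(max_exponent(2 ** 20))                      # → 10485760, i.e. q ~= 10*N
```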

This is only a part of Crandall's method. He also takes advantage of a
method of packing the number to be multiplied into complex coefficients,
which reduces the order of the FFT by a factor of 2. Also, he uses
something called a "split-radix" FFT, which in some way takes advantage
of special properties of the square root of -1 to further reduce the
number of multiplies required. I haven't been able to work out how this
is done, so if someone has an explanation, maybe they could post it
here.

Regards,

Bill