George Woltman wrote:
> I'm working on version 19 of prime95 and I need your help.
> In the past, the exponents at which a larger FFT is used were picked
> rather haphazardly. I simply picked a few exponents, trying to find
> one that could run a thousand or so iterations without the convolution
> error greatly exceeding 0.3.
>
> For this release, I thought it would be nice to use a more
> scientific approach. What I've done is run almost 10000 iterations
> of an exponent and used Excel to graph the number of iterations with
> an error between 0.200 and 0.201, 0.201 and 0.202, etc. As you can imagine,
> you get a Bell-like curve that peaks at some value.
Numerical Recipes (I believe both 1st and 2nd editions) has a description
of the use of the FFT for multiplication of large integers - while their
actual implementation is awful and their error estimates are flawed (see
below), it's a useful starting point. I quote directly from Section 20.6
of the 2nd edition, "Arithmetic at arbitrary precision" (my annotations
in {}):
"In [Section] 13.1 we learned how to compute the convolution of two
vectors by the fast Fourier transform (FFT): Each vector is FFT'd,
the two complex transforms are {dyadically, or pointwise} multiplied,
and the result is inverse-FFT'd. Since the transforms are done with
floating arithmetic, we need sufficient precision so that the exact
integer value of each component {digit} of the result is discernible
in the presence of roundoff error. We should therefore allow a
(conservative) few times log_2(log_2 N) bits for roundoff in the FFT.
A number of length N bytes in radix 256 {NR's preferred base for
arbitrary-length integer arithmetic} can generate convolution components
as large as the order of (256)^2 N, thus requiring 16 + log_2 N bits of
precision for exact storage.
If [it] is the number of bits in the floating mantissa (cf. [Section]
20.1), we obtain the condition
16 + log_2 N + few x log_2 log_2 N < it (20.6.3)
We see that single precision, say with [it] = 24, is inadequate for
any interesting value of N, while double precision, say with [it] = 53,
allows N to be greater than 10^6, corresponding to some millions of
decimal digits."
Alas, careful analysis puts a few dents in this tidy framework - let's review.
First: for a random-walk type error behavior, i.e. error roughly proportional
to the square root of the number of (assumed uncorrelated) floating-point
operations, the FFT having O(N log_2 N) of these, the accumulated error is
O(sqrt(N log_2 N)), corresponding to O(log_2 sqrt(N log_2 N)) =
O(log_2 N + log_2 log_2 N) contaminated bits, the square root having been
absorbed as a factor of 1/2 into the implied constant of proportionality.
Thus the NR estimate of log_2(log_2 N) bits for roundoff in the FFT is
asymptotically too small, which perhaps explains why they require "a few
times" log_2(log_2 N) for their estimate to work, even though the constant
of proportionality predicted by the FFT opcount is O(1).
Second: NR use a strictly nonnegative-digit representation for their vectors,
i.e. digits in [0, radix-1], and an FFT of such a vector can indeed generate
very large convolution components - e.g. the first result element, which is
simply the sum of all N input digits, is guaranteed to be of size O(radix*N)
for a random string of such digits. The average transform-component size, on
the other hand, follows a random-walk behavior, i.e. is O(radix*sqrt(N)). If
we divide by the sqrt(N) factor included in the definition of the DFT (or we
can divide by N after the inverse FFT - it doesn't matter which), our average
component size is still O(radix). The dyadic (pointwise) product then bumps
the average size of the terms up to O(radix^2). The inverse FFT takes these
and produces average component size O(radix^2*sqrt(N)), but another divide by
sqrt(N) brings that down to O(radix^2) again. Upon taking the logarithm, this
translates to 2*log_2(radix) bits needed to store the convolution outputs.
Note: when we instead adopt a balanced-digit representation (as advocated by
Crandall et al.), i.e. digits in [-radix/2, +radix/2], then due to random
cancellations the convolution components, while still following a random-walk
behavior, are generally smaller than those obtained in nonnegative-digit form.
It is not clear whether the superior accuracy of balanced-digit form is due
more to this average effect or to the fact that there are no longer any selected
terms (e.g. the purely additive first FFT element) which are guaranteed to
be large, as occurs using nonnegative-digit form.
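The balanced-versus-nonnegative comparison is easy to check numerically. The
sketch below (my own illustration, not Crandall's code) computes the *exact*
integer self-convolution of a random radix-256 digit string in both forms and
compares the largest components; the balanced digits here are simply the
nonnegative ones shifted by radix/2, which suffices for a size comparison.

```python
import random

def acyclic_convolution(d):
    """Exact integer self-convolution of the digit vector d (O(N^2), for
    illustration only)."""
    n = len(d)
    out = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            out[i + j] += d[i] * d[j]
    return out

random.seed(42)
n, radix = 512, 256
nonneg = [random.randrange(radix) for _ in range(n)]    # digits in [0, 255]
balanced = [d - radix // 2 for d in nonneg]             # digits in [-128, 127]

peak_nonneg = max(abs(c) for c in acyclic_convolution(nonneg))
peak_balanced = max(abs(c) for c in acyclic_convolution(balanced))
```

With a fixed seed, peak_balanced comes out far below peak_nonneg: random
cancellation among signed digits replaces the middle components that are
guaranteed to be of size O(radix^2 * N) in the nonnegative form.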
Now, using the refined error estimate described above, allowing for a general
radix and for a fudge factor C to multiply the FFT error bits (this lumps
together the FFT opcount constant of proportionality, the fact that we're
doing not one but two FFTs and the fact that certain operations, e.g. an
occasional subtract of nearly equal numbers or add of disparately sized
numbers, incur a much larger than average roundoff), we get
2*log_2(radix) + C*[log_2 N + log_2 log_2 N] < it. (20.6.E)
Now let's extract some numbers from
the above. First we need to calibrate the constant C - using IEEE double
precision, [it] = 53, and N = 2^12, I can get around 22 bits per float
before roundoff errors hose my calculation. Thus, I get
44 + C*[12 + 3.58...] ~= 53, or C ~= 0.577.
This predicts that for FFT length 256K (N = 2^18), I should be able to use
up to radix 2^20, i.e. test exponents up to 20*2^18 = 5242880, which is
quite close - I can get up to around 5200000 in practice.
(The NR equation on the other hand, when written as
2*log_2(radix) + log_2 N + C*log_2 log_2 N < it
gives for [it] = 53, N = 2^12 and radix = 2^22 that C is a negative number,
which is clearly unreasonable.)
Since these are asymptotic estimates and we are interested in runlengths
approaching 2^20, we should actually use the large-exponent data to calibrate
the inequality. For my code, I set log_2(radix) = 5200000/2^18 = 19.836...,
and since N=2^18, this gives C = 0.6011. (give or take :)
Now let's look at George's data, which will be slightly different due to
the Intel x87 80-bit register format (in effect only while data remain
in registers). http://www.mersenne.org/status.htm says that Prime95
can test up to p = 5260000 using N = 2^18. This gives C = 0.5805. Now
plugging N=2^20 in, the formula (20.6.E) gives log_2(radix) = 19.4407...,
and predicts that for vector length 1024K, Prime95 should be able to
test up to p ~= 20385000. The above page claims 20500000, so that is
quite close.
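The calibration arithmetic above packages neatly into two small helpers (my own
sketch of inequality (20.6.E), not code from prime95):

```python
from math import log2

def calibrate_C(radix_bits, n, mantissa_bits=53):
    """Solve (20.6.E) for the fudge factor C, given an observed maximal
    working log2(radix) at FFT length n."""
    return (mantissa_bits - 2 * radix_bits) / (log2(n) + log2(log2(n)))

def max_radix_bits(n, C, mantissa_bits=53):
    """Largest log2(radix) that (20.6.E) permits at FFT length n."""
    return (mantissa_bits - C * (log2(n) + log2(log2(n)))) / 2
```

For example, calibrate_C(22, 2**12) gives roughly 0.577, and
max_radix_bits(2**20, 0.5805) gives roughly 19.44, so the predicted maximal
exponent at N = 2^20 is about 19.44 * 2^20 ~= 20.4 million, matching the
numbers above.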
The actual distribution of fractional errors in the rounding phase
(George's "bell curve") is also a very interesting topic, but that's
another chapter. The above, calibrated using observed behavior,
will at least allow automation of the setting of FFT length breakpoints.
-Ernst