[EMAIL PROTECTED] writes:

>I pointed at the Great Number and said: Because S(3021377) has this
>many bits.  Knock off a dozen digits, and it would take that many years
>just to send the data from one computer to another.
>
>I felt pretty smart.  Then he asked: if S(n) has so many bits, how does a
>desktop computer handle it?

M(3021377) has 3021377 bits (all ones), i.e. somewhat less than 400 kilobytes, so 
fitting M(3021377) into memory is not a problem for modern PCs.

However, we start off with a 3-bit number (4) and perform 3021375 squaring 
operations on it. Now squaring a number either doubles its length, or results in a 
number one bit short of double the length (irrespective of base), and the 
"subtract 2" operation has little effect on the length of a large number. 
Therefore the residual obtained by LL testing M(3021377) without remaindering 
would be more than 2^3021375 bits long. There's no way you get anywhere close to 
storing this in the entire Universe...
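The doubling claim is easy to check numerically. Here's a quick sketch (plain Python, illustrative only) that runs a few unreduced iterations and prints the bit lengths:

```python
# Unreduced Lucas-Lehmer iterations: S(0) = 4, S(k+1) = S(k)^2 - 2.
# Each squaring doubles the bit length, or doubles it less one;
# the "-2" never changes the length for these values.
s = 4
lengths = [s.bit_length()]
for _ in range(5):
    s = s * s - 2
    lengths.append(s.bit_length())
print(lengths)  # [3, 4, 8, 16, 31, 61]
```

After 3021375 such doublings the length is astronomically beyond any storage.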

You can start off with e.g. 14 instead of 4 and do one fewer iteration, or with 
194 and save two iterations, irrespective of the exponent (> 4) ... but, really, 
the potential saving in time is very, very small as a proportion of the total work.
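Those alternative seeds are just the next terms of the standard sequence, which a two-line check confirms (plain Python):

```python
# With S(0) = 4 and S(k+1) = S(k)^2 - 2, the next two terms are 14 and 194,
# so seeding with 14 (or 194) simply skips one (or two) iterations.
assert 4 * 4 - 2 == 14
assert 14 * 14 - 2 == 194
print("seeds check out")
```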

>makes what you'd like to do utterly impossible. It would ONLY be possible if
>we stored the FULL S(whatever) every cycle, and never took it modulo 2^P - 1.
>However, we'd need machines with (insert fantastically large number, perhaps
>more than the universe can handle) exabytes of RAM to store the full S for an
>exponent around 3 million. And so forth. Aw. :-(

Precisely.

However, there *is* some mileage in the idea - and a useful double-checking trick?

Suppose we take a number of exponents that all need checking (or double-checking) 
- say 3, which I'll call p, q and r. (The number doesn't matter to the theory.) 
Now run the standard LL iteration once, working modulo M(p)*M(q)*M(r), and take 
note of the WHOLE value of the running residue (not just the low bits) at 
iteration p-2, at iteration q-2 and at iteration r-2. In other words, do the "few 
extra iterations" suggested!
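As a sanity check of the idea, here's a minimal sketch (plain Python, with small illustrative exponents; `combined_residues` and `lucas_lehmer_residue` are names I've made up). It runs one sequence modulo the product and snapshots the running value at each exponent's iteration count:

```python
def lucas_lehmer_residue(p):
    """Standard LL residue: S(p-2) mod M(p), with S(0) = 4."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s

def combined_residues(exps):
    """One LL sequence mod the product of the M(e); snapshot at each e-2."""
    modulus = 1
    for e in exps:
        modulus *= (1 << e) - 1
    snapshots = {}
    s = 4
    for i in range(1, max(exps) - 1):   # i counts iterations performed
        s = (s * s - 2) % modulus
        for e in exps:
            if i == e - 2:
                # M(e) divides the modulus, so this equals S(e-2) mod M(e)
                snapshots[e] = s % ((1 << e) - 1)
    return snapshots

exps = [13, 17, 19]   # small Mersenne prime exponents, for illustration
combined = combined_residues(exps)
for e in exps:
    assert combined[e] == lucas_lehmer_residue(e)
# M(13), M(17) and M(19) are prime, so each LL residue is 0:
print(combined)  # {13: 0, 17: 0, 19: 0}
```

The snapshots match the independent tests because the sequence itself doesn't depend on the modulus - reducing mod the product merely keeps the working value small enough to handle.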

If we know S(p) mod (k*M(p)) for any integer k, then obtaining S(p) mod M(p) is 
essentially trivial: since M(p) is just a string of 1 bits, the reduction can be 
done with a few (extended-precision) shift & add operations.
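For illustration, a minimal sketch of that shift-and-add reduction (plain Python; `mod_mersenne` is a hypothetical helper name), exploiting the fact that 2^p ≡ 1 (mod 2^p - 1):

```python
def mod_mersenne(x, p):
    """Reduce x modulo M(p) = 2^p - 1 using only shifts, masks and adds."""
    m = (1 << p) - 1
    while x > m:
        # 2^p ≡ 1 (mod M(p)), so the high part folds straight onto the low part
        x = (x & m) + (x >> p)
    return 0 if x == m else x

# Agreement with ordinary remaindering:
for x in (0, 8190, 8191, 123456789, 2**200 + 12345):
    assert mod_mersenne(x, 13) == x % ((1 << 13) - 1)
print("shift-and-add reduction agrees")
```

Each pass chops x into p-bit chunks and adds them, so no division is ever needed.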

The problem is that, superficially, this procedure doesn't seem to save any time - 
but it shouldn't take significantly longer than running the three independent 
tests. A practical implementation using DWT / FFT / NTT would obviously require a 
different transform size than would be used (optimally) to do the three tests 
independently. Also, the data worked on would be different enough to ward off the 
risk of data-dependent processor bugs.

In practice, there may be some small time saving possible. It turns out that the 
fast multiplication algorithms are most efficiently implemented when the transform 
size is a power of 2. So, e.g., exponents around 3 million are relatively 
expensive to check (using Prime95) because they're too large for a 128K transform 
size but too small to make full use of a 256K transform size. However, checking 3 
at once would fit a 512K transform size reasonably well. There is, of course, a 
converse trend: larger transform sizes tend to cause more cache misses, so cache 
tuning may become a significant issue at large transform sizes.

My suggested "trick" applies especially to other programs, e.g. MacLucasUNIX, 
which have no intermediate transform sizes between powers of two. A couple of days 
ago my Alpha system finished double-checking M(2361701) using a 128K transform; 
the next exponent I had lined up for it was M(2361721), and it decided that a 256K 
transform is needed, thus doubling the run time 8-( (Don't forget this is 
DIFFERENT CODE from Prime95 - in fact, the Alpha FPU works to only 53 bits 
precision (IEEE double precision), rather than the 64 bits precision which the 
Intel x86 FPU is capable of.)

Can anyone fault my logic? If not, then my suggested approach may be more 
effective than writing extra code into MacLucasUNIX (and other LL testers) to cope 
with transform sizes intermediate between powers of 2. It also saves debugging a 
whole load of extra FFT code 8-)

Regards
Brian Beesley
________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
