Mersenne: Re: fast GCD
George Woltman writes:

> Since F24 is 5 million digits your computer would take 5 minutes * 5^2,
> or 2 hours. Given your machine is faster, my code is 32-bit, my code
> does an extended GCD, the Alpha is a better architecture, and your code
> is probably better - you come up with the factor of 12 difference.

Was your code doing a true Fermat-mod DWT, or were you attempting to factor instead the Mersenne number M(2^25) = 2^(2^25)-1 = (2^(2^24)-1)*(2^(2^24)+1)? In the latter case your GCD would involve vectors twice as long, which would mean a factor-of-4 runtime increase for an O(n^2) algorithm.

> Now the good news. Crandall's giants code has an implementation of the
> O(n * log n * log n) GCD in it. It was painfully slow. After two days of
> optimizing, I will be able to do that 5 million digit GCD in an
> estimated 30 minutes.

Impressive - can you give us a brief summary of the most time-impacting changes you made, and were these to the C source or in ASM?

> BTW, I know of no other publicly available implementation of the fast
> GCD. It was written by J. P. Buhler. I know little about it, except that
> Crandall mentioned it took them the better part of 6 months to get the
> thing written and debugged.

Better them than you, right? :)

Jason Papadopoulos writes:

> I've heard lots of complaints that IA64 has no fully-integer multiply
> (i.e. you have to load your numbers into an FPU register and pull the
> result back). Actually, there have been a lot of gripes about IA64;
> Intel's developers' guide is great reading but quite difficult at times.

Respectfully, I believe this is a point on which the purists may be plain wrong. One thing that impresses me about the IA64 (at least judging from the hardware guide) is the flexibility regarding loading integers into either a 64-bit integer (a.k.a. general-purpose) register or into a floating register mantissa - sure, the IA64 has no separate integer multiply unit, but if it can quickly load two 64-bit ints into floating registers and then use the floating multiply unit (which is fully pipelined and very fast) to get either the low or high 64 bits of the 128-bit (exact integer) product, then that's a real gain.

One thing about the above operation (specifically the upper 64-bit calculation) that intrigues me is how the IA64 effects error correction of the floating product. On an IEEE-compliant machine, one can use an FMUL to get the upper 52 bits of an integer product of up to 52+64 = 116 bits in length, a standard 64-bit integer MUL to get the low 64 bits, and one uses the remaining bit (the lowest-order bit of the 53-bit floating mantissa) to compare to the corresponding bit in the 64-bit integer product to effect error correction. Perhaps the IA-64 uses a 65th mantissa bit in the floating multiplier to do this?

Cheers,
-Ernst
ftp://209.133.33.182/pub/mayer/home.html

_
Unsubscribe list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
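[Editorial aside: the FMUL-plus-integer-MUL error-correction trick Ernst describes can be sketched in miniature. This is a hypothetical illustration, not anyone's actual code, scaled down so that an IEEE double's 53-bit mantissa plays the role of the floating multiplier for 32-bit operands: the double product is a rounded copy of the exact 64-bit product, and the exact low word pins down the rounding error.]

```python
def umulh32(x, y):
    """High 32 bits of the exact 64-bit product x*y (0 <= x, y < 2^32),
    from one double-precision multiply plus the exact low word.
    The double product rounds the true product to 53 bits, so it can be
    off by up to 2^10; the exact low 32 bits recover that error."""
    approx = int(float(x) * float(y))   # FMUL analog: product rounded to 53 bits
    lo = (x * y) & 0xFFFFFFFF           # integer-MUL analog: exact low 32 bits
    diff = (lo - approx) & 0xFFFFFFFF   # (true - approx) mod 2^32 ...
    if diff >= 1 << 31:                 # ... recentred to [-2^31, 2^31)
        diff -= 1 << 32
    return (approx + diff) >> 32        # exact 64-bit product, upper word

# spot-check against exact integer arithmetic
for a, b in [(0xFFFFFFFF, 0xFFFFFFFF), (0xDEADBEEF, 0x12345678), (3, 5)]:
    assert umulh32(a, b) == (a * b) >> 32
```

The correction works because the rounding error of the 53-bit float product is far smaller than 2^31, so it is uniquely determined by the exact low word taken mod 2^32.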
Re: Mersenne: Re: fast GCD
Hi,

At 04:07 PM 11/6/99 EST, [EMAIL PROTECTED] wrote:

> Was your code doing a true Fermat-mod DWT, or were you attempting to
> factor instead the Mersenne number M(2^25) = 2^(2^25)-1 =
> (2^(2^24)-1)*(2^(2^24)+1)?

Prime95 uses a true Fermat-mod DWT.

> Impressive - can you give us a brief summary of the most time-impacting
> changes you made, and were these to the C source or in ASM?

There were a few things in giants that were not done efficiently. For example, dividing one huge number by another was done using reciprocals and then multiplying. Well, 99.9% of the time in a GCD the quotient will be less than 2^32 and can be derived by one floating point divide on just the high order bits. It was also a bit of a memory hog. I also upgraded giants from 16-bit to 32-bit, made use of some ASM helper routines, cleaned up memory management, etc. The changes I've made have been forwarded to Dr. Crandall for possible incorporation into a future release of giants.

> BTW, I know of no other publicly available implementation of the fast
> GCD. It was written by J. P. Buhler. I know little about it, except that
> Crandall mentioned it took them the better part of 6 months to get the
> thing written and debugged. Better them than you, right? :)

Absolutely! And it is quite elegant. You should be able to incorporate it into your code with little trouble. I had little trouble upgrading it to do modular inverses too.

Regards,
George
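[Editorial aside: George's observation - that inside a GCD the quotient almost always fits in a machine word and can be had from one floating-point divide on the leading bits - can be sketched as below. This is a hypothetical illustration, not the giants code itself; the candidate quotient from the float divide may be off by a little, so a cheap exact fix-up loop restores the true division step.]

```python
def div_step(a, b):
    """One division step a = q*b + r for big integers a >= b > 0, assuming
    the true quotient is small (as it almost always is inside a GCD).
    Only the top ~52 bits of each operand feed a single float divide;
    the candidate q is then corrected with exact integer arithmetic."""
    shift = max(a.bit_length() - 52, 0)           # keep only the leading bits
    q = int((a >> shift) / ((b >> shift) or 1))   # one floating-point divide
    r = a - q * b
    while r < 0:                                  # fix-up: candidate too big
        q -= 1; r += b
    while r >= b:                                 # fix-up: candidate too small
        q += 1; r -= b
    return q, r
```

With, say, 1000-digit operands whose quotient is 12345, the single float divide on the high words replaces a full multiprecision reciprocal computation, and the fix-up loops run only a handful of times.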
Mersenne: Re: fast GCD
Alex Kruppa wrote:

> Can this be used to do a LL test and Pollard-rho factoring attempt at
> once?

Jason Papadopoulos wrote:

> Except that every once in a while you'd have to perform a GCD, and a GCD
> on million-digit size numbers takes about a day (no kidding! There seems
> to be no fast way to do one!)

A day seems somewhat of an overestimate (though maybe it takes that long on a Pentium - I don't know.) On a decently fast Alpha, say a 500MHz 21164, I can do a million-digit (and note I mean base-10 digits, not bits) GCD in under 5 minutes using a simple O(n^2) Lehmer-type implementation with scalar multipliers of around 63 bits - one gets similar performance on MIPS after adjusting for the typically lower clock rates. (Things should be 2-3x faster on Alpha 21264, due to its much better integer multiply.) The code in question is my own (and will be publicly available when I release my Mersenne p-1 code), but Torbjorn Granlund sent me timings that indicate the GMP library's GCD is similarly fast on such hardware.

The Alpha implementation uses the UMULH instruction to get the upper 64 bits of the 64x64 == 128-bit unsigned integer products which arise when multiplying the digits of a base-2^64 multiword integer by a scalar of up to 64 bits in size, and the MIPS version uses DMULTU to get the full 128-bit product. On CPUs supporting 64-bit integers and floats but lacking such 128-bit multiply instructions, one can emulate UMULH via a floating multiply, using scalars up to 52 bits in size (we need to save one of the mantissa bits to effect error correction of the FMUL result.) On the Intel IA-64, there is hardware support for a (now fully 64-bit) UMULH-like instruction in the floating multiplier (also used to get the low 64 bits of the product.)
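[Editorial aside: the multiword-times-scalar kernel Ernst describes - low word from an ordinary multiply, high word from UMULH or DMULTU, plus carry propagation - looks roughly like the following sketch. The function name and the little-endian base-2^64 digit layout are illustrative assumptions, and Python's exact big-int multiply stands in for the two hardware instructions.]

```python
def mul_by_scalar(digits, s):
    """Multiply a little-endian base-2^64 multiword integer by a scalar
    s < 2^64.  Per digit: one low-word multiply (MULQ on Alpha) and one
    high-word multiply (UMULH), then propagate the carry."""
    MASK = (1 << 64) - 1
    out, carry = [], 0
    for d in digits:
        lo = (d * s) & MASK        # low 64 bits of the 128-bit product
        hi = (d * s) >> 64         # high 64 bits (UMULH / DMULTU high half)
        t = lo + carry
        out.append(t & MASK)
        carry = hi + (t >> 64)     # hi <= 2^64 - 2, so this never overflows
    out.append(carry)              # final carry becomes the top digit
    return out
```

A Lehmer-type GCD spends nearly all its time in exactly this kernel, which is why the quality of a CPU's wide integer multiply dominates the timings quoted above.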
One can in theory do similar on the Pentium, but for this to be efficient, one needs to be able to load a 64-bit int directly into the 64-bit mantissa of a floating-point register - I don't know enough about the x86 to say whether one can do such an operation on current CPUs. I think George uses similar tricks to do the speedy Prime95 sieve-based factoring on the Pentium, so it seems it should be possible.

Lastly, note that one doesn't need to do many GCDs if one is running, say, a p-1 or ECM factoring algorithm - for large numbers, one could restrict oneself to doing a single GCD at the end of stage 1 and perhaps every day or so in stage 2. This makes a "fast" (i.e. asymptotically faster than O(n^2)) GCD less pressing.

Alex wrote:

> That's true (btw, I keep hearing of an n*log(n) GCD - how long would
> that take then? Where can I read up on it - Knuth vol. 2?)

Richard Crandall claims the fastest he's heard of is O(n*log^2(n)), i.e. a factor of log(n) slower order than a length-n FFT, though for n >> 1, this should still be much faster than O(n^2). The problem is that it's likely highly nontrivial to code - using a well-tested library (e.g. GMP) to effect the basic operations would probably be one's best bet if one doesn't want to spend months coding it all oneself. Peter Montgomery's PhD thesis mentions a fast GCD due to Moenck, but I don't know what the precise work estimate is for that algorithm - Peter?

Other GCD-related musings I've been mulling over:

1) Is there a fast (e.g. O(n*log(n)) or such) iterative method to generate a modular inverse modulo a Mersenne or Fermat number? Of course one can use an extended GCD algorithm to get the inverse, but I had in mind something more elegant, which ideally would allow fast DWT-based FFT arithmetic to be used, i.e. which avoids having to convert an n-bit mixed-radix DWT floating-point factoring residue to a fixed-radix integer, do the EGCD, then convert the result back to floating form in preparation for more DWT-style arithmetic. This question was inspired by the fact that one can use a modular analog of the well-known Newton iterative scheme for multiplicative inverses to get inverses modulo some desired power of 2. For example, if x is some odd integer and y_n is the multiplicative inverse of x modulo, say, 2^16, then one obtains the mod-2^32 inverse via y_{n+1} = y_n * (2 - x*y_n), where all intermediate arithmetic is done modulo 2^32. Alas, this scheme doesn't work except for moduli which are powers of 2.

2) If a fast iteration as described in (1) did exist, could one use it to obtain a GCD in similar time?

Cheers,
-Ernst
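[Editorial aside: the "single GCD at the end of stage 1" structure for p-1 factoring mentioned above can be sketched as follows. The function names, the tiny trial-division primality test, and the choice of bound and base are all illustrative assumptions, not any particular program's code; the point is that the whole stage performs exactly one GCD.]

```python
from math import gcd

def is_prime(n):
    """Trial division - fine for the tiny primes used in the exponent."""
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def pminus1_stage1(N, B1, base=3):
    """Stage 1 of Pollard's p-1: raise `base` to the product of all
    maximal prime powers <= B1 modulo N, then take a single GCD at the
    end.  Finds a prime factor p of N whenever p-1 is B1-smooth."""
    a = base
    for q in filter(is_prime, range(2, B1 + 1)):
        qk = q
        while qk * q <= B1:      # largest power of q not exceeding B1
            qk *= q
        a = pow(a, qk, N)        # one modular exponentiation per prime
    return gcd(a - 1, N)         # the only GCD in the whole stage
```

For example, with N = 2251 * 12347, the factor 2251 is exposed already at B1 = 125, because 2251 - 1 = 2 * 3^2 * 5^3 is 5-smooth, while 12347 - 1 = 2 * 6173 is not.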
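[Editorial aside: Ernst's mod-2^k Newton iteration is easy to demonstrate. A small sketch, with names of my own choosing: any odd x is its own inverse mod 8, and each y <- y*(2 - x*y) step doubles the number of correct low-order bits.]

```python
def inverse_mod_pow2(x, k):
    """Multiplicative inverse of odd x modulo 2^k via the Newton-style
    iteration y_{n+1} = y_n * (2 - x*y_n), doubling precision each pass."""
    assert x & 1, "x must be odd to be invertible mod 2^k"
    y, bits = x, 3                     # x*x == 1 (mod 8) for any odd x
    while bits < k:
        bits *= 2                      # each pass doubles the correct bits
        y = y * (2 - x * y) % (1 << bits)
    return y % (1 << k)
```

The doubling follows from writing x*y = 1 + e*2^m: then x*y*(2 - x*y) = 1 - e^2*2^(2m), which is 1 modulo 2^(2m) - precisely the property that fails for moduli that are not prime powers of the radix, as lamented above.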