Mersenne: Re: fast GCD

1999-11-06 Thread EWMAYER

George Woltman writes:

Since F24 is 5 million digits your computer would take 5 minutes * 5^2,
or 2 hours.  Given your machine is faster, my code is 32-bit, my code
does an extended GCD, the Alpha is a better architecture, and your code
is probably better - you come up with the factor of 12 difference.

Was your code doing a true Fermat-mod DWT, or were you attempting to
factor instead the Mersenne number M(2^25) = 2^(2^25)-1 =
(2^(2^24)-1)*(2^(2^24)+1)? In the latter case your GCD would involve
vectors twice as long, which would mean roughly a factor-of-4 runtime
increase for an O(n^2) algorithm.

Now the good news.  Crandall's giants code has an implementation of
the O(n * log n * log n) GCD in it.  It was painfully slow.  After two
days of optimizing, I will be able to do that 5 million digit
GCD in an estimated 30 minutes.

Impressive - can you give us a brief summary of the most time-impacting
changes you made, and were these to the C source or in ASM?

BTW, I know of no other publicly available implementation of the fast GCD.
It was written by J. P. Buhler.

I know little about it, except that Crandall mentioned it took them
the better part of 6 months to get the thing written and debugged.
Better them than you, right? :)

Jason Papadopoulos writes:

I've heard lots of complaints that IA64 has no fully-integer multiply
(i.e. you have to load your numbers into an FPU register and pull the
result back). Actually, there have been a lot of gripes about IA64;
Intel's developers' guide is great reading but quite difficult at times.

Respectfully, I believe this is a point on which the purists may be plain
wrong. One thing that impresses me about the IA64 (at least judging from
the hardware guide) is the flexibility regarding loading integers into
either a 64-bit integer (a.k.a. general-purpose) register or into a
floating register mantissa - sure, the IA64 has no separate integer
multiply unit, but if it can quickly load two 64-bit ints into floating
registers and then use the floating multiply unit (which is fully
pipelined and very fast) to get either the low or high 64 bits of the
128-bit (exact integer) product, then that's a real gain.
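
To make concrete what "the low or high 64 bits of the 128-bit product"
means, here is a portable C sketch - GCC/Clang's unsigned __int128 is just
standing in for whatever the hardware provides (Alpha's MULQ/UMULH pair, or
the IA-64 floating-multiplier path being discussed); this is illustrative
code, not IA-64 assembly:

#include <stdint.h>

/* Low 64 bits of the exact 128-bit product of two 64-bit integers. */
static inline uint64_t mul_lo64(uint64_t a, uint64_t b)
{
    return a * b;              /* C's 64-bit multiply already gives this */
}

/* High 64 bits of the same exact product. */
static inline uint64_t mul_hi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}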

One thing about the above operation (specifically the upper 64-bit
calculation) that intrigues me is how the IA64 effects error correction
of the floating product. On an IEEE-compliant machine, one can use an
FMUL to get the upper 52 bits of an integer product of up to 52+64=116
bits in length, a standard 64-bit integer MUL to get the low 64 bits,
and one uses the remaining bit (the lowest-order bit of the 53-bit
floating mantissa) to compare to the corresponding bit in the 64-bit
integer product to effect error correction. Perhaps the IA-64 uses a 
65-th mantissa bit in the floating multiplier to do this?

Cheers,
-Ernst

ftp://209.133.33.182/pub/mayer/home.html




Re: Mersenne: Re: fast GCD

1999-11-06 Thread George Woltman

Hi,

At 04:07 PM 11/6/99 EST, [EMAIL PROTECTED] wrote:
Was your code doing a true Fermat-mod DWT, or were you attempting to
factor instead the Mersenne number M(2^25) = 2^(2^25)-1 = (2^(2^24)-1)*(2^(2^24)+1)?

Prime95 uses a true Fermat-mod DWT.

Impressive - can you give us a brief summary of the most time-impacting
changes you made, and were these to the C source or in ASM?

There were a few things in giants that were not done efficiently.
For example, dividing one huge number by another was done using 
reciprocals and then multiplying.  Well, 99.9% of the time in a GCD
the quotient will be less than 2^32 and can be derived by one floating
point divide on just the high order bits.  It was also a bit of a
memory hog.
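
To sketch the quotient-estimation idea in illustrative C (the names and
word layout here are made up, not the actual giants code): once the two
operands are shifted so their top words line up, a single floating-point
divide on the leading ~128 bits yields the small quotient to within one,
and the subsequent multiply-and-subtract step can adjust by +-1 if needed.

#include <stdint.h>

/* Estimate q = floor(a/b) when it is known to be < 2^32.  a_top/a_next and
   b_top/b_next are the two leading 64-bit words of a and b at the SAME
   scale, with b_top != 0 (i.e. both numbers normalized). */
uint64_t estimate_small_quotient(uint64_t a_top, uint64_t a_next,
                                 uint64_t b_top, uint64_t b_next)
{
    const double R = 18446744073709551616.0;         /* 2^64 */
    double a = (double)a_top * R + (double)a_next;   /* leading bits of a */
    double b = (double)b_top * R + (double)b_next;   /* leading bits of b */
    return (uint64_t)(a / b);                        /* off by at most ~1 */
}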

I also upgraded giants from 16-bit to 32-bit, made use of some
ASM helper routines, cleaned up memory management, etc.

The changes I've made have been forwarded to Dr. Crandall for
possible incorporation into a future release of giants. 

BTW, I know of no other publicly available implementation of the fast GCD.
It was written by J. P. Buhler.

I know little about it, except that Crandall mentioned it took them
the better part of 6 months to get the thing written and debugged.
Better them than you, right? :)

Absolutely!  And it is quite elegant.  You should be able to incorporate
it into your code with little trouble.  I had little trouble upgrading
it to do modular inverses too.

Regards,
George




Mersenne: Re: fast GCD

1999-10-28 Thread EWMAYER

Alex Kruppa wrote:

 Can this be used to do a LL test and Pollard-rho factoring attempt at
 once?

Jason Papadopoulos wrote:

 Except that every once in a while you'd have to perform a GCD, and
 a GCD on million-digit size numbers takes about a day (no kidding!
 There seems to be no fast way to do one!)

A day seems somewhat of an overestimate (though maybe it takes that
long on a Pentium - I don't know.) On a decently fast Alpha, say a 500MHz
21164, I can do a million-digit (and note I mean base-10 digits, not bits)
GCD in under 5 minutes using a simple O(n^2) Lehmer-type implementation
with scalar multipliers of around 63 bits - one gets similar performance
on MIPS after adjusting for the typically lower clock rates. (Things should
be 2-3x faster on the Alpha 21264, due to its much better integer multiply.)
The code in question is my own (and will be publicly available when I
release my Mersenne p-1 code), but Torbjorn Granlund sent me timings that
indicate the GMP library's GCD is similarly fast on such hardware. The
Alpha implementation uses the UMULH instruction to get the upper 64 bits
of the 64 x 64 -> 128-bit unsigned integer products which arise when
multiplying the digits of a base-2^64 multiword integer by a scalar
of up to 64 bits in size, and the MIPS version uses DMULTU to get the
full 128-bit product. On CPUs supporting 64-bit integers and floats
but lacking such 128-bit multiply instructions, one can emulate UMULH via
a floating multiply, using scalars up to 52 bits in size (we need to save
one of the mantissa bits to effect error correction of the FMUL result.)
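
As a purely illustrative sketch of that inner operation - multiplying a
base-2^64 multiword integer by a 64-bit scalar - here it is in portable C,
with GCC/Clang's unsigned __int128 standing in for the MULQ/UMULH or
DMULTU pair (a generic sketch, not a verbatim excerpt of the code in
question):

#include <stdint.h>

/* z[0..n-1] = x[0..n-1] * q, least-significant word first; returns the
   final carry word.  The high half of each 64x64-bit product is exactly
   what UMULH (Alpha) or DMULTU (MIPS) provides in one instruction. */
uint64_t mul_by_scalar(uint64_t *z, const uint64_t *x, int n, uint64_t q)
{
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)x[i] * q + carry;
        z[i]  = (uint64_t)p;          /* low 64 bits  */
        carry = (uint64_t)(p >> 64);  /* high 64 bits */
    }
    return carry;
}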

On the Intel IA-64, there is hardware support for a (now fully 64-bit)
UMULH-like instruction in the floating multiplier (also used to get the
low 64 bits of the product.) One can in theory do similar on the Pentium,
but for this to be efficient, one needs to be able to load a 64-bit int
directly into the 64-bit mantissa of a floating-point register - I don't
know enough about the x86 to say whether one can do such an operation
on current CPUs. I think George uses similar tricks to do the speedy
Prime95 sieve-based factoring on the Pentium, so it seems it should be
possible.

Lastly, note that one doesn't need to do many GCDs if one is running, say,
a p-1 or ECM factoring algorithm - for large numbers, one could restrict
oneself to doing a single GCD at the end of stage 1 and perhaps every
day or so in stage 2. This makes a "fast" (i.e. asymptotically faster
than O(n^2)) GCD less pressing.

Alex wrote:

That's true (btw, I keep hearing of an n*log(n) GCD - how long would that
take then? Where can I read up on it - Knuth vol. 2?)

Richard Crandall claims the fastest he's heard of is O(n*log^2(n)), i.e.
a factor of log(n) slower in order than a length-n FFT, though for n >> 1,
this should still be much faster than O(n^2). The problem is that it's
likely highly nontrivial to code - using a well-tested library (e.g. GMP)
to effect the basic operations would probably be one's best bet if one
doesn't want to spend months coding it all oneself.
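
For instance, with GMP the calling code is only a few lines - the heavy
lifting all happens inside mpz_gcd (a generic usage sketch; whether the
underlying GCD is subquadratic is a separate question):

#include <stdio.h>
#include <gmp.h>

int main(void)
{
    mpz_t a, b, g;
    mpz_init(a); mpz_init(b); mpz_init(g);

    mpz_ui_pow_ui(a, 2, 127); mpz_sub_ui(a, a, 1);   /* a = 2^127 - 1 */
    mpz_ui_pow_ui(b, 2, 89);  mpz_sub_ui(b, b, 1);   /* b = 2^89  - 1 */

    mpz_gcd(g, a, b);                 /* g = gcd(a,b) = 2^gcd(127,89)-1 = 1 */
    gmp_printf("gcd = %Zd\n", g);

    mpz_clear(a); mpz_clear(b); mpz_clear(g);
    return 0;
}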

Peter Montgomery's PhD thesis mentions a fast GCD due to Moenck, but
I don't know what the precise work estimate is for that algorithm - Peter?

Other GCD-related musings I've been mulling over:

1) Is there a fast (e.g. O(n*log(n)) or such) iterative method to generate
a modular inverse modulo a Mersenne or Fermat number? Of course one can
use an extended GCD algorithm to get the inverse, but I had in mind
something more elegant, which ideally would allow fast DWT-based FFT
arithmetic to be used, i.e. which avoids having to convert an n-bit
mixed-radix DWT floating-point factoring residue to a fixed-radix integer,
do the EGCD, then convert the result back to floating form in preparation
for more DWT-style arithmetic. This question was inspired by the fact that
one can use a modular analog of the well-known Newton iterative scheme
for multiplicative inverses to get inverses modulo some desired power of 2.
For example, if x is some odd integer and y_n is the multiplicative inverse of
x modulo, say, 2^16, then one obtains the mod-2^32 inverse via

 y_{n+1} = y_n * (2 - x*y_n),

where all intermediate arithmetic is done modulo 2^32. Alas, this kind of
lifting only takes an inverse mod m to an inverse mod m^2, so it only
helps for moduli which are powers of a small base (like 2) - it doesn't
apply directly to a Mersenne or Fermat modulus.
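
For what it's worth, the power-of-2 case is tiny in practice; here is a
throwaway C sketch (64-bit unsigned arithmetic makes the reductions mod
2^64 come for free):

#include <stdint.h>

/* Inverse of an odd x modulo 2^64 via y -> y*(2 - x*y).  The starting
   guess y = x is correct to 3 bits (x^2 == 1 mod 8 for odd x), and each
   iteration doubles the number of correct low-order bits:
   3 -> 6 -> 12 -> 24 -> 48 -> 96 >= 64, so five passes suffice. */
uint64_t inv_mod_2to64(uint64_t x)
{
    uint64_t y = x;
    for (int i = 0; i < 5; i++)
        y *= 2 - x*y;          /* all arithmetic is implicitly mod 2^64 */
    return y;                  /* x*y == 1 (mod 2^64) */
}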

2) If a fast iteration as described in (1) did exist, could one use it
to obtain a GCD in similar time?

Cheers,
-Ernst
