Dienst writes (in response to Ernst Mayer)
> > If one instead does a fast polynomial-evaluation stage 2, even one
> > with modest circular convolution length (to keep memory requirements
> > reasonable), the stage 2 depth for a given amount of work should rise
> > dramatically, perhaps 2-3 orders of magnitude. One needs to do an
> > extended GCD at the start of stage 2 to generate the modular inverse
> > of the stage 1 residue, but this costs no more than an additional
> > regular GCD - in fact, one can do the regular and EGCD at the end
> > of stage 1 together.
>
> Errm, 2-3 orders of magnitude seems far too high an
> improvement. You are not going to want to do too much work in stage 2
> and with short convolutions (at 2 MB per number the convolutions will
> be mighty short) the gains will not be near that great.
Using a two-dimensional FFT, the time to multiply two bivariate
polynomials (degree nx-1 in X, degree ny-1 in Y) modulo X^nx - 1 and Y^ny - 1
is O(nx*ny*log(nx*ny)) = O(nx*ny*(log(nx) + log(ny))).
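The bivariate cyclic multiplication can be sketched in a few lines. This is a minimal illustration (not the actual stage 2 code): coefficients here are small integers rather than multi-word Mersenne residues, and the reduction modulo X^nx - 1 and Y^ny - 1 falls out of the cyclic convolution automatically.

```python
import numpy as np

def cyclic_mul_2d(a, b):
    """Multiply two bivariate polynomials, given as nx-by-ny coefficient
    arrays, modulo X^nx - 1 and Y^ny - 1 via two-dimensional FFTs."""
    fa = np.fft.fft2(a)
    fb = np.fft.fft2(b)
    # The inverse transform of the pointwise product is the 2-D cyclic
    # convolution of the coefficient arrays.
    return np.rint(np.fft.ifft2(fa * fb).real).astype(int)

# Check against a direct quadruple loop on a tiny case.
nx, ny = 4, 4
rng = np.random.default_rng(1)
a = rng.integers(0, 10, (nx, ny))
b = rng.integers(0, 10, (nx, ny))

direct = np.zeros((nx, ny), dtype=int)
for i in range(nx):
    for j in range(ny):
        for k in range(nx):
            for m in range(ny):
                direct[(i + k) % nx, (j + m) % ny] += a[i, j] * b[k, m]

assert np.array_equal(cyclic_mul_2d(a, b), direct)
```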
Suppose the Mersenne multiplications are length 262144.
We want to multiply two polynomials of degree 31 with
Mersenne coefficients, reducing the product modulo X^32 - 1
(32 is the `short' convolution length).
The time is about 262144*32*(18 + 5), since log2(262144) = 18 and
log2(32) = 5.
The time for a single Mersenne multiplication is about 262144*18.
So a convolution of length 32 costs about
32*(18 + 5)/18 ~= 41 times as much as a single multiplication.
The classical method for polynomial multiplication would
have a ratio of 1024 (= 32^2), and Karatsuba a ratio of 243 (= 3^5).
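The arithmetic above can be checked directly. Assuming a length-n FFT multiplication costs n*log2(n) unit operations (the model used throughout this message):

```python
from math import log2

n_mersenne = 262144   # transform length of one Mersenne multiplication
n_conv = 32           # `short' polynomial convolution length

single = n_mersenne * log2(n_mersenne)                # 262144 * 18
poly_fft = n_mersenne * n_conv * (log2(n_mersenne) + log2(n_conv))

print(round(poly_fft / single, 1))   # FFT ratio: 32*(18+5)/18 ~= 40.9
print(n_conv ** 2)                   # classical method: 1024
print(3 ** int(log2(n_conv)))        # Karatsuba: 3^5 = 243
```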
Yes, the convolution size is limited by available memory.
But memories are growing, and the code may not be ready for general
use until 2000.