On 03/04/2014, at 7:33 AM, Niels Möller <ni...@lysator.liu.se> wrote:

> Torbjorn Granlund <t...@gmplib.org> writes:
> 
>> ni...@lysator.liu.se (Niels Möller) writes:
>> 
>>  I don't remember any fundamental difficulties.
>> 
>> Except the 2^m-1 one.
> 
> If I try to remember, I think it is like this. The fft is basically
> polynomial multiplication mod x^{2^n} - 1 (and the variant with mod
> x^{2^n}+1 instead should be easy). 
> 
> Then to get to and from integers, we evaluate for x = 2^k, where k is
> same order of magnitude as the word size. This gives us integer
> multiplication mod 2^{k 2^n} - 1, and for most useful parameters we
> have 2^{k 2^n} = B^m where m is an integer divisible by a pretty
> large power of two. E.g., with B = 2^{64}, m = k 2^{n-6}.
> 
> At this level, the small prime thing is an implementation detail. We
> choose a few small primes p_i (as well as k, the size of the pieces),
> such that the prime product exceeds the largest possibly coefficient in
> the product polynomial, which is bounded by 2^{2k} times the degree of
> the smallest factor, and work with images of the mappings
> Z[x]/(x^{2^n}-1) --> Z_{p_i}[x]/(x^{2^n}-1), and these images are
> multiplied efficiently using FFT over the fields Z_{p_i}.

Hi,

This might be an appropriate juncture for me to mention an idea I had a year or 
two ago, which I have been using successfully in my small-prime FFT library, 
and which may or may not be useful in the context of GMP.

The basic question is how to do products modulo 2^m - 1 (or 2^m + 1) using a 
small-prime FFT.

Assume as above that m is divisible by a sufficiently large power of 2, so you 
can split into chunks of say b bits, and thus convert to polynomial 
multiplication in Z[x]/(x^n - 1), where n is a power of 2, and where the 
coefficients have b bits. Then you choose a bunch of small primes p_1, ..., p_r 
whose product has more than 2b + lg(n) bits. Then you use a small-prime FFT 
modulo each p_i.
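
To make the size bookkeeping concrete, here is a quick Python sketch (the 
parameter values are illustrative only, not tuned for anything):

    # Split a residue mod 2^m - 1 into n = m/b chunks of b bits. A cyclic
    # convolution of two such chunk vectors has coefficients below
    # n * 2^(2b), so the prime product needs more than 2b + lg(n) bits.
    m, b = 1 << 20, 16            # illustrative sizes
    n = m // b                    # transform length, a power of two here
    assert n & (n - 1) == 0
    coeff_bound = n << (2 * b)    # bound on product-polynomial coefficients
    print(coeff_bound.bit_length())   # 2b + lg(n) + 1 = 49 bits here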

The main problem with this approach is that there is not much granularity in 
the choice of r. You don't want r to get too large, because then the 
reduction/CRT steps can become quite expensive. In fact, you would actually 
prefer r to be quite small, and r should grow very slowly as a function of m. 
(I find typically r = 4 is pretty good.) The problem in the above sketch is 
that there is a big jump in running time as soon as b crosses a certain 
threshold where you need to add another prime.

One way to improve the "smoothness" is to allow non-power-of-two transform 
lengths, e.g. add in some radix-3 or radix-5 layers at the bottom. That way you 
get more flexibility with n, and thus you can tailor b to fit exactly into the 
desired number of primes. However, this introduces new problems: (a) more 
constraints on p_i, (b) a lot more work to code up the radix-3 and radix-5 
transforms (especially if you want to do everything in assembly), and (c) the 
radix-3 and radix-5 transforms can never compete with radix-2 anyway. I tried 
this approach, but I still got big "bumps".

Another solution, which doesn't work at all, is to use TFTs. One cannot use 
TFTs directly for multiplication mod x^n - 1, because they evaluate at the 
wrong roots of unity.

So after playing around with all of these ideas for a while, I came up with a 
different solution in my own code, which I have been quite happy with.

In most algorithms calling for multiplication modulo 2^m - 1 (or 2^m + 1), such 
as fast versions of certain newton iterations, it's actually not that important 
that the modulus has exactly this form. Sometimes you really do need the 
wraparound, but often all you need is SOME modulus M of size 2^m such that 
multiplication modulo M has cost M(m/2) rather than M(m) (writing M(t) for 
the cost of a t-bit multiplication).

I propose using a modulus of the form

M = (B^{k u_1} + 1) (B^{k u_2} + 1) ... (B^{k u_s} + 1),

where

B = 2^64
k is some integer
u_1 > u_2 > ... > u_s is a decreasing sequence of powers of 2.

Then M is roughly of size B^{ku}, where u = u_1 + ... + u_s.

Example: if u = 13, then M = (B^{8k} + 1)(B^{4k} + 1)(B^k + 1).
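
Concretely, the construction looks like this in Python (the value of k is 
illustrative; u is just decomposed into its binary expansion):

    # Build the "Fermat modulus" for u = 13 = 8 + 4 + 1.
    B, k = 1 << 64, 3
    u = 13
    us = [1 << i for i in range(u.bit_length() - 1, -1, -1) if (u >> i) & 1]
    assert us == [8, 4, 1]
    M = 1
    for ui in us:
        M *= B**(k * ui) + 1
    # M is only slightly larger than B^(k*u)
    assert B**(k * u) < M < B**(k * u + 1)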

I call such a modulus a "Fermat modulus", which is probably a terrible name, 
since I'm sure it's used for other things in the literature.

The scheme works as follows:

Suppose you want a modulus of size close to B^m.

Choose r = number of primes, any way you wish.

Choose u_1 (a power of two), which will control the overall granularity. 
(Something like u_1 = 64 works pretty well.)

Now we know u is going to be in the range u_1 <= u < 2 u_1.

Choose a corresponding k so that multiplication modulo B^{k u_1} + 1 can be 
done "optimally" for a small-prime FFT with r primes. Note that you have 
enormous flexibility in choosing k. You can thus ensure that b (the coefficient 
size) maps perfectly into the desired number of primes with no "wasted space".

Once we have chosen k, we can choose u_2, ..., u_s to make M pretty much 
exactly as big as we like, e.g. if u_1 = 64, then you get to choose M with less 
than 2% error.
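
A hedged sketch of that selection in Python (in reality k comes out of the 
FFT tuning, so that b fills the r primes exactly; here it is just an input, 
and the point is only how u_2, ..., u_s fall out of a binary expansion):

    # Pick u with u_1 <= u < 2*u_1 so that k*u is as close to m as
    # possible, then write u as u_1 plus distinct smaller powers of two.
    def choose_us(m, k, u1=64):
        u = max(u1, min(2 * u1 - 1, round(m / k)))
        us, rem = [u1], u - u1
        for i in range(u1.bit_length() - 2, -1, -1):
            if (rem >> i) & 1:
                us.append(1 << i)
        return us

    # e.g. choose_us(1000, 13) -> [64, 8, 4, 1], giving k*u = 1001,
    # within about 0.1% of the target m = 1000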

Now there are three subroutines:

REDUCE: takes as input an arbitrary integer, and computes its residues modulo 
B^{k u_1} + 1, ..., B^{k u_s} + 1. The point is that this can be done in 
*linear* time. If I recall correctly, the cost is bounded by roughly two 
mpn_add's of size M. I don't have the energy to write out the details here, but 
the idea is basically to repeatedly use identities like B^{2k} - 1 = (B^k - 
1)(B^k + 1), and only compute the parts you need. It ends up looking very 
similar to the top layers of the TFT algorithm, working on the input in chunks 
of k limbs.
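
To give the flavour, here is a naive per-modulus version in Python (this is 
NOT the shared TFT-like algorithm just described, which reuses work across 
the moduli; it only shows why each individual residue is linear-time):

    # Per-modulus REDUCE, for illustration only.
    def reduce_mod(x, B, j):
        base, sign, acc = B**j, 1, 0
        while x:
            x, chunk = divmod(x, base)   # peel off the low j limbs
            acc += sign * chunk          # B^j = -1 (mod B^j + 1)
            sign = -sign
        return acc % (base + 1)

    def reduce_all(x, B, k, us):
        return [reduce_mod(x, B, k * ui) for ui in us]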

MULTIPLY: multiplies modulo B^{k u_1} + 1, ..., B^{k u_s} + 1 separately, using 
small-prime FFTs. For the largest one (B^{k u_1} + 1), the coefficients will 
fit optimally into r primes. For the remaining ones, the coefficients will be 
slightly smaller, so you would prefer to use r - epsilon primes, but this is 
not a big deal, since they rapidly get very small.

(In my code, I have additional data structures allowing me to reuse transforms 
for each of these components.)

CRT: takes as input residues modulo B^{k u_1} + 1, ..., B^{k u_s} + 1, and 
returns the unique corresponding residue modulo M. This can also be done in 
linear time. It is only slightly slower than REDUCE. Again I am not going to 
write out the details here, but one can give an algorithm very similar to the 
inverse TFT; it looks like a bunch of cross-butterflies, operating on chunks of 
k limbs.
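
For illustration only, a textbook incremental CRT in Python (quadratic time, 
not the linear-time cross-butterfly version, but it computes the same 
residue; the components are pairwise coprime, since for even B the gcd of 
any two of them divides 2 and each B^{k u_i} + 1 is odd):

    # Generic CRT across the components (Python 3.8+ for pow(m, -1, p)).
    def crt(residues, moduli):
        x, m = 0, 1
        for r, p in zip(residues, moduli):
            x += m * (((r - x) * pow(m, -1, p)) % p)   # now x = r (mod p)
            m *= p
        return x                        # unique residue mod prod(moduli)

    B, k, us = 1 << 64, 3, [8, 4, 1]
    moduli = [B**(k * ui) + 1 for ui in us]
    x = 12345**67                       # any x below the product M
    assert crt([x % p for p in moduli], moduli) == x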

In my code I get very smooth performance using this algorithm. One quick 
example, where M(n) = time for n-limb multiplication, I(n) = time for n-limb 
reciprocal:

    n        M(n)/M(10^7)      I(n)/M(n)
  1.0e7         1.000             1.687
  1.1e7         1.095             1.707
  1.2e7         1.200             1.701
  1.3e7         1.300             1.695
  1.4e7         1.394             1.697
  1.5e7         1.587             1.537
  1.6e7         1.670             1.710
  1.7e7         1.779             1.703
  1.8e7         1.873             1.701
  1.9e7         1.987             1.685
  2.0e7         2.109             1.668

The multiplications use TFTs, not "Fermat moduli". The reciprocal uses an 
algorithm with theoretical cost (5/3) M(n), based on the Fermat moduli, and 
you can see I get very close to this theoretical bound. Everything here 
is with 4 primes.

The jump in M(n) at 1.5e7 is probably due to the TFT transform size crossing a 
power-of-two boundary. The jump in the reciprocal at 1.6e7 is probably the same 
effect. This is happening basically because multiplication is n log n, not 
simply n. These jumps could probably be mitigated to some extent; I have not 
optimised those TFT inner loops very much.

Note also that this is not toy code: at 10^7 limbs the multiplication is about 
1.5x faster than GMP 5.1.2, and the reciprocal is about 1.7x faster.

david
