Torbjorn Granlund <t...@gmplib.org> writes: > ni...@lysator.liu.se (Niels Möller) writes: > > I don't remember any fundamental difficulties. > > Except the 2^m-1 one.
If I try to remember, I think it is like this. The fft is basically polynomial multiplication mod x^{2^n} - 1 (and the variant with mod x^{2^n}+1 instead should be easy). Then to get to and from integers, we evaluate for x = 2^k, where k is same order of magnitude as the word size. This gives us integer multiplication mod 2^{k 2^n} - 1, and for most useful parameters we have 2^{k 2^n} = B^m where m is an integer and, divisible by a pretty large power of two. E.g, with B = 2^{64}, m = k 2^{n-6}. At this level, the small prime thing is an implementation detail. We choose a few small primes p_i (as well as k, the size of the pieces), such that the prime product exceeds the largest possibly coefficient in the product polynomial, which is bounded by 2^{2k} times the degree of the smallest factor, and work with images of the mappings Z[x]/(x^{2n}-1) --> Z_{p_i}[x](x^{2n}-1), and these images are multiplied efficiently using FFT over the fields Z_{p_i}. > * Tweaking the roots to compute mod 2^m+1 rather than 2^m-1 (since 2^m-1 > is better handled by splitting in mulmod_bnm1). > > Perhaps having both variants would be useful, a CRT of 2^m+1 and 2^m-1 > is particularly easy, and would perhaps result in faster general > multiply than mpn_mulmod_bnm1 approach of roughly 2^m+1, 2^{m/2}+1, > 2^{m/4}+1, ... My understanding is that if we want to compute mod 2^m-1 (even m), it's cheaper to do the two smaller computations mod 2^{m/2} \pm 1, plus the O(m) work for the CRT, than to do a single larger fft multiply. Sloppily speaking, the mulmod_bnm1 split wins because M(n) > 2 M(n/2) + O(m). If this is correct, a function for computing mod 2^m-1 using the fft directly is clearly pointless. (And then mulmod_bnm1 also uses the same split recursively, for the 2^{m/2} - 1 part, but with diminishing returns for each additional split). > * Implementing David Harvey's trick: slightly smaller primes, but > butterflies sped up from 9-10 or so cycles down to 6 (this was > benchmarked on Opteron, where 6 is the limit from multiplier > throughput). > > That indeed seems like something that ought to be done already in > version 1 of the checkin... Hopefully, the higher-level pieces of the implementation shouldn't be affected if/when we switch from 64-bit primes to 62-bits. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel