I've now committed the new code I mentioned to the repo. I didn't actually need to change anything to get the test code to pass after all! :-)
Bill. 2010/1/3 Bill Hart <[email protected]>: > 8. What remains to be done. > > Wow, I made it all the way to section 8! This is the most important > section, where I describe what remains, in order to make this > implementation really fast, clean and worth switching out the existing > implementation in MPIR. > > I'm very interested if people would like to volunteer to work on > various projects. They range from the completely trivial, which can be > carried out by anyone, right through to things that could turn into > research projects (some of the first four things below). > > This list applies to the version of the code that I am about to commit > to the repo, which is slightly modified from the version currently in > there. Specifically I'm talking about new_fft.c, i.e. the one that > doesn't need FLINT (which I don't see any point in continuing to > update - it's there for reference only). > > Clearly the high level ideas that need the most work are: > > 1) Implement the sqrt2 trick. In the ring Z/pZ where p = 2^B + 1 the > value 2^(2B/4)-2^(B/4) is a square root of 2. This means you can have > convolutions of length 4m instead of 2m by using these extra roots of > unity. It should be the case that these are only needed in the very > top layer of the FFT and IFFT. However one should still implement > these efficiently. Obviously one decomposes a multiplication by > (sqrt(2))^{2i+1} into 2^i*sqrt(2) (and clearly sqrt(2)^{2i} just > becomes 2^i which we can already do). > > The left hand half of the FFT butterfly is unaffected, but the right > hand half will need multiplication by sqrt. One can multiply by > 2^{B/4} first, combining this with the existing butterfly shifts. Then > one addmuls by a further 2^(B/4) for the other "leg" of the sqrt2. > Whilst this is less efficient than a standard butterfly it allows one > to double the convolution length, work with smaller coefficients in > the pointwise mults and the cost is only small, at the very top layer. > > For the IFFT one must do a similar thing, decomposing the > multiplication by sqrt2. There it is slightly more complex. The best > we can probably do is to apply the usual bitshifts first, then apply a > sqrt2 combined with any limb shifts, then do a sumdiff. That's three > operations, but without additional assembly functions I don't see how > to do better. At least any negations can be more easily combined with > the sumdiff. > > 2) Come up with a decent negacyclic convolution strategy. I think a > lot of work could be done on this, even research. It's currently > inefficient because we need to use a multiple of half the FFT length > for the coefficient bit size, which means too much padding. > > FLINT has some fairly intricate strategy which I barely understand. I > think it uses a naive negacyclic convolution with *coefficients* > modulo a power of the limb size, i.e. mod B^i for some small i and > combines it with the information you get from an FFT negacyclic > convolution with *coefficients* modulo some 2^b+1. It's still a > negacyclic convolution, so that still gives you the product mod 2^B+1 > that you are after. I'm not sure if that is the idea or not. I just > remembering it being a real headscrew for me when I tried to figure it > out. > > I don't think the FLINT implementation was ever carried through to its > logical conclusion however. It does work, but I don't know about the > efficiency. It is definitely more efficient than what I am currently > doing however, which is nothing. From memory the idea in FLINT was to > force the pointwise multiplication *of the FFT negacyclic convolution* > to be a multiple of 3 limbs, so that the identity 2^3B+1 = > (2^B+1)(2^2B-2^B+1) can be used, as explained in an earlier section. > > I never got through thinking about whether that strategy was also > worth pursuing at the FFT level itself and not just for the pointwise > mults. Is there a karatsuba and toom negacyclic convolution algorithm? > I don't know the answer to these questions. I didn't think about > whether there was. > > Anyhow, there is no requirement that we do what is done in FLINT, just > that we do something sensible. > > 3) Implement a transform with coefficients mod p = 2^{sB}-1. This > would still be a standard cyclic convolution, using an "ordinary FFT", > just that you are doing your reductions mod p with an addition instead > of a subtraction. This would require rewriting the butterflies to work > modulo this different p. I didn't check if there was a sqrt2 trick, > and certainly convolutions of only half the length can be supported. > But I am sure the savings in the pointwise multiplications will make > it worth implementing this trick. With truncation, efficiency becomes > much less of an issue. > > Now I proceed through new_fft.c one function at a time and mention > explicitly what needs to be done, while it is all still fresh on my > mind. I will number the ideas starting from 4, so that people can > volunteer to take them on, by number, if they should so choose. > > Feel free to add suggestions to the list. > > mpn_addmod_2expp1_1: > > 4) As noted in the code comments, this can probably be done faster in > the case where c is small (the generic case), whether positive or > negative. In fact it is probably only ever used in such cases. > > 5) The function should probably be renamed. The name doesn't make a > whole lot of sense. > > revbin: > > 6) This is very inefficient. I don't know an efficient algorithm, but > I am sure one exists. > > 7) The name should be changed to something more exotic to avoid > polluting namespace when linking with other libraries. > > FFT_split* : > > 8) We need a function which given the parameters currently sent to > FFT_split_bits and a coefficient number, just returns that single > coefficient. The reason for this is that splitting the integer up into > chunks (to the bit) can then be merged with the top layer of the FFT, > which has consequences for cache locality. In particular if you were > doing an MFA transform, you'd pull out only the coefficients needed > for each column FFT at a time, then immediately do that FFT. > > 9) Experiment with cache hints. On Opterons in particular we found > this to give a speedup in FLINT. The current code indicates where > cache hints should be used, though combine this strategy with 8. > > 10) Replace any ulong's which should really be mp_limb_t's so that > things will work on Windows. > > FFT_combine*: > > 11) Again it might be possible to combine just a single coefficient at > a time so that cache locality can be maintained for the MFA IFFT's. > One has to be slightly careful to avoid a quadratic run time, but for > unsigned coefficients I think it will be OK. One also needs to > selectively zero the limbs of the output integer in step with these > coefficient combines I would guess. So that step should probably be > combined with this somehow. Needs careful thought. > > 12) Pass in temporary space from the main integer multiplication routine. > > mpn_submod_i_2expp1: > > 13) Remove as it is only used in the split radix code, which is not utilised. > > mpn_lshB_sumdiffmod_2expp1: > mpn_sumdiff_rshBmod_2expp1: > > 14) The name's are stupid and should be changed. Too specific to be > mpn functions. > > 15) Is there some symmetry which can be utilised to make the functions > simpler. There seem to be a lot of cases. I don't think there is, as > one is doing a subtraction, however combining with a possible negation > without increasing the code complexity might be possible. Certainly > the case x = 0 is the (I)FFT butterfly case so surely some > consolidation is possible there. > > 16) Can the entire functions be done in assembly? Either way, the > overheads need to be reduced as much as possible, and we still need a > generic C implementation. > > mpn_mul_2expmod_2expp1: > mpn_div_2expmod_2expp1: > > 17) The first function should probably just deal with all cases, > whether bit shifts, or including limb shifts, positive or negative. I > ended up implementing FFT and negacyclic_twiddle functions because > this didn't support limb shifts, and these were needed in the > truncated FFT. So some rationalisation is possible there. Do away with > the mpn_div_2expmod_2expp1 function. Do right shift by d bits by doing > left shift by 2*n*w - d bits and make this function handle the > negation. But see how the FFT_twiddle and FFT_negacyclic_twiddle > functions are implemented which return an int to indicate if they did > their operation in-place or not to save copying data around when it is > not necessary. Perhaps just pass the temporary in as a parameter and > do the swap internal to the function. > > 18) Can this function be optimised in assembly? > > FFT_split_radix_butterfly: > FFT_split_radix_butterfly2: > > 19) Remove as no longer utilised. > > FFT_radix2_twiddle_butterfly: > FFT_radix2_twiddle_inverse_butterfly: > > 20) Surely this can be combined with the ordinary > (IF)FFT_radix2_butterfly functions. However note that it is better to > pass nw than to pass n and w separately as we do elsewhere. Perhaps > rename this parameter to nw or limbs, not NW. > > 21) The radix2 in the name is redundant, as we only implement radix 2, > and other radices are impractical. Omit the twiddle from the name. > > 22) The motif: > > b1 %= (2*NW); > if (b1 >= NW) > { > negate2 = 1; > b1 -= NW; > } > x = b1/GMP_LIMB_BITS; > b1 -= x*GMP_LIMB_BITS; > > occurs often in this and other functions, and should be adapted away > into a macro or something like that. It's not necessarily optimised > either. > > FFT_radix2_butterfly: > FFT_radix2_inverse_butterfly: > > 23) Surely better to pass nw and an actual bit shift rather than i, > since otherwise n*w needs to be computed over and over, or at least > i*w does. > > 24) In FFT_radix2_butterfly, I don't think i >= n is actually > possible, so remove that. > > FFT_split_radix: > > 25) Remove, as it is not utilised. > > FFT_radix2: > > 26) Remove as it is no longer used. > > FFT_radix2_new: > FFT_radix2_twiddle: > IFFT_radix2_new: > IFFT_radix2_twiddle: > > 27) radix2 is superfluous in the name, as is the new. > > 28) combine the non-twiddle functions with the twiddle versions. > > 29) I don't believe the result stride rs is ever used, and we always > do the FFT and IFFT in-place, so remove rr and rs entirely. > > 30) It is no longer necessary to pass temp. It isn't used. > > 31) Can t1 and t2 be combined without a loss of speed? Please > benchmark and be *absolutely sure*. > > 32) The motif: > > ptr = rr[0]; > rr[0] = *t1; > *t1 = ptr; > > occurs frequently here and elsewhere and should definitely be adapted > away into a pointer swap macro, or use one of the already provided > MPIR macros for this. > > FFT_negacyclic_twiddle: > FFT_twiddle: > > 33) Remove, see 17. However note that an indication of how to combine > the negation is provided in commented out code (which is not tested). > > FFT_radix2_truncate1: > FFT_radix2_truncate: > FFT_radix2_truncate1_twiddle: > FFT_radix2_truncate_twiddle: > IFFT_radix2_truncate1: > IFFT_radix2_truncate1_twiddle: > IFFT_radix2_truncate: > IFFT_radix2_truncate1_twiddle: > > 34) Combine non-twiddle and twiddle versions. Remove radix2 from name. > > 35) Remove rr and rs. Do in-place. > > 36) Correct code comments which are cut and pasted from elsewhere. > > 37) Write special cases for n = 1, 2 to remove the unnecessary > restriction that trunc be divisible by 2. Remember to remove the > restriction in the multiplication code which sets trunc to a multiple > of 2*sqrt instead of just a multiple of sqrt. > > FFT_radix2_negacyclic: > IFFT_radix2_negacyclic: > > 38) Remove rr and rs. Always work in-place. > > 39) Do not pass temp. > > FFT_radix2_mfa: > FFT_radix2_mfa_truncate: > IFFT_radix2_mfa: > IFFT_radix2_mfa_truncate: > > 40) Do not pass temp. > > 41) Combine truncate and non-truncate versions. Actually the > non-truncate versions are not really needed. > > FFT_mulmod_2expp1: > > 42) Memory allocation should be done in a single block, or even passed > in as a temporary. > > 43) Clean up the code which handles the subtractions and make it cache > friendly. > > new_mpn_mulmod_2expp1: > > 44) Doesn't currently deal with the case that c != 0 when using the > negacyclic convolution (which it currently never uses anyway because > it is too inefficient), i.e. the case where one of the inputs is > 2^{n*w} so that it doesn't fit into {n*w}/GMP_LIMB_BITS limbs. > > 45) Doesn't currently return any carry, i.e. when the result is p-1 > there is a carry. > > new_mpn_mul_mfa: > > 46) In the memory allocation part, one shouldn't mix allocations of > arrays of mp_limb_t's and mp_limb_t *'s as this is not necessarily > safe on all machines. Or is it? > > General: > > 47) The code needs to be made Windows safe by avoiding variable > declarations except at the beginnings of blocks. > > 48) Some functions don't have adequate test functions. > > 49) I think the normalise function is correct, but the test code > doesn't check all cases. > > 50) The test code doesn't test the FFTs, IFFTs and multiplication > functions for lots of different sizes, especially different values of > n, w, and trunc. > > Bill. > > > > > > > > > > > > > > > > > 2010/1/3 Bill Hart <[email protected]>: >> 5. References. >> >> I am indebted to the people who prepared the following references: >> >> "Matters Computational: ideas, algorithms and source code", by J org >> Arndt, see http://www.jjj.de/fxt/fxtbook.pdf >> >> Specifically I used chapter 3. Note that I took very great care to not >> refer to the source code. I found the notation used in equations such >> as 19.2-6 very helpful, and this book was also the resource I used for >> a description of the Matrix Fourier Algorithm and the split radix >> algorithm (which did not actually end up being used by my integer >> multiplication implementation). >> >> "Primes numbers: a computational perspective", by Richard Crandall and >> Carl Pomerance, 2nd ed., 2005, Springer. >> >> Specifically I used this reference to figure out why my negacyclic >> convolution didn't work (I had forgotten that the coefficients would >> be signed). >> >> Two other references which I referred to, but which did not end up >> featuring in the implementation I settled on (so far), but to which I >> am grateful to the authors for providing are: >> >> "A GMP-based implementation of Schonhage-Strassen's Large Integer >> Multiplication Algorithm" by Pierrick Gaudry, Alexander Kruppa and >> Paul Zimmermann, ISAAC 2007 proceedings, pp 167-174. See >> http://www.loria.fr/~gaudry/publis/issac07.pdf >> >> and >> >> "Multidigit multiplication for mathematicians" by Dan Bernstein (to >> appear). see http://cr.yp.to/papers/m3.pdf >> >> Specifically this paper helped me sort out why my radix 2/3 >> implementation didn't work. I would also say that the algebraic way of >> thinking about the FFT which is advocated in this paper also helped >> when it came to thinking up a truncation strategy. I had been looking >> at this paper to work out why my radix 2/3 thing didn't work and went >> for a walk to town in frustration. By the time I got back I had >> basically figured out how to do truncation without radix 3. The >> initial thought that set it off was to think about the FFT as breaking >> into a cyclic and negacyclic half, as described in section 2 of these >> notes. >> >> Bill. >> >> 2010/1/3 Bill Hart <[email protected]>: >>> Hmm, so the only reason MPIR can be beating me for small >>> multiplications is general coding efficiency. My butterfly strategy >>> probably introduces overheads when the coefficients are small, i.e. >>> only a few limbs. >>> >>> The overheads could be lowered by having an iterative FFT for short >>> lengths. But we should worry about that later. It will complicate the >>> code too much and the existing code needs consolidation first. There's >>> a lot of code duplication currently. >>> >>> Bill. >>> >>> 2010/1/3 Bill Hart <[email protected]>: >>>> 2010/1/3 David Harvey <[email protected]>: >>>>> >>>>> >>>>> On Jan 3, 9:03 am, Bill Hart <[email protected]> wrote: >>>>> >>>>>> I figured I probably did misunderstand. What I think they are trying >>>>>> to get at is that one can use coefficients modulo 2^N+1 except at >>>>>> certain levels because the roots of unity are actually the same. But >>>>>> then the discussion becomes one of combining a Fermat and Mersenne >>>>>> transform, which is the basic strategy that the implementation uses. I >>>>>> got confused. I still am. >>>>> >>>>> I think they are saying that to multiply two integers whose product >>>>> has N bits, choose a and b (as in Lemma 1 of their paper) such that a >>>>> + b ~ N and then multiply modulo 2^a - 1 (use Mersenne transform) and >>>>> modulo 2^b + 1 (use Fermat transform). The smoothness comes from the >>>>> freedom to choose a and b. >>>>> >>>> >>>> OK, I think I see. Their big N corresponds to an FFT for the original >>>> integers, which are multiplied mod 2^N+1 in the case of a Fermat >>>> transform or 2^N-1 in the case of a Mersenne transform. I think I >>>> would call these negacyclic and cyclic convolutions respectively. >>>> >>>> But then a Mersenne transform is nothing but an ordinary FFT, no? >>>> >>>> The idea of combining with a negacyclic FFT seems to be one of the >>>> strategies I discussed and discarded. >>>> >>>> I still don't understand what they mean by "power-of-2 roots of unity >>>> are needed only at the "lower level"". I understand that by the "lower >>>> level" they seem to mean the pointwise mults. But what is a power of 2 >>>> root of unity? Aren't they all powers of 2, except when using the >>>> sqrt2 trick. >>>> >>>> Maybe I don't understand the Schonhage-Strassen algorithm correctly. >>>> Perhaps they not only worked with coefficients mod 2^B+1 but also did >>>> the convolution mod x^m+1. They were worried about asymptotic >>>> complexity, not speed, so they wanted their algorithm to recurse so >>>> that the pointwise mults could be done using the same algorithm. Gosh >>>> that is exactly what the authors of this paper say! >>>> >>>> So my misunderstanding came from the fact that I assumed SS advocated >>>> a Mersenne transform, i.e. mod x^m-1, not a Fermat transform, mod >>>> x^m+1. >>>> >>>> Well, just to be clear. I worked mod x^m-1 because it is more >>>> efficient, and in a ring mod p where p = 2^B+1, for which one must >>>> perform a negacyclic convolution if one wishes to recurse, which >>>> involves twiddling to get an ordinary cyclic convolution. >>>> >>>> Anyhow, this is all a non-strategy once you have truncation. All that >>>> is required is to implement a single transform which works with >>>> coefficients mod 2^B-1, which I was formerly calling a Mersenne ring, >>>> when the large integers being multiplied are "small". Then you don't >>>> want to recurse for doing the pointwise mults and so the pointwise >>>> mults can be done mod 2^B/2-1 and 2^B/2+1 using our mpn_mulmod_2expp1 >>>> and mpn_mulmod_2expm1, which can do whatever is most efficient. >>>> >>>> Bill. >>>> >>> >> > -- You received this message because you are subscribed to the Google Groups "mpir-devel" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en.
