Re: [mpir-devel] Re: A new FFT for the New Year

Bill Hart Sun, 03 Jan 2010 12:45:09 -0800

I've now committed the new code I mentioned to the repo. I didn't
actually need to change anything to get the test code to pass after
all! :-)


Bill.

2010/1/3 Bill Hart <[email protected]>:
> 8. What remains to be done.
>
> Wow, I made it all the way to section 8! This is the most important
> section, where I describe what remains, in order to make this
> implementation really fast, clean and worth switching out the existing
> implementation in MPIR.
>
> I'm very interested if people would like to volunteer to work on
> various projects. They range from the completely trivial, which can be
> carried out by anyone, right through to things that could turn into
> research projects (some of the first four things below).
>
> This list applies to the version of the code that I am about to commit
> to the repo, which is slightly modified from the version currently in
> there. Specifically I'm talking about new_fft.c, i.e. the one that
> doesn't need FLINT (which I don't see any point in continuing to
> update - it's there for reference only).
>
> Clearly the high level ideas that need the most work are:
>
> 1) Implement the sqrt2 trick. In the ring Z/pZ where p = 2^B + 1 the
> value 2^(2B/4)-2^(B/4) is a square root of 2. This means you can have
> convolutions of length 4m instead of 2m by using these extra roots of
> unity. It should be the case that these are only needed in the very
> top layer of the FFT and IFFT. However one should still implement
> these efficiently. Obviously one decomposes a multiplication by
> (sqrt(2))^{2i+1} into 2^i*sqrt(2) (and clearly sqrt(2)^{2i} just
> becomes 2^i which we can already do).
>
> The left hand half of the FFT butterfly is unaffected, but the right
> hand half will need multiplication by sqrt. One can multiply by
> 2^{B/4} first, combining this with the existing butterfly shifts. Then
> one addmuls by a further 2^(B/4) for the other "leg" of the sqrt2.
> Whilst this is less efficient than a standard butterfly it allows one
> to double the convolution length, work with smaller coefficients in
> the pointwise mults and the cost is only small, at the very top layer.
>
> For the IFFT one must do a similar thing, decomposing the
> multiplication by sqrt2. There it is slightly more complex. The best
> we can probably do is to apply the usual bitshifts first, then apply a
> sqrt2 combined with any limb shifts, then do a sumdiff. That's three
> operations, but without additional assembly functions I don't see how
> to do better. At least any negations can be more easily combined with
> the sumdiff.
>
> 2) Come up with a decent negacyclic convolution strategy. I think a
> lot of work could be done on this, even research. It's currently
> inefficient because we need to use a multiple of half the FFT length
> for the coefficient bit size, which means too much padding.
>
> FLINT has some fairly intricate strategy which I barely understand. I
> think it uses a naive negacyclic convolution with *coefficients*
> modulo a power of the limb size, i.e. mod B^i for some small i and
> combines it with the information you get from an FFT negacyclic
> convolution with *coefficients* modulo some 2^b+1. It's still a
> negacyclic convolution, so that still gives you the product mod 2^B+1
> that you are after. I'm not sure if that is the idea or not. I just
> remembering it being a real headscrew for me when I tried to figure it
> out.
>
> I don't think the FLINT implementation was ever carried through to its
> logical conclusion however. It does work, but I don't know about the
> efficiency. It is definitely more efficient than what I am currently
> doing however, which is nothing. From memory the idea in FLINT was to
> force the pointwise multiplication *of the FFT negacyclic convolution*
> to be a multiple of 3 limbs, so that the identity 2^3B+1 =
> (2^B+1)(2^2B-2^B+1) can be used, as explained in an earlier section.
>
> I never got through thinking about whether that strategy was also
> worth pursuing at the FFT level itself and not just for the pointwise
> mults. Is there a karatsuba and toom negacyclic convolution algorithm?
> I don't know the answer to these questions. I didn't think about
> whether there was.
>
> Anyhow, there is no requirement that we do what is done in FLINT, just
> that we do something sensible.
>
> 3) Implement a transform with coefficients mod p = 2^{sB}-1. This
> would still be a standard cyclic convolution, using an "ordinary FFT",
> just that you are doing your reductions mod p with an addition instead
> of a subtraction. This would require rewriting the butterflies to work
> modulo this different p. I didn't check if there was a sqrt2 trick,
> and certainly convolutions of only half the length can be supported.
> But I am sure the savings in the pointwise multiplications will make
> it worth implementing this trick. With truncation, efficiency becomes
> much less of an issue.
>
> Now I proceed through new_fft.c one function at a time and mention
> explicitly what needs to be done, while it is all still fresh on my
> mind. I will number the ideas starting from 4, so that people can
> volunteer to take them on, by number, if they should so choose.
>
> Feel free to add suggestions to the list.
>
> mpn_addmod_2expp1_1:
>
> 4) As noted in the code comments, this can probably be done faster in
> the case where c is small (the generic case), whether positive or
> negative. In fact it is probably only ever used in such cases.
>
> 5) The function should probably be renamed. The name doesn't make a
> whole lot of sense.
>
> revbin:
>
> 6) This is very inefficient. I don't know an efficient algorithm, but
> I am sure one exists.
>
> 7) The name should be changed to something more exotic to avoid
> polluting namespace when linking with other libraries.
>
> FFT_split* :
>
> 8) We need a function which given the parameters currently sent to
> FFT_split_bits and a coefficient number, just returns that single
> coefficient. The reason for this is that splitting the integer up into
> chunks (to the bit) can then be merged with the top layer of the FFT,
> which has consequences for cache locality. In particular if you were
> doing an MFA transform, you'd pull out only the coefficients needed
> for each column FFT at a time, then immediately do that FFT.
>
> 9) Experiment with cache hints. On Opterons in particular we found
> this to give a speedup in FLINT. The current code indicates where
> cache hints should be used, though combine this strategy with 8.
>
> 10) Replace any ulong's which should really be mp_limb_t's so that
> things will work on Windows.
>
> FFT_combine*:
>
> 11) Again it might be possible to combine just a single coefficient at
> a time so that cache locality can be maintained for the MFA IFFT's.
> One has to be slightly careful to avoid a quadratic run time, but for
> unsigned coefficients I think it will be OK. One also needs to
> selectively zero the limbs of the output integer in step with these
> coefficient combines I would guess. So that step should probably be
> combined with this somehow. Needs careful thought.
>
> 12) Pass in temporary space from the main integer multiplication routine.
>
> mpn_submod_i_2expp1:
>
> 13) Remove as it is only used in the split radix code, which is not utilised.
>
> mpn_lshB_sumdiffmod_2expp1:
> mpn_sumdiff_rshBmod_2expp1:
>
> 14) The name's are stupid and should be changed. Too specific to be
> mpn functions.
>
> 15) Is there some symmetry which can be utilised to make the functions
> simpler. There seem to be a lot of cases.  I don't think there is, as
> one is doing a subtraction, however combining with a possible negation
> without increasing the code complexity might be possible. Certainly
> the case x = 0 is the (I)FFT butterfly case so surely some
> consolidation is possible there.
>
> 16) Can the entire functions be done in assembly? Either way, the
> overheads need to be reduced as much as possible, and we still need a
> generic C implementation.
>
> mpn_mul_2expmod_2expp1:
> mpn_div_2expmod_2expp1:
>
> 17) The first function should probably just deal with all cases,
> whether bit shifts, or including limb shifts, positive or negative. I
> ended up implementing FFT and negacyclic_twiddle functions because
> this didn't support limb shifts, and these were needed in the
> truncated FFT. So some rationalisation is possible there. Do away with
> the mpn_div_2expmod_2expp1 function. Do right shift by d bits by doing
> left shift by 2*n*w - d bits and make this function handle the
> negation. But see how the FFT_twiddle and FFT_negacyclic_twiddle
> functions are implemented which return an int to indicate if they did
> their operation in-place or not to save copying data around when it is
> not necessary. Perhaps just pass the temporary in as a parameter and
> do the swap internal to the function.
>
> 18) Can this function be optimised in assembly?
>
> FFT_split_radix_butterfly:
> FFT_split_radix_butterfly2:
>
> 19) Remove as no longer utilised.
>
> FFT_radix2_twiddle_butterfly:
> FFT_radix2_twiddle_inverse_butterfly:
>
> 20) Surely this can be combined with the ordinary
> (IF)FFT_radix2_butterfly functions. However note that it is better to
> pass nw than to pass n and w separately as we do elsewhere. Perhaps
> rename this parameter to nw or limbs, not NW.
>
> 21) The radix2 in the name is redundant, as we only implement radix 2,
> and other radices are impractical. Omit the twiddle from the name.
>
> 22) The motif:
>
>   b1 %= (2*NW);
>   if (b1 >= NW)
>   {
>      negate2 = 1;
>      b1 -= NW;
>   }
>   x = b1/GMP_LIMB_BITS;
>   b1 -= x*GMP_LIMB_BITS;
>
> occurs often in this and other functions, and should be adapted away
> into a macro or something like that. It's not necessarily optimised
> either.
>
> FFT_radix2_butterfly:
> FFT_radix2_inverse_butterfly:
>
> 23) Surely better to pass nw and an actual bit shift rather than i,
> since otherwise n*w needs to be computed over and over, or at least
> i*w does.
>
> 24) In FFT_radix2_butterfly, I don't think i >= n is actually
> possible, so remove that.
>
> FFT_split_radix:
>
> 25) Remove, as it is not utilised.
>
> FFT_radix2:
>
> 26) Remove as it is no longer used.
>
> FFT_radix2_new:
> FFT_radix2_twiddle:
> IFFT_radix2_new:
> IFFT_radix2_twiddle:
>
> 27) radix2 is superfluous in the name, as is the new.
>
> 28) combine the non-twiddle functions with the twiddle versions.
>
> 29) I don't believe the result stride rs is ever used, and we always
> do the FFT and IFFT in-place, so remove rr and rs entirely.
>
> 30) It is no longer necessary to pass temp. It isn't used.
>
> 31) Can t1 and t2 be combined without a loss of speed? Please
> benchmark and be *absolutely sure*.
>
> 32) The motif:
>
>      ptr = rr[0];
>      rr[0] = *t1;
>      *t1 = ptr;
>
> occurs frequently here and elsewhere and should definitely be adapted
> away into a pointer swap macro, or use one of the already provided
> MPIR macros for this.
>
> FFT_negacyclic_twiddle:
> FFT_twiddle:
>
> 33) Remove, see 17. However note that an indication of how to combine
> the negation is provided in commented out code (which is not tested).
>
> FFT_radix2_truncate1:
> FFT_radix2_truncate:
> FFT_radix2_truncate1_twiddle:
> FFT_radix2_truncate_twiddle:
> IFFT_radix2_truncate1:
> IFFT_radix2_truncate1_twiddle:
> IFFT_radix2_truncate:
> IFFT_radix2_truncate1_twiddle:
>
> 34) Combine non-twiddle and twiddle versions. Remove radix2 from name.
>
> 35) Remove rr and rs. Do in-place.
>
> 36) Correct code comments which are cut and pasted from elsewhere.
>
> 37) Write special cases for n = 1, 2 to remove the unnecessary
> restriction that trunc be divisible by 2. Remember to remove the
> restriction in the multiplication code which sets trunc to a multiple
> of 2*sqrt instead of just a multiple of sqrt.
>
> FFT_radix2_negacyclic:
> IFFT_radix2_negacyclic:
>
> 38) Remove rr and rs. Always work in-place.
>
> 39) Do not pass temp.
>
> FFT_radix2_mfa:
> FFT_radix2_mfa_truncate:
> IFFT_radix2_mfa:
> IFFT_radix2_mfa_truncate:
>
> 40) Do not pass temp.
>
> 41) Combine truncate and non-truncate versions. Actually the
> non-truncate versions are not really needed.
>
> FFT_mulmod_2expp1:
>
> 42) Memory allocation should be done in a single block, or even passed
> in as a temporary.
>
> 43) Clean up the code which handles the subtractions and make it cache
> friendly.
>
> new_mpn_mulmod_2expp1:
>
> 44) Doesn't currently deal with the case that c != 0 when using the
> negacyclic convolution (which it currently never uses anyway because
> it is too inefficient), i.e. the case where one of the inputs is
> 2^{n*w} so that it doesn't fit into {n*w}/GMP_LIMB_BITS limbs.
>
> 45) Doesn't currently return any carry, i.e. when the result is p-1
> there is a carry.
>
> new_mpn_mul_mfa:
>
> 46) In the memory allocation part, one shouldn't mix allocations of
> arrays of mp_limb_t's and mp_limb_t *'s as this is not necessarily
> safe on all machines. Or is it?
>
> General:
>
> 47) The code needs to be made Windows safe by avoiding variable
> declarations except at the beginnings of blocks.
>
> 48) Some functions don't have adequate test functions.
>
> 49) I think the normalise function is correct, but the test code
> doesn't check all cases.
>
> 50) The test code doesn't test the FFTs, IFFTs and multiplication
> functions for lots of different sizes, especially different values of
> n, w, and trunc.
>
> Bill.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> 2010/1/3 Bill Hart <[email protected]>:
>> 5. References.
>>
>> I am indebted to the people who prepared the following references:
>>
>> "Matters Computational: ideas, algorithms and source code", by J org
>> Arndt, see http://www.jjj.de/fxt/fxtbook.pdf
>>
>> Specifically I used chapter 3. Note that I took very great care to not
>> refer to the source code. I found the notation used in equations such
>> as 19.2-6 very helpful, and this book was also the resource I used for
>> a description of the Matrix Fourier Algorithm and the split radix
>> algorithm (which did not actually end up being used by my integer
>> multiplication implementation).
>>
>> "Primes numbers: a computational perspective", by Richard Crandall and
>> Carl Pomerance, 2nd ed., 2005, Springer.
>>
>> Specifically I used this reference to figure out why my negacyclic
>> convolution didn't work (I had forgotten that the coefficients would
>> be signed).
>>
>> Two other references which I referred to, but which did not end up
>> featuring in the implementation I settled on (so far), but to which I
>> am grateful to the authors for providing are:
>>
>> "A GMP-based implementation of Schonhage-Strassen's Large Integer
>> Multiplication Algorithm" by Pierrick Gaudry, Alexander Kruppa and
>> Paul Zimmermann, ISAAC 2007 proceedings, pp 167-174. See
>> http://www.loria.fr/~gaudry/publis/issac07.pdf
>>
>> and
>>
>> "Multidigit multiplication for mathematicians" by Dan Bernstein (to
>> appear). see http://cr.yp.to/papers/m3.pdf
>>
>> Specifically this paper helped me sort out why my radix 2/3
>> implementation didn't work. I would also say that the algebraic way of
>> thinking about the FFT which is advocated in this paper also helped
>> when it came to thinking up a truncation strategy. I had been looking
>> at this paper to work out why my radix 2/3 thing didn't work and went
>> for a walk to town in frustration. By the time I got back I had
>> basically figured out how to do truncation without radix 3. The
>> initial thought that set it off was to think about the FFT as breaking
>> into a cyclic and negacyclic half, as described in section 2 of these
>> notes.
>>
>> Bill.
>>
>> 2010/1/3 Bill Hart <[email protected]>:
>>> Hmm, so the only reason MPIR can be beating me for small
>>> multiplications is general coding efficiency. My butterfly strategy
>>> probably introduces overheads when the coefficients are small, i.e.
>>> only a few limbs.
>>>
>>> The overheads could be lowered by having an iterative FFT for short
>>> lengths. But we should worry about that later. It will complicate the
>>> code too much and the existing code needs consolidation first. There's
>>> a lot of code duplication currently.
>>>
>>> Bill.
>>>
>>> 2010/1/3 Bill Hart <[email protected]>:
>>>> 2010/1/3 David Harvey <[email protected]>:
>>>>>
>>>>>
>>>>> On Jan 3, 9:03 am, Bill Hart <[email protected]> wrote:
>>>>>
>>>>>> I figured I probably did misunderstand. What I think they are trying
>>>>>> to get at is that one can use coefficients modulo 2^N+1 except at
>>>>>> certain levels because the roots of unity are actually the same. But
>>>>>> then the discussion becomes one of combining a Fermat and Mersenne
>>>>>> transform, which is the basic strategy that the implementation uses. I
>>>>>> got confused. I still am.
>>>>>
>>>>> I think they are saying that to multiply two integers whose product
>>>>> has N bits, choose a and b (as in Lemma 1 of their paper) such that a
>>>>> + b ~ N and then multiply modulo 2^a - 1 (use Mersenne transform) and
>>>>> modulo 2^b + 1 (use Fermat transform). The smoothness comes from the
>>>>> freedom to choose a and b.
>>>>>
>>>>
>>>> OK, I think I see. Their big N corresponds to an FFT for the original
>>>> integers, which are multiplied mod 2^N+1 in the case of a Fermat
>>>> transform or 2^N-1 in the case of a Mersenne transform. I think I
>>>> would call these negacyclic and cyclic convolutions respectively.
>>>>
>>>> But then a Mersenne transform is nothing but an ordinary FFT, no?
>>>>
>>>> The idea of combining with a negacyclic FFT seems to be one of the
>>>> strategies I discussed and discarded.
>>>>
>>>> I still don't understand what they mean by "power-of-2 roots of unity
>>>> are needed only at the "lower level"". I understand that by the "lower
>>>> level" they seem to mean the pointwise mults. But what is a power of 2
>>>> root of unity? Aren't they all powers of 2, except when using the
>>>> sqrt2 trick.
>>>>
>>>> Maybe I don't understand the Schonhage-Strassen algorithm correctly.
>>>> Perhaps they not only worked with coefficients mod 2^B+1 but also did
>>>> the convolution mod x^m+1. They were worried about asymptotic
>>>> complexity, not speed, so they wanted their algorithm to recurse so
>>>> that the pointwise mults could be done using the same algorithm. Gosh
>>>> that is exactly what the authors of this paper say!
>>>>
>>>> So my misunderstanding came from the fact that I assumed SS advocated
>>>> a Mersenne transform, i.e. mod x^m-1, not a Fermat transform, mod
>>>> x^m+1.
>>>>
>>>> Well, just to be clear. I worked mod x^m-1 because it is more
>>>> efficient, and in a ring mod p where p = 2^B+1, for which one must
>>>> perform a negacyclic convolution if one wishes to recurse, which
>>>> involves twiddling to get an ordinary cyclic convolution.
>>>>
>>>> Anyhow, this is all a non-strategy once you have truncation. All that
>>>> is required is to implement a single transform which works with
>>>> coefficients mod 2^B-1, which I was formerly calling a Mersenne ring,
>>>> when the large integers being multiplied are "small". Then you don't
>>>> want to recurse for doing the pointwise mults and so the pointwise
>>>> mults can be done mod 2^B/2-1 and 2^B/2+1 using our mpn_mulmod_2expp1
>>>> and mpn_mulmod_2expm1, which can do whatever is most efficient.
>>>>
>>>> Bill.
>>>>
>>>
>>
>

--

You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en.

Re: [mpir-devel] Re: A new FFT for the New Year

Reply via email to