Re: [mpir-devel] Re: A new FFT for the New Year

Bill Hart Sun, 03 Jan 2010 12:24:14 -0800

8. What remains to be done.

Wow, I made it all the way to section 8! This is the most important
section, where I describe what remains, in order to make this
implementation really fast, clean and worth switching out the existing
implementation in MPIR.

I'm very interested if people would like to volunteer to work on
various projects. They range from the completely trivial, which can be
carried out by anyone, right through to things that could turn into
research projects (some of the first four things below).

This list applies to the version of the code that I am about to commit
to the repo, which is slightly modified from the version currently in
there. Specifically I'm talking about new_fft.c, i.e. the one that
doesn't need FLINT (which I don't see any point in continuing to
update - it's there for reference only).

Clearly the high level ideas that need the most work are:

1) Implement the sqrt2 trick. In the ring Z/pZ where p = 2^B + 1 the
value 2^(2B/4)-2^(B/4) is a square root of 2. This means you can have
convolutions of length 4m instead of 2m by using these extra roots of
unity. It should be the case that these are only needed in the very
top layer of the FFT and IFFT. However one should still implement
these efficiently. Obviously one decomposes a multiplication by
(sqrt(2))^{2i+1} into 2^i*sqrt(2) (and clearly sqrt(2)^{2i} just
becomes 2^i which we can already do).

The left hand half of the FFT butterfly is unaffected, but the right
hand half will need multiplication by sqrt. One can multiply by
2^{B/4} first, combining this with the existing butterfly shifts. Then
one addmuls by a further 2^(B/4) for the other "leg" of the sqrt2.
Whilst this is less efficient than a standard butterfly it allows one
to double the convolution length, work with smaller coefficients in
the pointwise mults and the cost is only small, at the very top layer.

For the IFFT one must do a similar thing, decomposing the
multiplication by sqrt2. There it is slightly more complex. The best
we can probably do is to apply the usual bitshifts first, then apply a
sqrt2 combined with any limb shifts, then do a sumdiff. That's three
operations, but without additional assembly functions I don't see how
to do better. At least any negations can be more easily combined with
the sumdiff.

2) Come up with a decent negacyclic convolution strategy. I think a
lot of work could be done on this, even research. It's currently
inefficient because we need to use a multiple of half the FFT length
for the coefficient bit size, which means too much padding.

FLINT has some fairly intricate strategy which I barely understand. I
think it uses a naive negacyclic convolution with *coefficients*
modulo a power of the limb size, i.e. mod B^i for some small i and
combines it with the information you get from an FFT negacyclic
convolution with *coefficients* modulo some 2^b+1. It's still a
negacyclic convolution, so that still gives you the product mod 2^B+1
that you are after. I'm not sure if that is the idea or not. I just
remembering it being a real headscrew for me when I tried to figure it
out.

I don't think the FLINT implementation was ever carried through to its
logical conclusion however. It does work, but I don't know about the
efficiency. It is definitely more efficient than what I am currently
doing however, which is nothing. From memory the idea in FLINT was to
force the pointwise multiplication *of the FFT negacyclic convolution*
to be a multiple of 3 limbs, so that the identity 2^3B+1 =
(2^B+1)(2^2B-2^B+1) can be used, as explained in an earlier section.

I never got through thinking about whether that strategy was also
worth pursuing at the FFT level itself and not just for the pointwise
mults. Is there a karatsuba and toom negacyclic convolution algorithm?
I don't know the answer to these questions. I didn't think about
whether there was.

Anyhow, there is no requirement that we do what is done in FLINT, just
that we do something sensible.

3) Implement a transform with coefficients mod p = 2^{sB}-1. This
would still be a standard cyclic convolution, using an "ordinary FFT",
just that you are doing your reductions mod p with an addition instead
of a subtraction. This would require rewriting the butterflies to work
modulo this different p. I didn't check if there was a sqrt2 trick,
and certainly convolutions of only half the length can be supported.
But I am sure the savings in the pointwise multiplications will make
it worth implementing this trick. With truncation, efficiency becomes
much less of an issue.

Now I proceed through new_fft.c one function at a time and mention
explicitly what needs to be done, while it is all still fresh on my
mind. I will number the ideas starting from 4, so that people can
volunteer to take them on, by number, if they should so choose.

Feel free to add suggestions to the list.

mpn_addmod_2expp1_1:

4) As noted in the code comments, this can probably be done faster in
the case where c is small (the generic case), whether positive or
negative. In fact it is probably only ever used in such cases.

5) The function should probably be renamed. The name doesn't make a
whole lot of sense.

revbin:

6) This is very inefficient. I don't know an efficient algorithm, but
I am sure one exists.

7) The name should be changed to something more exotic to avoid
polluting namespace when linking with other libraries.

FFT_split* :

8) We need a function which given the parameters currently sent to
FFT_split_bits and a coefficient number, just returns that single
coefficient. The reason for this is that splitting the integer up into
chunks (to the bit) can then be merged with the top layer of the FFT,
which has consequences for cache locality. In particular if you were
doing an MFA transform, you'd pull out only the coefficients needed
for each column FFT at a time, then immediately do that FFT.

9) Experiment with cache hints. On Opterons in particular we found
this to give a speedup in FLINT. The current code indicates where
cache hints should be used, though combine this strategy with 8.

10) Replace any ulong's which should really be mp_limb_t's so that
things will work on Windows.

FFT_combine*:

11) Again it might be possible to combine just a single coefficient at
a time so that cache locality can be maintained for the MFA IFFT's.
One has to be slightly careful to avoid a quadratic run time, but for
unsigned coefficients I think it will be OK. One also needs to
selectively zero the limbs of the output integer in step with these
coefficient combines I would guess. So that step should probably be
combined with this somehow. Needs careful thought.

12) Pass in temporary space from the main integer multiplication routine.

mpn_submod_i_2expp1:

13) Remove as it is only used in the split radix code, which is not utilised.

mpn_lshB_sumdiffmod_2expp1:
mpn_sumdiff_rshBmod_2expp1:

14) The name's are stupid and should be changed. Too specific to be
mpn functions.

15) Is there some symmetry which can be utilised to make the functions
simpler. There seem to be a lot of cases.  I don't think there is, as
one is doing a subtraction, however combining with a possible negation
without increasing the code complexity might be possible. Certainly
the case x = 0 is the (I)FFT butterfly case so surely some
consolidation is possible there.

16) Can the entire functions be done in assembly? Either way, the
overheads need to be reduced as much as possible, and we still need a
generic C implementation.

mpn_mul_2expmod_2expp1:
mpn_div_2expmod_2expp1:

17) The first function should probably just deal with all cases,
whether bit shifts, or including limb shifts, positive or negative. I
ended up implementing FFT and negacyclic_twiddle functions because
this didn't support limb shifts, and these were needed in the
truncated FFT. So some rationalisation is possible there. Do away with
the mpn_div_2expmod_2expp1 function. Do right shift by d bits by doing
left shift by 2*n*w - d bits and make this function handle the
negation. But see how the FFT_twiddle and FFT_negacyclic_twiddle
functions are implemented which return an int to indicate if they did
their operation in-place or not to save copying data around when it is
not necessary. Perhaps just pass the temporary in as a parameter and
do the swap internal to the function.

18) Can this function be optimised in assembly?

FFT_split_radix_butterfly:
FFT_split_radix_butterfly2:

19) Remove as no longer utilised.

FFT_radix2_twiddle_butterfly:
FFT_radix2_twiddle_inverse_butterfly:

20) Surely this can be combined with the ordinary
(IF)FFT_radix2_butterfly functions. However note that it is better to
pass nw than to pass n and w separately as we do elsewhere. Perhaps
rename this parameter to nw or limbs, not NW.

21) The radix2 in the name is redundant, as we only implement radix 2,
and other radices are impractical. Omit the twiddle from the name.

22) The motif:

   b1 %= (2*NW);
   if (b1 >= NW)
   {
      negate2 = 1;
      b1 -= NW;
   }
   x = b1/GMP_LIMB_BITS;
   b1 -= x*GMP_LIMB_BITS;

occurs often in this and other functions, and should be adapted away
into a macro or something like that. It's not necessarily optimised
either.

FFT_radix2_butterfly:
FFT_radix2_inverse_butterfly:

23) Surely better to pass nw and an actual bit shift rather than i,
since otherwise n*w needs to be computed over and over, or at least
i*w does.

24) In FFT_radix2_butterfly, I don't think i >= n is actually
possible, so remove that.

FFT_split_radix:

25) Remove, as it is not utilised.

FFT_radix2:

26) Remove as it is no longer used.

FFT_radix2_new:
FFT_radix2_twiddle:
IFFT_radix2_new:
IFFT_radix2_twiddle:

27) radix2 is superfluous in the name, as is the new.

28) combine the non-twiddle functions with the twiddle versions.

29) I don't believe the result stride rs is ever used, and we always
do the FFT and IFFT in-place, so remove rr and rs entirely.

30) It is no longer necessary to pass temp. It isn't used.

31) Can t1 and t2 be combined without a loss of speed? Please
benchmark and be *absolutely sure*.

32) The motif:

      ptr = rr[0];
      rr[0] = *t1;
      *t1 = ptr;

occurs frequently here and elsewhere and should definitely be adapted
away into a pointer swap macro, or use one of the already provided
MPIR macros for this.

FFT_negacyclic_twiddle:
FFT_twiddle:

33) Remove, see 17. However note that an indication of how to combine
the negation is provided in commented out code (which is not tested).

FFT_radix2_truncate1:
FFT_radix2_truncate:
FFT_radix2_truncate1_twiddle:
FFT_radix2_truncate_twiddle:
IFFT_radix2_truncate1:
IFFT_radix2_truncate1_twiddle:
IFFT_radix2_truncate:
IFFT_radix2_truncate1_twiddle:

34) Combine non-twiddle and twiddle versions. Remove radix2 from name.

35) Remove rr and rs. Do in-place.

36) Correct code comments which are cut and pasted from elsewhere.

37) Write special cases for n = 1, 2 to remove the unnecessary
restriction that trunc be divisible by 2. Remember to remove the
restriction in the multiplication code which sets trunc to a multiple
of 2*sqrt instead of just a multiple of sqrt.

FFT_radix2_negacyclic:
IFFT_radix2_negacyclic:

38) Remove rr and rs. Always work in-place.

39) Do not pass temp.

FFT_radix2_mfa:
FFT_radix2_mfa_truncate:
IFFT_radix2_mfa:
IFFT_radix2_mfa_truncate:

40) Do not pass temp.

41) Combine truncate and non-truncate versions. Actually the
non-truncate versions are not really needed.

FFT_mulmod_2expp1:

42) Memory allocation should be done in a single block, or even passed
in as a temporary.

43) Clean up the code which handles the subtractions and make it cache
friendly.

new_mpn_mulmod_2expp1:

44) Doesn't currently deal with the case that c != 0 when using the
negacyclic convolution (which it currently never uses anyway because
it is too inefficient), i.e. the case where one of the inputs is
2^{n*w} so that it doesn't fit into {n*w}/GMP_LIMB_BITS limbs.

45) Doesn't currently return any carry, i.e. when the result is p-1
there is a carry.

new_mpn_mul_mfa:

46) In the memory allocation part, one shouldn't mix allocations of
arrays of mp_limb_t's and mp_limb_t *'s as this is not necessarily
safe on all machines. Or is it?

General:

47) The code needs to be made Windows safe by avoiding variable
declarations except at the beginnings of blocks.

48) Some functions don't have adequate test functions.

49) I think the normalise function is correct, but the test code
doesn't check all cases.

50) The test code doesn't test the FFTs, IFFTs and multiplication
functions for lots of different sizes, especially different values of
n, w, and trunc.

Bill.

2010/1/3 Bill Hart <[email protected]>:
> 5. References.
>
> I am indebted to the people who prepared the following references:
>
> "Matters Computational: ideas, algorithms and source code", by J org
> Arndt, see http://www.jjj.de/fxt/fxtbook.pdf
>
> Specifically I used chapter 3. Note that I took very great care to not
> refer to the source code. I found the notation used in equations such
> as 19.2-6 very helpful, and this book was also the resource I used for
> a description of the Matrix Fourier Algorithm and the split radix
> algorithm (which did not actually end up being used by my integer
> multiplication implementation).
>
> "Primes numbers: a computational perspective", by Richard Crandall and
> Carl Pomerance, 2nd ed., 2005, Springer.
>
> Specifically I used this reference to figure out why my negacyclic
> convolution didn't work (I had forgotten that the coefficients would
> be signed).
>
> Two other references which I referred to, but which did not end up
> featuring in the implementation I settled on (so far), but to which I
> am grateful to the authors for providing are:
>
> "A GMP-based implementation of Schonhage-Strassen's Large Integer
> Multiplication Algorithm" by Pierrick Gaudry, Alexander Kruppa and
> Paul Zimmermann, ISAAC 2007 proceedings, pp 167-174. See
> http://www.loria.fr/~gaudry/publis/issac07.pdf
>
> and
>
> "Multidigit multiplication for mathematicians" by Dan Bernstein (to
> appear). see http://cr.yp.to/papers/m3.pdf
>
> Specifically this paper helped me sort out why my radix 2/3
> implementation didn't work. I would also say that the algebraic way of
> thinking about the FFT which is advocated in this paper also helped
> when it came to thinking up a truncation strategy. I had been looking
> at this paper to work out why my radix 2/3 thing didn't work and went
> for a walk to town in frustration. By the time I got back I had
> basically figured out how to do truncation without radix 3. The
> initial thought that set it off was to think about the FFT as breaking
> into a cyclic and negacyclic half, as described in section 2 of these
> notes.
>
> Bill.
>
> 2010/1/3 Bill Hart <[email protected]>:
>> Hmm, so the only reason MPIR can be beating me for small
>> multiplications is general coding efficiency. My butterfly strategy
>> probably introduces overheads when the coefficients are small, i.e.
>> only a few limbs.
>>
>> The overheads could be lowered by having an iterative FFT for short
>> lengths. But we should worry about that later. It will complicate the
>> code too much and the existing code needs consolidation first. There's
>> a lot of code duplication currently.
>>
>> Bill.
>>
>> 2010/1/3 Bill Hart <[email protected]>:
>>> 2010/1/3 David Harvey <[email protected]>:
>>>>
>>>>
>>>> On Jan 3, 9:03 am, Bill Hart <[email protected]> wrote:
>>>>
>>>>> I figured I probably did misunderstand. What I think they are trying
>>>>> to get at is that one can use coefficients modulo 2^N+1 except at
>>>>> certain levels because the roots of unity are actually the same. But
>>>>> then the discussion becomes one of combining a Fermat and Mersenne
>>>>> transform, which is the basic strategy that the implementation uses. I
>>>>> got confused. I still am.
>>>>
>>>> I think they are saying that to multiply two integers whose product
>>>> has N bits, choose a and b (as in Lemma 1 of their paper) such that a
>>>> + b ~ N and then multiply modulo 2^a - 1 (use Mersenne transform) and
>>>> modulo 2^b + 1 (use Fermat transform). The smoothness comes from the
>>>> freedom to choose a and b.
>>>>
>>>
>>> OK, I think I see. Their big N corresponds to an FFT for the original
>>> integers, which are multiplied mod 2^N+1 in the case of a Fermat
>>> transform or 2^N-1 in the case of a Mersenne transform. I think I
>>> would call these negacyclic and cyclic convolutions respectively.
>>>
>>> But then a Mersenne transform is nothing but an ordinary FFT, no?
>>>
>>> The idea of combining with a negacyclic FFT seems to be one of the
>>> strategies I discussed and discarded.
>>>
>>> I still don't understand what they mean by "power-of-2 roots of unity
>>> are needed only at the "lower level"". I understand that by the "lower
>>> level" they seem to mean the pointwise mults. But what is a power of 2
>>> root of unity? Aren't they all powers of 2, except when using the
>>> sqrt2 trick.
>>>
>>> Maybe I don't understand the Schonhage-Strassen algorithm correctly.
>>> Perhaps they not only worked with coefficients mod 2^B+1 but also did
>>> the convolution mod x^m+1. They were worried about asymptotic
>>> complexity, not speed, so they wanted their algorithm to recurse so
>>> that the pointwise mults could be done using the same algorithm. Gosh
>>> that is exactly what the authors of this paper say!
>>>
>>> So my misunderstanding came from the fact that I assumed SS advocated
>>> a Mersenne transform, i.e. mod x^m-1, not a Fermat transform, mod
>>> x^m+1.
>>>
>>> Well, just to be clear. I worked mod x^m-1 because it is more
>>> efficient, and in a ring mod p where p = 2^B+1, for which one must
>>> perform a negacyclic convolution if one wishes to recurse, which
>>> involves twiddling to get an ordinary cyclic convolution.
>>>
>>> Anyhow, this is all a non-strategy once you have truncation. All that
>>> is required is to implement a single transform which works with
>>> coefficients mod 2^B-1, which I was formerly calling a Mersenne ring,
>>> when the large integers being multiplied are "small". Then you don't
>>> want to recurse for doing the pointwise mults and so the pointwise
>>> mults can be done mod 2^B/2-1 and 2^B/2+1 using our mpn_mulmod_2expp1
>>> and mpn_mulmod_2expm1, which can do whatever is most efficient.
>>>
>>> Bill.
>>>
>>
>

--

You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en.

Re: [mpir-devel] Re: A new FFT for the New Year

Reply via email to