4. The strategy used.

I'll now describe the strategy I ended up settling on. But first some
notation to make this simpler.

Suppose I have a convolution length 2*n where n is a power of 2. A
standard Fermat transform with coefficients mod p = 2^{w*n}+1 can
support such a transform length. Here 2^w is a 2n-th root of unity.

Note that multiplication by a root of unity is then just a shift by a
multiple of w bits.

To make things easier we assume that w*n is divisible by
GMP_LIMB_BITS. This means that reduction mod p just involves a
subtraction without any bitshifts needed. Only limb shifts are needed.
This is actually a non-condition, as n has to be a fairly high power
of 2 before doing an FFT would be efficient. For the sizes we consider
in practice, n is always divisible by GMP_LIMB_BITS.

Let's write _limbs_ for (w*n)/GMP_LIMB_BITS. So each coefficient mod p
can be stored in a block of _limbs_ consecutive limbs in memory. The
only exception to this is p-1 = 2^{w*n} which takes an extra bit. As
we did in FLINT, this FFT stores coefficients in blocks of _limbs_ + 1
consecutive twos complement limbs. The extra limb is used to
accumulate carries and so no reduction mod p actually needs to be done
after an addition or subtraction. At the end of the FFT and IFFT
stages we can do a single reduction mod p, saving time (hopefully).

I also store the coefficients with a set of pointers to each so that
coefficients can be swapped without copying the actual data around.

I pass in a couple of extra temporary coefficients for operations that
cannot be performed in place, such as limb shifts. I then just swap
them out when the operation is done. All of this is pretty standard
stuff by now.

So let's go through each of the important problems an FFT must solve
and see how it is done in this FFT.

1. Efficient butterflies.

Recall the FFT butterfly step:

[a{i}, b{i}] => [a{i}+b{i}, z^i*(a{i}-b{i})].

The idea is to combine as much as possible into one operation. The
addition and subtraction can be performed using one call to MPIR's
mpn_sumdiff, except for that pesky left shift by i*w bits (the
multiplication by z^i). To get around this, I decompose i*w into a bit
shift by b bits, where 0 <= b < GMP_LIMB_BITS and a limb shift by x
limbs, say. Now we can still use sumdiff, except that we need to shift
where the output of the subtraction goes by x limbs. Of course now
we'll overflow the second output by x limbs, as we've shifted by x
limbs. So we decompose the sumdiff into two parts, a part which won't
overflow, and a part consisting of x limbs which will overflow. The
part that overflows, we reduce mod p by changing its sign (simply swap
the order of a and b when supplying them to sumdiff) and placing the
output of the sumdiff in the low x limbs of the second output.

Finally we shift just the second output by the remaining b bits. The
only tricky issue is when i >= n in which case we need to reduce i by
n and negate the second output (as n*w = -1 mod p). At this stage I
have not combined this extra negation, but there is no reason why it
couldn't be combined, saving quite a lot of time, as it happens often.

Note that the sumdiff operation is quite cache friendly. Instead of
reading a and b twice, once for the addition and once for the
subtraction, it reads them only once. Thus when individual
coefficients do finally fall out of cache there should be an
advantage.

Also recall the FFT butterfly step:

[a{i}, b{i}] => [a{i}+z^-i*b{i}, a{i}-z^-i*b{i}].

Here we again decompose i*w into a bitshift b and limb shift x. We
first perform the bitshift of the second coefficient by b, in-place.
Then we do the same sumdiff trick as we did for the FFT butterfly, but
this time shifting the second *input* by x limbs. I.e. we start
reading the second input from limb x. We again have to split into two
parts, the part before the wraparound, consisting of _limbs_ - x limbs
and the part from the bottom of b{i}, which is x limbs.

It would be interesting to see if further assembly optimisation was
possible, e.g. by combining the negation and even the bitshift into
single operations, mpn_FFT_butterfly and mpn_IFFT_butterfly. The
speedup could be tremendous! I'm sure when Jason returns to work on
the assembly optimisation this will be of interest to him.

I recall David spent a lot of time thinking about combining operations
when he worked on the FLINT butterflies, although the same mpn
primitives were not available at that time.

2. Improving cache locality.

As indicated in the previous sets of notes, eventually the FFT becomes
so large that the data no longer fits in L1 cache, and eventually not
even in L2 cache.

The strategy I used to improve cache locality is called the Matrix
Fourier Algorithm. It works as follows:

   Suppose you want to do an FFT of length m = n1*n2:

   * Split the coefficients into R rows of C columns (here C = n1, R =
m/n1 = n2)
   * Perform a length R FFT on each column, i.e. with an input stride of n1
   * Multiply each coefficient by z^{r*c} where z = exp(2*Pi*I/m),
note z=>w bits
   * Perform a length C FFT on each row, i.e. with an input stride of 1
   * Transpose the matrix of coefficients

The IFFT is basically the complete reverse of this, using IFFT's
instead of FFT's.

Note that at no stage is an FFT or IFFT of length longer than the
longest of n1 and n2 performed. Thus if we set n1 and n2 to both be
around the sqrt(m) we radically reduce the maximum amount of data that
needs to be in cache.

Also note that as we are going to follow the FFT by an IFFT which does
the operations in reverse, there is absolutely no point in doing the
transposes for the FFT or IFFT. So these steps are omitted.

Also, we do not wish to go through all of the coefficients and apply
the twists by z^{r*c} as that would add another access to all the
data. So we combine the extra twiddles with the length R FFT's and
length R IFFT's. These can be combined with the butterflies at the
very bottom level of the FFT's and IFFT's and actually cost basically
nothing. These are just extra bit and limb shifts which we merge with
the ones we are already doing at these bottom level butterflies.

I actually tried doing a 3D Matrix Fourier implementation, but it
failed to give an improvement except for a very slight almost
unmeasurable one over a certain range of sizes. In general it made
things slower. It's trivial to do, it just doesn't seem to work. As
soon as things fall out of L2 cache at the top end on my machine,
things soon fall out of L1 cache at the bottom end and so 2D is
sufficient. It's also simpler to deal with the code this way, so I've
left it at a 2D MFA for now. On some platforms, perhaps this is not
optimal.

I am not clear on the distinction between the MFA and Bailey's 4-step.
They look isomorphic to me, and perhaps are.

One slight snafu is that the algorithm expects FFT's to output their
coefficients in reverse binary order (if you have a length 8 FFT for
example, 8 is 2^3, so coefficient 011 in 3 bits of binary will be
swapped with coefficient 110, etc.) To ensure that the order is
correct after the column FFT's and before the column IFFT's I just
revbin the coefficients, using pointer swapping.

3. Truncation.

We want to save as much zero padding as possible in the FFT's and
IFFT's. It would also be nice to do proportionally fewer pointwise
multiplications when some of our output coefficients are going to be
zero.

I imagine that the strategy I came up with is just van der Hoeven's
truncated FFT/IFFT, which is also what FLINT uses. However I haven't
read the paper and have not worked through the FLINT implementation of
this, so can't be sure at this point. I have technically seen David
give talks on it, but I think in each instance I was madly working
away on my laptop, so never absorbed any of the details.

First for the FFT. We start with an FFT some of whose coefficients are
zero (we don't bother writing out these zeros, we just know they are
conceptually there).

There are two cases: a) less than or equal to half of the coefficients
are non-zero and b) more than half the coefficients are non-zero:

a) A 0 0 0

b) A A A 0

In the first case, the first layer of the FFT would do nothing. As we
only care about the left half, we recurse on only the left half A 0,
ignoring the remaining zeros.

In the second case we compute the first layer of the FFT. We then do
an ordinary FFT on the left half and recurse with a truncated FFT on
the right half.

Of course we need to be careful in that the first layer of the FFT
will have replaced our zeroes with non-zero coefficients, so we don't
recurse to the above two cases.

We start instead with an FFT with non-zero coefficients (labelled B).

A B B B

or

A A A B

But the cases can be dealt with in a similar way to the cases where
there are zeros. The saving comes in that we repeatedly ignore
coefficients on the right hand side when they are all past the
truncation point.

Now the IFFT is slightly more involved. We know that we are going to
*end up with* zeroes on the right hand side. We start with the results
of the pointwise mults, though we do not perform all the pointwise
mults. If we are going to end up with c zeroes, we do not perform the
last c pointwise mults.

So we want our IFFT to get to

A A A 0

starting from

P P P ?

Again there are two cases, depending on how many zeros we will end up with:

a) A 0 0 0

b) A A A 0

In case (a) , by working backwards from what we know we will get, the
next to last level must be

A/2 0 (A/2)~ 0 where ~ is the opposite of the twiddle that will be
applied by the IFFT butterfly.

But I can compute the LHS, A/2 0, simply by recursing on the truncated
IFFT. Then it is just a matter of multiplying by 2 to get A 0 which is
what I was after.

In case (b) an ordinary IFFT can compute the left hand of the
penultimate layer, as we have all the necessary pointwise mults for
that.

A A A 0
B B ? ?

The right hand side we compute by recursing on the truncated IFFT. But
we don't know what the final question mark is. To get it we have to
reverse the steps of the IFFT to find it. As we have the second B we
can compute the second A simply by performing some IFFT butterflies.
Now we can compute the second ? by reversing the IFFT butterflies. So
we are left with:

A A' A 0'
B' B' ? C'

where I wrote a dash on the coefficients we actually now know.

Now we can recurse using the truncated IFFT on the right hand side.

Although the coefficients C' are not zero, the principles are the same
and we split into two cases as above.

This allows us to get the question mark, yielding:

A A' A 0'
B' B' C' C'

and clearly now we can compute the A's we don't know from the known
coefficients.

The final trick is how to combine this basic truncation strategy with
the Matrix Fourier Algorithm.

Here I decided to take a shortcut for simplicity. We simply truncate
the top level FFT's done in the MFA and set the truncation length to
be a multiple of the length of the lower level FFT's (doing a small
amount of zero padding if necessary), i.e. we truncate to a multiple
of the row length. When we are at the lower levels, computing the row
FFT's we just don't compute the FFT's which lie past the truncation
level.

This makes truncation slightly grainy in that the truncation must now
be a multiple of roughly the square root of the FFT length. However it
probably isn't that noticeable in the timings, and it is very simple
to implement and understand conceptually.

I should point out that in FLINT, as I understand it, David
implemented an infinite dimensional Bailey's 4 step with truncation
right down through all the layers of the Bailey's. Well, that is my
understanding anyhow. David can correct me if I am wrong. So this
strategy is probably just a simplification of that strategy.

The only snafu with this strategy is that because we don't transpose
our matrix and don't output the coefficients in revbin order, we need
to be careful to do the right pointwise multiplications, otherwise the
result will be meaningless. But again our revbin function to the
rescue. As only full FFT's and IFFT's are done at the lower levels
(the row FFT's and IFFT's nearest to the pointwise mults), it is just
a matter of working out which blocks of coefficients need to have
pointwise multiplications performed. That can be computed with a
single revbin.

4) Negacyclic convolution.

I implemented a very naive negacyclic convolution. It was too slow to
actually use (or at least I didn't find the cutoff). The issue is that
the coefficient sizes need to be a multiple of half the FFT lengths
and as things are quite constrained, this makes things very
inefficient. No analogue of the truncation idea exists here (although
that didn't stop this nitwit from thinking about one).

Lots more work needs to be done on the negacyclic convolution before
it can be used effectively in the new FFT. Fortunately it probably
only affects integer multiplications of half a billion bits or above.
So for now things aren't too bad, though it is not clear to me what
Jason's mpn_mulmod_2expp1 is actually doing for FFT sizes. For all I
know it actually uses the MPIR negacyclic convolution, but I doubt it.
I should check.

Bill.





2010/1/2 Bill Hart <[email protected]>:
> 3. Why a new FFT?
>
> The timings I just posted probably justify writing a new FFT. However
> that is not what concretely motivated me to write one.
>
> I began thinking about writing a new FFT about the end of September
> this year. At the time I skimmed through a huge number of papers on
> the FFT for something else I was working on, and this gave me some new
> ideas (some I haven't tried yet).
>
> Of course when you want to try out completely new ideas, you very
> often have to start from scratch. The main thing which motivated me to
> start writing the current FFT a couple of weeks back was a desire to
> try and write a radix 3 FFT. This has little to nothing in common with
> an ordinary radix 2 FFT and thus I had to begin afresh.
>
> Of course the radix 3 FFT didn't work out. More about that shortly.
>
> Let me first discuss some of the issues a good FFT implementation has
> to address.
>
> 1) Fast butterflies. The main part of the FFT and IFFT routines is the
> butterfly. These need to be coded efficiently, otherwise the whole
> thing slows down.
>
> 2) Cache locality. The standard iterative radix 2 FFT is very cache
> unfriendly. It wants to access all of the coefficients about k times
> where the length of the convolution is about 2^k. For decent sized
> problems, each coefficient might be many thousands of bits, and you
> might have thousands of them. So in effect you are drawing
> coefficients from main memory, not cache, over and over. This is very
> expensive.
>
> The recursive FFT is much more cache friendly. It keeps halving the
> size of the problem and then concentrates on each half separately
> before progressing to the other half. This keeps things more localised
> and in cache. However it has its limits. The top levels of the FFT may
> be completely out of cache. For decent sized problems, maybe half of
> the layers are completely out of cache. Thus making the FFT recursive
> only fixes half the problem.
>
> 3) Truncation. If I multiply integers of different lengths, or even
> integers of the same length but which don't break nicely into 2^k
> coefficients of 2^{k-1} bits each, then the only way to multiply such
> integers with a standard FFT (whose length must be a power of 2) is to
> zero pad the integers, i.e. set a whole lot of the FFT coefficients to
> zero. This is completely inefficient and can cause the computation to
> take up to twice as long (e.g. if your integers are just a little too
> large to multiply using an FFT of a given length and you have to go up
> to one of twice the size).
>
> 4) Negacyclic convolution for large pointwise multiplications. As
> explained in an earlier section, when the pointwise multiplications
> themselves become so large that you want to use an FFT, you should use
> the negacyclic convolution, as it will take about half the time.
>
> Originally I had intended to port the FLINT FFT to MPIR. David Harvey
> and I worked on that for a very long time, then at some point I got
> interested in working on polynomial arithmetic instead of FFTs and
> David went on and developed the FFT in its current form in FLINT. I
> won't say I don't know how it works, but I certainly am not too
> familiar with all the code.
>
> The FLINT FFT does deal with all of the above issues. A lot of care
> was taken with the butterflies, cache locality was improved with
> Bailey's 4 step (an infinite dimensional version), truncation was
> taken care of with van der Hoeven's truncated FFT/IFFT and there was a
> negacyclic convolution (much more sophisticated than mine).
>
> So porting the FLINT FFT to MPIR certainly looked attractive. Even so,
> I had some problems with that. The code was quite complex, which made
> it difficult to check, understand, port, maintain and improve. That
> didn't pose an insurmountable problem, but it is what put me off
> porting it. It also wasn't written with inclusion in MPIR in mind (it
> was written before MPIR existed), so much of it would have to be
> rewritten anyhow. It was spread over multiple files, which is great
> for FLINT, but not so comfortable for MPIR. So, whilst tempted, I
> decided not to port the FLINT FFT. The main thing that tipped the
> balance though was the desire to try some different ideas.
>
> So why not just improve the MPIR FFT? It's a great implementation, the
> paper about it is very nice and we are extremely grateful to the
> authors of it for making it available so that we can use it.
>
> But the MPIR FFT has some serious problems, and this is really what
> set me off with the idea to try and come up with something better.
>
> 1) It is insanely hard to tune. There are so many tuning parameters
> and it takes hours and hours per architecture. We still haven't tuned
> it for some old architectures, and for none of the ones we don't have
> access to. Tuning it on Windows appears to be near impossible. This
> point is underlined by the crazy times I just posted. For some sized
> integers it picks totally ridiculous parameters and ends up taking
> much longer than if it were multiplying shorter integers.
>
> 2) The MPIR FFT uses both a Fermat FFT and a Mersenne FFT, i.e. it
> does a multiplication of polynomials with coefficients mod p with p =
> 2^{aN}+1 then one with coefficients mod p = 2^N-1. Then it CRT's the
> results. By varying the parameter _a_, as well as the FFT length, it
> is possible to get quite a lot of control over the zero padding
> wastage. This is essentially their strategy for dealing with issue 3
> above.
>
> In their paper:
>
> http://www.loria.fr/~gaudry/publis/issac07.pdf
>
> they claim on page 171:
>
> "a Mersenne transform can use a larger FFT length K = 2^k than the
> corresponding Fermat transform. Indeed, while K must divide N for the
> Fermat transform, so that theta = 2N/K is a power of two, it only
> needs to divide 2N for the Mersenne transform, so that omega =
> 2^(2N/K) is a power of two. This improves the efficiency for K near
> sqrt(N), and enables one to use a value of K close to optimal."
>
> However I do not understand this argument. Clearly in a ring mod 2^M+1
> we have that 2 is a 2M-th root of unity. So convolutions of length 2M
> can be supported. However in a ring mod 2^M-1 we have that 2 is only
> an M-th root of unity and so only convolutions of length M can be
> supported.
>
> So let's work through an example (the one given earlier in their paper
> when discussing efficiency):
>
> Fermat ring:
> N=1,044,480 bits
> break into K=2^10 chunks each (i.e. k = 10)
> 2N/K+k = 2050 bits per output coefficient
> coefficients must be a multiple of 1024 bits
> must use coefficients of 3*1024 = 3072 bits
>
> Mersenne ring:
> 1,044,480 bits
> break into K=2^10 chunks each (i.e. k = 10)
> 2N/K+k = 2050 bits per output coefficient
> coefficients must be a multiple of 2048 bits
> must use coefficients of 2*2048 = 4096 bits
>
> So for convolutions of the same length, the efficiency actually goes
> down for a Mersenne ring, not up. I'm probably missing something, but
> I don't get the argument.
>
> The other issue is this. Suppose we CRT a convolution with
> coefficients mod 2^M-1 and one modulo 2^M+1. Even if the Mersenne
> convolution uses the longest length convolution it can support, then
> the Fermat ring (if it uses a convolution of the same length) is then
> constrained to use a convolution of half the maximum it can support,
> i.e. the Mersenne FFT constrains the efficiency of the Fermat FFT. So
> if we set a = 1 in MPIR, we automatically lose something.
>
> However, when multiplying smaller integers, this loss of efficiency is
> actually offset by something. Had we used only a single FFT and not
> CRT'd two FFT's together, we would be forced to do pointwise
> multiplications of twice the size. Now if those pointwise
> multiplications take up much of the time and use schoolboy
> multiplication instead of an FFT then their time is essentially
> quadratic in the size. So I think that in fact, the reason the MPIR
> FFT works at all well is that it can use pointwise multiplications of
> half the size.
>
> There is no advantage to using a = 2 in MPIR. In that situation one
> could just use a single FFT with coefficients of three times the size
> and then break 2^{3B}+1 up into 2^B+1 and 2^{2B}-2^B+1 and CRT
> together. In the schoolbook range this is more efficient and so there
> is no gain for pointwise multiplications in MPIR's FFT.
>
> Well, I could carry the analysis through to an algorithm which *does*
> work. But I'll leave that for the next iteration of our
> implementation. Basically there is a strategy which will also beat
> MPIR by a long way when the multiplications are small.
>
> Another strategy I thought about was to do a negacyclic FFT and a
> cyclic FFT of a different size and CRT together. For example one could
> do one FFT mod x^m-1 with coefficients mod 2^B+1 and one FFT mod
> x^{2m}+1 with coefficients mod 2^B+1. The issue with this strategy is
> that to do a negacyclic FFT one first needs to twiddle all the
> coefficients, so it is strictly less efficient than a cyclic FFT. But
> by far the major killer is that twice as many roots of unity are
> needed to effect a negacyclic FFT of the same length as are needed for
> a cyclic FFT. This again constrains the efficiency. One cannot do the
> cyclic FFT mod x^{2m}-1 and the negacyclic one mod x^m+1 as these two
> polynomials are not coprime and thus one cannot CRT. So this
> possibility went on the scrapheap of ideas.
>
> Another strategy I tried (and fully coded), was the split radix FFT.
> Roughly speaking, one starts mod x^{4m}-1 and splits into x^m-1, x^m+1
> and x^{2m}+1. Thus one half of the recursion becomes half the length
> whilst the other half of the recursion is split into quarter sized
> FFT's.
>
> I was utterly amazed to find that when I had optimised both a radix 2
> and split radix FFT to the same degree, the timings I got were
> identical. When using complex FFT's the split radix FFT actually has a
> lower operation count (less roots of unity to apply). But the
> operation count in a Fermat ring is the same. However one thought that
> at least the overheads would be lower, more combining of operations
> could be done using the new mpn functions in MPIR or maybe the cache
> locality would be improved. But nada. Nothing. Zippo. Zilch. So I
> abandoned this idea too.
>
> The next idea I had was to handle the truncation issue by using radix
> 2 and radix 3 steps. So long as you have 3^j-th roots of unity (or so
> I thought) you can perform a radix 3 FFT. My strategy was to do all
> the top levels of the FFT at radix 2, cutting the length of the FFT in
> half with each recursion. Once the roots of unity corresponded to
> shifts by a multiple of a limb, I would switch to radix 3. I reasoned
> that you needed less layers of radix 3 to get you down to the
> pointwise mults than you would need of radix 2. Because you can use
> both radix 2 and radix 3 steps, your FFT could be any length of the
> form 2^k*3*j. That effectively solved the truncation problem.
>
> One can manufacture cube roots of unity by working in a ring 2^{3B}+1.
> Here I could try using 2^B as a cube root of unity.
>
> I went ahead and implemented the *entire thing*, including a kind of
> Bailey's 4-step for cache locality. I even got timings for it, and
> they were very good. Only one problem. When I multiplied integers with
> it, I got the wrong answer. Coding bug? No!
>
> The problem is this. When doing a radix 3 FFT, everything proceeds as
> per a radix 2 FFT for the FFT side of things. And it works. It
> evaluates at 3^j-th roots of unity.
>
> However, let's suppose you have a polynomial mod x^n-1, x^n-w and
> x^n-w^2 where w is a cube root of unity. If your polynomial mod
> x^{3n}-1 is a(x)+b(x)*x^n+c(x)*x^n you will have d1 = a+b+c, d2 =
> a+w*b+w^2*c and d3 = a+w^2*b+w*c.
>
> Now how would I retrieve a, b and c? Clearly I can get 3*a by adding
> d1, d2 and d3....
>
> Or can I?
>
> This is actually where things fall down. In order to get 3*a I need
> that b + w*b + w^2*b = 0 and c + w^2*c+w*c = 0. In other words, I need
> that 1 + w + w^2 = 0.
>
> Well when w is a complex number, this holds. But in my ring mod
> 2^{3B}+1 it doesn't hold. Go ahead try it. It doesn't work.
>
> So in order to do a radix 3 FFT one actually needs to work in Z/pZ
> where p = 2^{2B}+2^B+1. Then the roots of unity work as desired.
>
> But now there is a problem. The binary representation of p has 3
> binary 1's. That means that to reduce mod p takes two subtractions,
> not 1. In other words, the amount of time doing butterflies is
> basically doubled. It's worse than that though. We are now working mod
> a value p which is 2/3 of the size of what we wanted. Not only is that
> so, but each butterfly has 6 operations instead of the 2 that we had
> for a radix 2 FFT. In every way, radix 3 is useless in Fermat rings
> (it's fine in rings of characteristic 2 and other rings where 2 is not
> invertible).
>
> So despite all the effort I put into this, it too went on the
> scrapheap. And I realised I was a nitwit.
>
> But by now I already had too much code to throw it in the garbage. So
> I sat down and thought about how else I might solve the truncation
> problem. By this stage I had already got something efficient to solve
> all the other issues (except for the negacyclic convolution, which is
> only important for really huge multiplications).
>
> So that is why I pushed on and wrote the current FFT.
>
> In the next section I will describe what I actually ended up doing.
> It's not too surprising given the options actually available.
>
> Bill.
>
> 2010/1/1 Bill Hart <[email protected]>:
>> 2. FFT's for nitwits (like me)
>> ====================
>>
>> I will explain why I am a nitwit in a later section. But to understand
>> that, you need to understand some FFT basics first. So I describe them
>> here.
>>
>> 2a. Description of the FFT
>> ===================
>>
>> An FFT is not an integer multiplication routine per se, but it can be
>> used to construct one.
>>
>> The first step in multiplying huge integers A and B is to view the
>> problem as a polynomial multiplication. Write your large integers in
>> some base D (usually D = 2^r for some r), i.e. write A = f1(D), B =
>> f2(D) for polynomials f1 and f2 whose coefficients are in the range
>> [0, D).
>>
>> Next, multiply the polynomials g(x) = f1(x)*f2(x).
>>
>> Finally, evaluate the product polynomial at D, i.e. C = A*B = f1(D)*f2(D).
>>
>> So from now on I'll talk about large integer multiplication as though
>> it is polynomial multiplication.
>>
>> Polynomial multiplication is computed using convolution of the two
>> polynomials. (Cyclic) convolution of polynomials is defined (up to
>> scaling) as multiplication of the polynomials modulo x^n - 1 for some
>> n. Note that if the degree of g is less than n then g(x) can be
>> recovered precisely from the convolution.
>>
>> The FFT can be used to efficiently compute the convolution.
>> Specifically convolution of two polynomials f1 and f2 can be computed
>> as FFT^(-1)(FFT(f1).FFT(f2)) where the dot is the coordinate-wise
>> product of vectors. The inverse of an FFT can be computed (up to
>> scaling) by an IFFT. Thus for convolution modulo x^n - 1 where n =
>> 2^k, the convolution can be computed as 2^(-k)*IFFT(FFT(f1).FFT(f2)),
>> where all the operations are on vectors of coefficients. In particular
>> the coordinate-wise product involves multiplication of coefficients
>> which are in [0,D). These products are known as pointwise
>> multiplications.
>>
>> You should view the FFT as evaluating the polynomial at n = 2^k
>> different roots of unity, 1, z, z^2, ...., z^{2k-1}, where z is a
>> primitive 2^k-th root of unity. The pointwise products multiply such
>> values, effectively giving you the product polynomial g (modulo x^n -
>> 1) evaluated at each of these roots of unity. The coefficients of the
>> polynomial g are retrieved by inverting the FFT process, i.e. by
>> interpolating the vector of pointwise products.
>>
>> So how does the FFT actually work?
>>
>> We start with a polynomial modulo x^n - 1 where n = 2^k. To save on
>> notation, let's write x^(2m) - 1 where m = 2^(k-1).
>>
>> We write x^(2m) - 1 = (x^m - 1)(x^m + 1). If we wish to multiply
>> polynomials modulo x^(2m) - 1, our first step is to express our
>> polynomials modulo x^m - 1 and x^m + 1. We then compute the products
>> mod x^m - 1 and mod x^m + 1 and recombine using CRT.
>>
>> Let f(x) = b(x)*x^m + a(x) where the degree of  a(x) and b(x) are less
>> than m, then f(x) modulo x^m - 1 is just a(x) + b(x), i.e. we simply
>> add coefficients in pairs. Similarly modulo x^m + 1 the polynomial
>> would be a(x) - b(x).
>>
>> Taking the product of polynomials modulo x^m - 1 can then be computed
>> recursively using the same algorithm, i.e. by a convolution modulo x^m
>> - 1/
>>
>> To compute the product modulo x^m + 1 we make a substitution y = z*x
>> where z is a primitive 2^k-th root of unity. This transforms the
>> problem into one modulo y^m - 1, which can again be done recursively
>> using the same algorithm. Note that substituting y = z*x is equivalent
>> to multiplying the coefficient x^i of f(x) by z^i. So we twiddle the
>> coefficients by roots of unity.
>>
>> A convolution mod x^m - 1 is called a cyclic convolution, and a
>> convolution mod x^m + 1 is called a negacyclic convolution. So the FFT
>> algorithm works by breaking the cyclic convolution into a cyclic
>> convolution of half the size and a negacyclic convolution of half the
>> size and transforming the negacyclic convolution into a cyclic
>> convolution by twiddling by roots of unity.
>>
>> In practice the evaluation step, adding or subtracting coefficients in
>> pairs, and the twiddle step for the negacyclic half, are combined.
>> This leads to the familiar FFT "butterfly" which replaces the left
>> half of a vector of coefficients of f(x), [a0, a1, a2, ...., a{m-1},
>> b0, b1, ..., b{m-1}] with [a0+b0, a1+b1, ...., a{m-1}+b{m-1}] and the
>> right half by [a0-b0, z(a1-b1), z^2*(a2-b2),...,
>> z^{m-1}*(a{m-1}-b{m-1}]. The butterfly step can be conveniently
>> written [a{i}, b{i}] => [a{i}+b{i}, z^i*(a{i}-b{i})].
>>
>> One then recursively computes the FFT butterfly of the left and right
>> halves of the vector. Eventually one gets down to vectors of length 1
>> which represent polynomials modulo x-1 (or really x-z^i for i =
>> 0,..,2m). Note that taking a polynomial f(x) modulo x - z^i is the
>> same thing as evaluating it at z^i. So what the FFT really computes is
>> [f(1), f(z), f(z^2), ...., f(z^{2m})] (well perhaps not in that order,
>> but in some order - probably bit reversed order, i.e. z^i is swapped
>> with z^j where i is the reverse of j in binary, when they are
>> considered as binary numbers of log_2(2m) bits).
>>
>> This left-right FFT recursion scheme is known as a decimation in
>> frequency (DIF) FFT.
>>
>> Note that multiplication of polynomials modulo x-1 is simply integer
>> multiplication. These are the pointwise multiplications we spoke of
>> earlier. We simply multiply our vectors of coefficients pointwise.
>>
>> Now suppose I have a polynomial known modulo x^m - 1 and x^m + 1, how
>> can I invert this process and obtain it modulo x^(2m) - 1?
>> Specifically, if I have c{i} = a{i}+b{i} and d{i} = z^i*(a{i}-b{i}),
>> how can I retrieve a{i} and b{i}. Clearly the answer is to multiply
>> d{i} by z^{-i}, then adding it to c{i} yields 2*a{i} and subtracting
>> it from c{i} yields 2*b{i}. Notice the extra factor of 2 which has
>> crept in.
>>
>> So the IFFT butterfly step replaces [c0, d0, c1, d1, ...., c{m-1},
>> d{m-1}] with [c0+d0, c0-d0, c1+z^(-1)*d1, c1-z*d1, ....,
>> c{m-1}+z^{-(m-1)}*d{m-1}, c{m-1}-z^{-(m-1)}*d{m-1}]. Or we can write
>> the butterfly conveniently as [c{i}, d{i}] => [c{i}+z^-i*d{i},
>> c{i}-z^-i*d{i}].
>>
>> Note we have paired the odd indexed entries in our vector with the
>> even indexed entries. An FFT which has the butterfly step [c{i}, d{i}]
>> => [c{i}+z^i*d{i}, c{i}-z^i*d{i}] (note the change in sign of the
>> exponents on the twiddle factors to convert from an IFFT to an FFT) is
>> called a decimation in time (DIT) FFT.
>>
>> So our IFFT is simply a DIT FFT with the sign of the exponent of the
>> twiddle factors changed.
>>
>> Now, an important note for later is that the reason this all worked is
>> that we can invert the FFT process. Specifically c{i}+z^-i*d{i} =
>> 2*a{i}+b{i}-b{i} = 2*a{i}, and similarly we can retrieve 2*b{i}.
>>
>> One final note. These factors of 2 that creep in can be removed at any
>> stage. They are sometimes removed at each butterfly step, sometimes
>> all at once at the end of the IFFT.
>>
>> 2b. FFT algorithms
>> ==============
>>
>> There are multiple different ways of implementing the FFT algorithm in
>> practice, and a number of important additional algorithms used in
>> multiplying integers using an FFT. I will describe them here.
>>
>> a) Complex FFT. One can use complex numbers and floating point
>> arithmetic to multiply integers using the FFT. One has to be careful
>> that at each step not too much precision is being lost. For example,
>> one could work with double precision floating point numbers.
>>
>> Naturally it is inefficient to use complex numbers, as the polynomials
>> we are trying to multiply have only real coefficients. There are some
>> strategies for getting around this. One can exploit the symmetries of
>> the FFT to get FFT's which work on real values and take roughly half
>> the time/space. For example the Fast Hartley Transform works with only
>> real values, and there are various other variations on this theme.
>>
>> Note that roots of unity here are complex values, or for the fast
>> Hartley transform, real values which are linear combinations of sines
>> and cosines. So computing the twiddle factors can be expensive.
>>
>> The biggest issue with real/complex FFT's is the loss of precision.
>> They are really fast as they can use the native floating point
>> arithmetic of the processor, they just usually don't guarantee a
>> correct result, and aren't so fast or don't support very long
>> transforms if they do.
>>
>> b) Schonhage-Strassen FFTs. Schonhage and Strassen suggested working
>> in the ring Z/pZ where p is of the form 2^m+1. Here 2 is a 2m-th
>> (2^k-th) root of unity. Simply raise 2^m and you see that you get -1
>> in this ring, thus 2^(2m) must be 1 in this ring.
>>
>> Note that multiplication by a root of unity then becomes
>> multiplication by a power of 2 (modulo 2^m + 1), or a bitshift on a
>> binary computer. Also working in this ring is convenient, as reduction
>> modulo p can be done with a subtraction. In practice we can work in a
>> ring Z/pZ where p = 2^(a*m)+1 for some value a. Here 2^a is a 2m-th
>> (2^k-th) root of unity. A ring of this form is called a Fermat ring
>> and p a generalised Fermat number.
>>
>> Usually p is chosen to be a large prime that takes up numerous machine
>> limbs. For performance reasons a and m are usually chosen so that
>> reduction modulo
>> p does not involve any bitshifts, i.e. 2^(a*m) is a multiple of a limb shift.
>>
>> c) Mersenne Ring FFT. Work in a ring modulo p = 2^(b*m) - 1, again for
>> m a power of 2 and b arbitrary. Here 2 is an m-th root of unity,
>> reduction modulo p is an addition. Numbers p of this form are called
>> generalised Mersenne numbers.
>>
>> d) Number Theoretic Transforms (NTT's). Work in a ring Z/pZ where p is
>> a prime such that Z/pZ has lots of roots of unity. Usually p is chosen
>> to be a small prime that fits in a machine word. We usually start with
>> a prime p of the form p = a*n + 1 for some small value a and some
>> larger value n (for example a power of 2 so that reduction mod p has a
>> chance of being cheap). We find a primitive root r mod p, and then s =
>> r^a (mod p) by Fermat's Little Theorem is a primitive n-th root of
>> unity. Thus we can support a transform of length n on the ring Z/pZ.
>>
>> Many of the historically large integer products that have been
>> computed, e.g. in computation of many digits of pi, involved NTT's.
>> One can use a few of them, modulo different primes p (usually 2-5 of
>> them) and combine using CRT.
>>
>> The problems with NTT's are twofold. Firstly, once one gets beyond
>> about 5 of them, it becomes a significant proportion of the runtime to
>> do the CRT step. Also, in practice, NTT implementations don't compare
>> favourably with optimised Schonhage-Strassen implementations, being an
>> order of magnitude slower in practice. On a 32 bit machine, the
>> maximum transform length for a word sized prime can also be limiting
>> for very large problems.
>>
>> The main drawback with NTT's is that multiplication by a root of unity
>> is no longer a bit shift. One must also compute all the roots of
>> unity, which usually means keeping some table of them handy. For very
>> long transforms, the table itself may exceed the size of cache. Or one
>> can partially compute the table and then perform some additional
>> computations on-the-fly.
>>
>> -----------------
>>
>> Now I will discuss some additional algorithms which one needs to know about.
>>
>> e) Multiplication of integers modulo B^m + 1, using the negacyclic
>> convolution. Recall that the negacyclic convolution can be used to
>> multiply polynomials modulo x^n+1. If we evaluate such a polynomial at
>> some point x = B then we can use a negacyclic convolution to multiply
>> integers modulo B^n + 1. There's nothing special about using this
>> strategy to multiply integers. A cyclic convolution will generally be
>> faster, as one doesn't need the extra twiddle factors to turn the
>> negacyclic convolution into a cyclic convolution (so it can be
>> computed using an FFT). But for multiplying integers modulo a number
>> of the form B^n + 1 it is very useful.
>>
>> Note that the pointwise multiplications in our convolution algorithm
>> are precisely multiplications modulo a number of the form B^m + 1,
>> when we are working in a Fermat ring. Thus we can effect our pointwise
>> multiplications using a negacyclic convolution. This allows the FFT to
>> recurse down to an FFT for the pointwise multiplications (when they
>> become sufficiently large). This is more efficient than multiplying
>> using an ordinary cyclic convolution and then reducing modulo p, as a
>> convolution of half the length can be used (you actually *want* it to
>> wrap around). This can save a significant amount of time once the
>> pointwise multiplications become so large that the use of an FFT is
>> indicated. If fact, because the convolution will be half the size, a
>> negacyclic convolution can be used *before* one would consider using
>> an ordinary FFT routine to multiply the integers.
>>
>> That's all I need to discuss for this section, as I will explain some
>> other algorithms in later sections as they are needed.
>>
>> Bill.
>>
>> 2010/1/1 Bill Hart <[email protected]>:
>>> 1. Formal release of code.
>>>
>>> I have included two new files in the top source directory of mpir in
>>> svn. The first is new_fft_with_flint.c, the second new_fft.c. They are
>>> isomorphic except that the second does not need to be linked against
>>> FLINT to run. There will be no difference in timings or results, I
>>> just removed all the timing and test code which relied on FLINT.
>>>
>>> Here is the command I use to build the code once you have built MPIR:
>>>
>>> gcc -O2 new_fft.c -o fft -I/home/wbhart/mpir-trunk
>>> -L/home/wbhart/mpir-trunk/.libs -lmpir
>>>
>>> Modify as desired for your own system.
>>>
>>> Please note the usual disclaimer:
>>>
>>> This code is INCOMPLETE, INEFFICIENT, INCONSISTENT often INCORRECT,
>>> ILL-TESTED and the original author is ILL-TEMPERED.
>>>
>>> It is LGPL v2+ licensed and I have taken extreme care to ensure that
>>> all the code was written from scratch by me without following someone
>>> else's work. The four functions FFT_split_* and FFT_combine_* were
>>> taken from FLINT, but to my knowledge were written entirely by me. I
>>> hereby release them under the LGPL v2+.
>>>
>>> The timing code does not print times, it merely completes a certain
>>> number of iterations and exits. Use the linux time command to get
>>> times.
>>>
>>> I have run very few tests on the multiplication routine, perhaps on
>>> the order of a few thousand integer multiplications, at very few
>>> sizes. This code is extremely new and should be treated as quite
>>> likely broken.
>>>
>>> Note that the negacyclic convolution is currently disabled. I don't
>>> even know if it passes its test code when switched on at present. I
>>> modified a number of the FFT's since writing it, so probably not. But
>>> it did pass it's test code. However note it does not deal with a
>>> couple of (trivial) special cases, so is incomplete and formally
>>> incorrect because of that.
>>>
>>> There is an implementation of the split radix FFT contained in the
>>> file too. I found the times were *identical* to those I got from the
>>> radix 2 code, so I abandoned it. There is also an old implementation
>>> of the radix 2 FFT/IFFT which I also abandoned. The basic radix 2 FFT
>>> function is called FFT_radix2_new.
>>>
>>> All the truncate, mfa and twiddle functions are new in that they don't
>>> rely on the old radix 2 implementation, I believe.
>>>
>>> Many of the code comments have been cut and pasted from earlier code
>>> they were derived from, so will be incorrect. This is especially true
>>> of the comments for the FFT and IFFT functions (including the truncate
>>> and mfa functions), except for the radix2_new and split_radix
>>> functions which I hope are about right.
>>>
>>> Bill.
>>>
>>> 2010/1/1 Bill Hart <[email protected]>:
>>>> As the code for this FFT is quite involved, I plan to incrementally
>>>> release a set of notes about it, starting tomorrow morning (today
>>>> already), which may take hours to days for me to complete. I got the
>>>> code working for the first time this afternoon, so please be patient
>>>> with me.
>>>>
>>>> Here is the rough list of topics I hope to cover in the notes:
>>>>
>>>> Table of Contents
>>>> =============
>>>>
>>>> 1) The formal code release. I need to check all the test code still
>>>> passes and make it easy for people to build and run. That will take an
>>>> hour or so for me to do.
>>>>
>>>> 2) FFT's for nitwits (like me). How does an FFT help in multiplying
>>>> numbers, basically, and what are the important algorithms and
>>>> techniques.
>>>>
>>>> 3) Why a new FFT? This is an important question which needs answering.
>>>> I will discuss what led me to work on a new FFT implementation, rather
>>>> than improve or port another implementation.
>>>>
>>>> 4) The strategy. What does this FFT actually do? Which algorithms does
>>>> it use?  I tried many things, most of which completely failed. In
>>>> section 3 I will have already discussed other strategies I tried which
>>>> did not work. Here I will discuss what did work. Whilst many of the
>>>> details came from my head, I'm pretty sure the general outline will
>>>> contain few surprises.
>>>>
>>>> 5) References. Few. I purposely tried to be creative.
>>>>
>>>> 8) Implementors notes. A long list of things that can be improved. In
>>>> releasing the code publicly it is my hope that contributors will help
>>>> make this the best FFT implementation it can be. The list will include
>>>> many trivial items which anyone could work on, all the way through to
>>>> fairly advanced topics which would probably require original research
>>>> to do really well. I'll rank the items according to how hard I think
>>>> they'd be to do.
>>>>
>>>> The code will be under revision control, so it is my hope that people
>>>> will be willing to contribute to it. I believe I spent between 150 and
>>>> 200 hours writing it, all told.
>>>>
>>>> Till tomorrow.
>>>>
>>>> Bill.
>>>>
>>>> 2010/1/1 Bill Hart <[email protected]>:
>>>>> Ha! I forgot to mention, those numbers represent the size of the
>>>>> integers in bits. I multiply two different integers with the same
>>>>> number of bits, though that is not a limitation of the code. It will
>>>>> accept integers of different lengths and adjust accordingly.
>>>>>
>>>>> Bill.
>>>>>
>>>>> 2010/1/1 Bill Hart <[email protected]>:
>>>>>> I re-timed that point and I believe I may have just mistranscribed it.
>>>>>> Here it is with the corrected time. Thanks for pointing it out!
>>>>>>
>>>>>> I've also included some times from the FLINT FFT, though only
>>>>>> non-truncated ones (i.e. where FLINT uses a convolution which is an
>>>>>> exact power of 2 length - not a limitation of FLINT, I just didn't get
>>>>>> around to timing it at the other points yet).
>>>>>>
>>>>>> 8364032:  6.8s / 8.0s
>>>>>> 9409536:  9.8s / 8.7s
>>>>>> 10455040:  10.8s / 9.6s
>>>>>> 11500544: 12.3s / 12.6s
>>>>>> 12546048: 12.7s / 15.3s
>>>>>> 13591552: 14.0s / 14.2s
>>>>>> 14637056: 14.9s / 15.3s
>>>>>> 15682560: 16.0s / 14.5s
>>>>>> 16728064: 16.8s / 19.5s
>>>>>>
>>>>>> 20910080: 24.3s / 23.2s
>>>>>> 25092096: 28.5s / 33.1s
>>>>>> 29274112: 33.1s / 31.8s
>>>>>> 33456128: 37.4s / 47.7s
>>>>>>
>>>>>> 519168: iters 10000: 30.7s / 32.2s / 32.3s
>>>>>> 2084864: iters 1000: 13.9s / 15.2s / 13.0s
>>>>>> 8364032: iters 100: 6.8s / 8.0s / 7.6s
>>>>>> 33497088: iters 100: 37.4s / 47.7s / 37.5s
>>>>>> 134103040: iters 10: 17.0s / 24.8s / 19.6s
>>>>>> 536608768: iters 1: 8.6s / 9.4s / 8.6s
>>>>>> 2146959360: iters 1: 42.3s / 40.3s / 38.9s
>>>>>>
>>>>>> Bill.
>>>>>>
>>>>>>
>>>>>> 2009/12/31 Bill Hart <[email protected]>:
>>>>>>> It's either a tuning issue or timing instability on the machine.
>>>>>>>
>>>>>>> Bill.
>>>>>>>
>>>>>>> On 31/12/2009, Robert Gerbicz <[email protected]> wrote:
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> How is it possible that to multiple longer numbers takes less time, for
>>>>>>>> example:
>>>>>>>>
>>>>>>>> 12546048: 14.7s / 15.3s
>>>>>>>> 13591552: 14.0s / 14.2s
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>> Groups
>>>>>>>> "mpir-devel" group.
>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>> [email protected].
>>>>>>>> For more options, visit this group at
>>>>>>>> http://groups.google.com/group/mpir-devel?hl=en.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

--

You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en.


Reply via email to