Re: [mpir-devel] Re: A new FFT for the New Year

Bill Hart Sun, 03 Jan 2010 06:03:45 -0800

2010/1/3 David Harvey <[email protected]>:
>
> On Jan 2, 4:13 pm, Bill Hart <[email protected]> wrote:
>
>> 2) The MPIR FFT uses both a Fermat FFT and a Mersenne FFT, i.e. it
>> does a multiplication of polynomials with coefficients mod p with p =
>> 2^{aN}+1 then one with coefficients mod p = 2^N-1. Then it CRT's the
>> results. By varying the parameter _a_, as well as the FFT length, it
>> is possible to get quite a lot of control over the zero padding
>> wastage. This is essentially their strategy for dealing with issue 3
>> above.
>
> You are misunderstanding what they wrote. They explicitly say that the
> Mersenne multiplication can be done at the top level only. They do not
> ever work modulo 2^N - 1 for the pointwise multiplications. This is
> probably not possible. See the recent discussion on the GMP list, for
> example the thread


I figured I probably did misunderstand. What I think they are trying
to get at is that one can use coefficients modulo 2^N+1 except at
certain levels because the roots of unity are actually the same. But
then the discussion becomes one of combining a Fermat and Mersenne
transform, which is the basic strategy that the implementation uses. I
got confused. I still am.

>
> http://gmplib.org/list-archives/gmp-devel/2009-October/001053.html
>
>> In their paper:
>>
>> http://www.loria.fr/~gaudry/publis/issac07.pdf
>>
>> they claim on page 171:
>>
>> "a Mersenne transform can use a larger FFT length K = 2^k than the
>> corresponding Fermat transform. Indeed, while K must divide N for the
>> Fermat transform, so that theta = 2N/K is a power of two, it only
>> needs to divide 2N for the Mersenne transform, so that omega =
>> 2^(2N/K) is a power of two. This improves the efficiency for K near
>> sqrt(N), and enables one to use a value of K close to optimal."
>>
>> However I do not understand this argument. Clearly in a ring mod 2^M+1
>> we have that 2 is a 2M-th root of unity. So convolutions of length 2M
>> can be supported. However in a ring mod 2^M-1 we have that 2 is only
>> an M-th root of unity and so only convolutions of length M can be
>> supported.
>
> They are saying that for multiplication modulo 2^N - 1, you can use a
> longer transform (and smaller coefficients) than for a multiplication
> modulo 2^N + 1.

Sorry to be obtuse, but I am still confused.

What is modulo 2^N - 1, the coefficients of the FFT? Or are we
multiplying large integers modulo 2^N - 1.

>
>> There is no advantage to using a = 2 in MPIR. In that situation one
>> could just use a single FFT with coefficients of three times the size
>> and then break 2^{3B}+1 up into 2^B+1 and 2^{2B}-2^B+1 and CRT
>> together. In the schoolbook range this is more efficient and so there
>> is no gain for pointwise multiplications in MPIR's FFT.
>
> You will recall I did this in FLINT and it worked quite well. Except
> that it got very complicated, and the code I wrote had a bug that only
> got picked up by the spooks years later.

Yes. It's an old idea. Jason Moxham implemented it many years ago, and
I think this was one of the things he actually posted to the GMP list
a few years before we started on FLINT. Dan Bernstein seemed to think
the idea went back much further.

But yes, you also came up with it an implemented it in FLINT.

>
>> Well, I could carry the analysis through to an algorithm which *does*
>> work. But I'll leave that for the next iteration of our
>> implementation. Basically there is a strategy which will also beat
>> MPIR by a long way when the multiplications are small.
>
> I look forward to hearing about it.
>
>> Another strategy I thought about was to do a negacyclic FFT and a
>> cyclic FFT of a different size and CRT together. For example one could
>> do one FFT mod x^m-1 with coefficients mod 2^B+1 and one FFT mod
>> x^{2m}+1 with coefficients mod 2^B+1.
>
> This is mathematically equivalent to a truncated FFT of length 3m.

Yes.

>
>> Another strategy I tried (and fully coded), was the split radix FFT.
>> Roughly speaking, one starts mod x^{4m}-1 and splits into x^m-1, x^m+1
>> and x^{2m}+1. Thus one half of the recursion becomes half the length
>> whilst the other half of the recursion is split into quarter sized
>> FFT's.
>>
>> I was utterly amazed to find that when I had optimised both a radix 2
>> and split radix FFT to the same degree, the timings I got were
>> identical. When using complex FFT's the split radix FFT actually has a
>> lower operation count (less roots of unity to apply). But the
>> operation count in a Fermat ring is the same. However one thought that
>> at least the overheads would be lower, more combining of operations
>> could be done using the new mpn functions in MPIR or maybe the cache
>> locality would be improved. But nada. Nothing. Zippo. Zilch. So I
>> abandoned this idea too.
>
> The split-radix FFT obtains its speedup because multiplication by i or
> -i is faster than multiplication by other numbers. In the Fermat ring
> modulo 2^N + 1, multiplication by i corresponds to a shift by N/2
> bits. But this is absorbed into all the other work, assuming that you
> are merging shifts as we did in the FLINT implementation. It doesn't
> surprise me that there wasn't any measurable speedup.
>
>> The next idea I had was to handle the truncation issue by using radix
>> 2 and radix 3 steps. So long as you have 3^j-th roots of unity (or so
>> I thought) you can perform a radix 3 FFT. My strategy was to do all
>> the top levels of the FFT at radix 2, cutting the length of the FFT in
>> half with each recursion. Once the roots of unity corresponded to
>> shifts by a multiple of a limb, I would switch to radix 3. I reasoned
>> that you needed less layers of radix 3 to get you down to the
>> pointwise mults than you would need of radix 2. Because you can use
>> both radix 2 and radix 3 steps, your FFT could be any length of the
>> form 2^k*3*j. That effectively solved the truncation problem.
>>
>> One can manufacture cube roots of unity by working in a ring 2^{3B}+1.
>> Here I could try using 2^B as a cube root of unity.
>>
>> I went ahead and implemented the *entire thing*, including a kind of
>> Bailey's 4-step for cache locality. I even got timings for it, and
>> they were very good. Only one problem. When I multiplied integers with
>> it, I got the wrong answer. Coding bug? No!
>>
>> The problem is this. When doing a radix 3 FFT, everything proceeds as
>> per a radix 2 FFT for the FFT side of things. And it works. It
>> evaluates at 3^j-th roots of unity.
>>
>> However, let's suppose you have a polynomial mod x^n-1, x^n-w and
>> x^n-w^2 where w is a cube root of unity. If your polynomial mod
>> x^{3n}-1 is a(x)+b(x)*x^n+c(x)*x^n you will have d1 = a+b+c, d2 =
>> a+w*b+w^2*c and d3 = a+w^2*b+w*c.
>>
>> Now how would I retrieve a, b and c? Clearly I can get 3*a by adding
>> d1, d2 and d3....
>>
>> Or can I?
>>
>> This is actually where things fall down. In order to get 3*a I need
>> that b + w*b + w^2*b = 0 and c + w^2*c+w*c = 0. In other words, I need
>> that 1 + w + w^2 = 0.
>>
>> Well when w is a complex number, this holds. But in my ring mod
>> 2^{3B}+1 it doesn't hold. Go ahead try it. It doesn't work.
>>
>> So in order to do a radix 3 FFT one actually needs to work in Z/pZ
>> where p = 2^{2B}+2^B+1. Then the roots of unity work as desired.
>
> Right, this idea is due to Schonhage (1977, "Schnelle Multiplikation
> von Polynomen über Körpern der Charakteristik 2").

I could probably have saved myself a lot of trouble by realising that
this was the main reason he had the idea and that of course he would
have applied it to Schonhage-Strassen rings if it were any better.

>
>> But now there is a problem. The binary representation of p has 3
>> binary 1's. That means that to reduce mod p takes two subtractions,
>> not 1. In other words, the amount of time doing butterflies is
>> basically doubled. It's worse than that though. We are now working mod
>> a value p which is 2/3 of the size of what we wanted. Not only is that
>> so, but each butterfly has 6 operations instead of the 2 that we had
>> for a radix 2 FFT. In every way, radix 3 is useless in Fermat rings
>> (it's fine in rings of characteristic 2 and other rings where 2 is not
>> invertible).
>>
>> So despite all the effort I put into this, it too went on the
>> scrapheap. And I realised I was a nitwit.
>>
>> But by now I already had too much code to throw it in the garbage. So
>> I sat down and thought about how else I might solve the truncation
>> problem. By this stage I had already got something efficient to solve
>> all the other issues (except for the negacyclic convolution, which is
>> only important for really huge multiplications).
>>
>> So that is why I pushed on and wrote the current FFT.
>>
>> In the next section I will describe what I actually ended up doing.
>> It's not too surprising given the options actually available.
>
> I look forward to reading it.
>
> david
>
> --
>
> You received this message because you are subscribed to the Google Groups 
> "mpir-devel" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to 
> [email protected].
> For more options, visit this group at 
> http://groups.google.com/group/mpir-devel?hl=en.
>
>
>

--

You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en.

Re: [mpir-devel] Re: A new FFT for the New Year

Reply via email to