Jason Stratos Papadopoulos wrote (to Peter Montgomery):
>
> > Ernst Mayer and I exchanged many mailings about
> > using GF(p^2) when p = 2^61 - 1.
> > I thought he had implemented it, as part of a
> > mixed integer/complex FFT.
>
> As I remember, he *had* implemented it but the project is in limbo now for
> several reasons: first, the Alpha compiler wouldn't interleave the integer
> and floating point instructions, and also Ernst mentioned that
> non-power-of-two transforms in GF(p^2) had some subtle difficulty
> (something about the order that you applied the various radices being
> important). Last I heard he was embarking on implementing a mixed
> integer/complex FFT in assembly language, but that was quite a while ago
> and he likely has moved on.
>
Indeed I did implement a version for power-of-2 vector
lengths. But even on the Alpha 21264 it was much slower
than I expected, so, my time being limited, I shelved it
and preferred to work on enhancing the all-floating FFT
version, where the near-term payoff was much surer. Now
that I've squeezed about as much out of the floating-point
version as I think can reasonably be done, I've been
thinking about the modular stuff again. Your experience
with all-modular convolution on the 21264 (within a
factor of 2 of the speed of the floating FFT, I believe
you said) gives me hope that a pure-integer transform
over a field with nice arithmetic properties (such as
the complex integers GF(q^2) you mentioned; choosing q
a Mersenne prime makes the modular arithmetic a lot
nicer) can compete with a floating-point transform on
hardware with good integer support (multiply being the
key in most instances). What happened to me, as you
noted, was a compiler that was awful at interleaving
floating and integer code.
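To illustrate the "nicer" modular arithmetic a Mersenne q buys:
reduction mod 2^61 - 1 needs no division at all, only shifts,
masks and adds, because 2^61 == 1 (mod q). A minimal Python
sketch (the names are mine, purely illustrative):

```python
# Reduction modulo the Mersenne prime M61 = 2^61 - 1 by folding:
# since 2^61 == 1 (mod M61), any bits above position 61 may simply
# be added back onto the low 61 bits.
M61 = (1 << 61) - 1

def reduce_m61(x):
    """Reduce a non-negative integer x mod 2^61 - 1 without division."""
    while x >> 61:                    # anything left above bit 61?
        x = (x & M61) + (x >> 61)     # fold high bits onto low bits
    return 0 if x == M61 else x       # canonicalize M61 itself to 0

def mul_m61(a, b):
    """Modular product of two residues a, b < M61."""
    return reduce_m61(a * b)
```

On 64-bit hardware the 122-bit product lives in a high:low register
pair and each fold is one shift/mask/add sequence per word.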
The other "problem" I encountered, which you alluded to,
is that one of the really nice properties of arithmetic
over GF(q^2), namely that arithmetic modulo the Gaussian
integers closely mirrors that over the complex numbers,
fails for non-power-of-2 runlengths, in the sense that
the conjugation property exploited by complex FFTs which
pack real inputs into half-length complex vectors no
longer holds for the modular data. The conjugates are
still there, but their locations in the transform
output are all screwed up, to use nontechnical language.
This probably can be overcome or at least mitigated to
some degree, but it didn't look like it was going to be
pretty when I first encountered it. But starting with
a simple power-of-2 runlength implementation and using
a good compiler (I plan to switch to C) should be a
worthwhile effort.
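One way to see the non-power-of-2 difficulty described above: in
GF(q^2) conjugation is the Frobenius map x -> x^q, so for pure-real
input one gets conj(X[k]) = X[q*k mod N] rather than X[N-k]; the two
coincide exactly when q == -1 (mod N), i.e. when N divides
q + 1 = 2^p. The conjugates survive, but at permuted locations, for
any other runlength. A sketch (my own illustration, not Ernst's
code) checking the symmetry for q = M31, N = 8:

```python
# Check that the real-input conjugate symmetry X[N-k] == conj(X[k])
# survives in GF(q^2) for a power-of-2 runlength.  q = M31, N = 8.
q = (1 << 31) - 1   # M31; q == 3 (mod 4), so x^2 + 1 is irreducible

def cmul(x, y):
    """Multiply (re, im) pairs in GF(q^2) = GF(q)[I]/(I^2 + 1)."""
    xr, xi = x
    yr, yi = y
    return ((xr * yr - xi * yi) % q, (xr * yi + xi * yr) % q)

def cpow(x, n):
    """x**n in GF(q^2) by binary exponentiation."""
    r = (1, 0)
    while n:
        if n & 1:
            r = cmul(r, x)
        x = cmul(x, x)
        n >>= 1
    return r

def conj(x):
    """Conjugation a + b*I -> a - b*I (the Frobenius map x -> x^q)."""
    return (x[0], (-x[1]) % q)

# A root of order 8: take r of order 2^(p+1) (a closed form for it
# is discussed below in the text) and raise it to 2^(p+1)/8.
p = 31
e = 1 << (p - 2)
r = (pow(2, e, q), (-pow(3, e, q)) % q)
w = cpow(r, (1 << (p + 1)) // 8)

def ntt(x):
    """Naive length-N transform over GF(q^2) with root w."""
    N = len(x)
    out = []
    for k in range(N):
        acc = (0, 0)
        for n in range(N):
            t = cmul(x[n], cpow(w, n * k))
            acc = ((acc[0] + t[0]) % q, (acc[1] + t[1]) % q)
        out.append(acc)
    return out

x = [(3, 0), (1, 0), (4, 0), (1, 0), (5, 0), (9, 0), (2, 0), (6, 0)]
X = ntt(x)          # "real" input: all imaginary parts are zero
```

The symmetry holds here because w^q = w^(q mod 8) = w^7 = w^(-1);
for N not dividing 2^p, q mod N is no longer -1 and the conjugates
land at scrambled indices.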
Guillermo Ballester Valor wrote:
>
> For a new version of the Glucas program, I am considering the
> possibility of working with M31 = 2^31 - 1 instead of M61. In this
> case M31^2 - 1 is not as smooth as M61^2 - 1, but the use of this
> "small" GF(M31^2) could give us some advantages.
> The modular mul for M31, using 64-bit integers, should be fast
> enough to "fight" against the very fast FPU units. The modular
> add/sub is perhaps faster than the floating-point one. So, complex
> integer arithmetic could be as fast as complex float, a nice
> feature for processors with independent integer-mul and float-mul
> units. If the compiler does a good job interleaving float and
> integer code, we can gain about 15 bits_per_limb at little cost,
> and so we should gain about 50% performance.
Choosing q = M31 may be better on architectures which
lack an instruction to obtain the upper 64 bits of a
128-bit product, but the hardware will still need to be
able to calculate 32 x 32 => 64-bit products quickly.
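To make that concrete, here is a sketch (illustrative Python, not
taken from Glucas or any real kernel) of the M31 modular multiply
built on a full 64-bit product:

```python
# Modular multiply for q = M31 = 2^31 - 1 using only a 64-bit
# product, shifts, masks and adds.  Two folds suffice: the 62-bit
# product drops below 2^32 after one fold and to at most M31 after
# the second.
M31 = (1 << 31) - 1

def mul_m31(a, b):
    """Product of residues a, b <= M31, reduced mod M31."""
    t = a * b                     # full product; fits in 62 bits
    t = (t & M31) + (t >> 31)     # first fold:  t < 2^32
    t = (t & M31) + (t >> 31)     # second fold: t <= M31
    return 0 if t == M31 else t   # canonicalize M31 itself to 0
```

In C this would be one 32x32=>64 multiply plus two shift/mask/add
pairs, which is the cost Guillermo is weighing against the FPU.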
M31 also allows complex operands to be packed into 64-
bit fields, which allows adds to proceed in parallel on
the real and imaginary parts (assuming the hardware has
a 64-bit adder). Multiplies need to use unpacked
(i.e. separate) real and imaginary parts. M31 also needs
quite a few more modular reductions of intermediates
than does M61. M31 should be nice on 32-bit SIMD
hardware such as the Pentium MMX and G4 AltiVec units,
assuming floating-point instructions can keep executing
alongside (this may not be true of the MMX - not sure
about the G4). Assuming balanced-digit representation,
a hybrid floating/M61 scheme allows inputs ~30.5 bits
larger than floating-only, while hybrid floating/M31
allows ~15.5 bits more but requires less overall memory
bandwidth. On
good 64-bit hardware, it could be a toss-up, in which
case allowing either modulus would allow some
optimization of transform length for the number under
test, while still using just power-of-2 lengths and thus
avoiding the data-access-pattern headaches associated
with non-power-of-2 runlengths.
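The packed-operand point above (both components updated by one
64-bit add) can be sketched as follows; the helper names are mine,
purely for illustration:

```python
# Pack (re, im) with re, im < 2^31 into one 64-bit word: re in the
# low 32 bits, im in the high 32 bits.  Because each component is
# below 2^31, a component-wise sum stays below 2^32 and cannot carry
# across the 32-bit boundary, so one 64-bit add updates both halves.
M31 = (1 << 31) - 1
MASK64 = (1 << 64) - 1

def pack(re, im):
    return (im << 32) | re

def unpack(w):
    return w & 0xFFFFFFFF, w >> 32

def packed_add(w1, w2):
    return (w1 + w2) & MASK64     # one add serves both components

def normalize(w):
    """Reduce both halves back below M31 (a real kernel may defer this)."""
    re, im = unpack(w)
    return pack(re % M31, im % M31)

a = pack(M31 - 1, 5)
b = pack(3, M31 - 2)
s = normalize(packed_add(a, b))
```

Multiplies, as noted above, must work on the unpacked halves, since
the cross terms of a complex product mix the two fields.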
Oh, Peter found 6+I to be a primitive root (i.e. of
order q^2-1) for both q = M31 and q = M61. Odd-order
roots over these fields have the nice property of being
purely real; low-order power-of-2 roots (specifically,
roots of order 2, 4 and 8) mirror those of the complex
FFT. For
power-of-2 runlengths, one can also use a nice closed-form expression due to
Creutzburg and Tasche
[1989] for the primitive root of order 2^(p+1) (where
q = 2^p - 1):
r = 2^[2^(p-2)] - I*(-3)^[2^(p-2)] (mod q).
Note that for p > 2, 2^(p-2) is always even, so we can replace the -3 with 3 in the
above formula.
It can also easily be shown that, for odd p, the real part of r is just 2^[(p+1)/2].
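Both claims are easy to sanity-check numerically; the sketch below
(my own, for q = M31) verifies that r has order exactly 2^(p+1),
i.e. r^(2^p) = -1, and that its real part is 2^((p+1)/2) = 2^16:

```python
# Verify the Creutzburg/Tasche root r = 2^(2^(p-2)) - I*3^(2^(p-2))
# (mod q) for q = M31 = 2^31 - 1, p = 31, working in
# GF(q^2) = GF(q)[I]/(I^2 + 1).
p = 31
q = (1 << p) - 1

def cmul(x, y):
    """Multiply (re, im) pairs in GF(q^2)."""
    xr, xi = x
    yr, yi = y
    return ((xr * yr - xi * yi) % q, (xr * yi + xi * yr) % q)

def cpow(x, n):
    """x**n in GF(q^2) by binary exponentiation."""
    r = (1, 0)
    while n:
        if n & 1:
            r = cmul(r, x)
        x = cmul(x, x)
        n >>= 1
    return r

e = 1 << (p - 2)    # 2^(p-2); even for p > 2, so (-3)^e == 3^e
r = (pow(2, e, q), (-pow(3, e, q)) % q)
```

Since r^(2^p) = -1 but no smaller power-of-2 exponent kills r, its
order is exactly 2^(p+1); note also that Re(r)^2 = 2 (mod q), so
the real part is a square root of 2, as the closed form suggests.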
Sorry if this appears a bit cobbled-together: it is, as
I'm at work and lacking time for an elegant exposition.
More later (and not just from me, I'm sure).
-Ernst
_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers