Re: Mersenne: targeting multiply/add (Was: K7 vs. x86)

Peter-Lawrence . Montgomery Tue, 24 Aug 1999 19:30:07 -0700
Ernst Mayer writes:

- Hi, Jason:
- 
- >Or, if you just want to cheapen a complex multiply, turn the standard
- >form 
- >
- >   xr,xi * yr,ri ->   xr*yr-xi*yi, xr*yi+xi*yr
- >
- >into
- >
- >   xr,xi * yr,ri ->   xr*(yr - xi/xr * yi), xr*(yi + xi/xr * yr)
- >
- >and precompute xr and xi/xr. Presto! 2 multiplies and 2 multiply-adds.
- 
- An excellent suggestion - why rely on the compiler perhaps being smarter
- than it really is, when you drop a broad hint with very little modification
- of the normal code?
- 
- I tried the above in just the forward radix-16 FFT loop of my code, replacing
- sines with tangents in the precomputation of the FFT sincos data. No timing
- change on the MIPS R10K, indicating that the MIPSPro f90 compiler was likely
- already doing such a replacement for me. But on the Alpha 21064 (which has
- no MADD instruction) my times for large FFT lengths dropped about 5%!
- (I expect another 5% when I modify the inverse FFT similarly).
- Weird, but welcome. Any ideas how the above replacement might improve
- pipelineability of a twiddle-multiply/add/subtract sequence, assuming just
- FMUL and FADD are available?


    Compared to the straightforward coding, Jason's formula
uses the same instruction count but in a different order.

    Assume yr, yi, and xr, xi (or xr, xi/xr) are in registers, 
and we want the result to go into registers (zr, zi).
On a machine with multiply and multiply-(add/subtract/reverse subtract)
available, the translations might be

        Standard                      Jason
        temp1 = xi*yi                 temp1 = yr - (xi/xr)*yi
        temp2 = xi*yr                 temp2 = yi + (xi/xr)*yr
        zr = xr*xr - temp1            zr = xr*temp1
        zi = xr*yi + temp2            zi = xr*temp2


    Jason's method needs fewer temporaries on a machine with separate
add and multiply instructions.  The same register 
might be used for (xi/xr)*yi, temp1, and zr.  
The standard method needs separate temporaries for xr*yr and xi*yi.

    Jason's method lets the add instructions start earlier, during the
twiddle computation -- the standard method saturates the multipliers.
Jason's method may let the multiplies needed for zr and zi be combined 
with additions later in the code, on architectures with multiply-add.

    When (xr, xi) denotes   cos(theta) + i*sin(theta), 
Jason's method needs tables with cos(theta) and tan(theta).
The standard method lets one use sin(theta) = cos(pi/2 - theta)
to reduce table space for trigonometric functions.

    Jason's method needs special casing if xr might be zero, in which
case (xr, xi) is pure imaginary.

    I am not sure which method has more potential round-off error.

        Peter Montgomery
        [EMAIL PROTECTED]


_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
Re: Mersenne: targeting multiply/add (Was: K7 vs. x86)

Reply via email to