Ernst Mayer writes:
- Hi, Jason:
-
- >Or, if you just want to cheapen a complex multiply, turn the standard
- >form
- >
- > xr,xi * yr,ri -> xr*yr-xi*yi, xr*yi+xi*yr
- >
- >into
- >
- > xr,xi * yr,ri -> xr*(yr - xi/xr * yi), xr*(yi + xi/xr * yr)
- >
- >and precompute xr and xi/xr. Presto! 2 multiplies and 2 multiply-adds.
-
- An excellent suggestion - why rely on the compiler perhaps being smarter
- than it really is, when you drop a broad hint with very little modification
- of the normal code?
-
- I tried the above in just the forward radix-16 FFT loop of my code, replacing
- sines with tangents in the precomputation of the FFT sincos data. No timing
- change on the MIPS R10K, indicating that the MIPSPro f90 compiler was likely
- already doing such a replacement for me. But on the Alpha 21064 (which has
- no MADD instruction) my times for large FFT lengths dropped about 5%!
- (I expect another 5% when I modify the inverse FFT similarly).
- Weird, but welcome. Any ideas how the above replacement might improve
- pipelineability of a twiddle-multiply/add/subtract sequence, assuming just
- FMUL and FADD are available?
Compared to the straightforward coding, Jason's formula
uses the same instruction count but in a different order.
Assume yr, yi, and xr, xi (or xr, xi/xr) are in registers,
and we want the result to go into registers (zr, zi).
On a machine with multiply and multiply-(add/subtract/reverse subtract)
available, the translations might be
Standard Jason
temp1 = xi*yi temp1 = yr - (xi/xr)*yi
temp2 = xi*yr temp2 = yi + (xi/xr)*yr
zr = xr*xr - temp1 zr = xr*temp1
zi = xr*yi + temp2 zi = xr*temp2
Jason's method needs fewer temporaries on a machine with separate
add and multiply instructions. The same register
might be used for (xi/xr)*yi, temp1, and zr.
The standard method needs separate temporaries for xr*yr and xi*yi.
Jason's method lets the add instructions start earlier, during the
twiddle computation -- the standard method saturates the multipliers.
Jason's method may let the multiplies needed for zr and zi be combined
with additions later in the code, on architectures with multiply-add.
When (xr, xi) denotes cos(theta) + i*sin(theta),
Jason's method needs tables with cos(theta) and tan(theta).
The standard method lets one use sin(theta) = cos(pi/2 - theta)
to reduce table space for trigonometric functions.
Jason's method needs special casing if xr might be zero, in which
case (xr, xi) is pure imaginary.
I am not sure which method has more potential round-off error.
Peter Montgomery
[EMAIL PROTECTED]
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers