On Sun, 26 Sep 1999 [EMAIL PROTECTED] wrote:

> I'm not too surprised at this. Since my code appears to be faster than
> FFTW on most high-end CPUs, that tells me that FFTW is probably optimized
> more for the x86 (very few FP registers) than mine, which is geared toward
> hardware with at least 32 FPR's. Normally this means that one uses smaller
> complex FFT radices (say, 4 or 8) on machines with 8-16 FPR's, but Jason
> Papadopoulos tells me that FFTW uses radices as high as 32. Perhaps they
> use conditional compilation to decide what radices to use depending on the
> underlying hardware, and use smaller radices on x86. (Any insights, Jason?)
> 
> If they do use radices >= 8 on x86, they are probably arranging the code
> to minimize register pressure - this could be worth looking at.

FFTW uses a recursive FFT, and includes code that performs a single
radix-n pass for lots of n (all powers of two up to 64, and other
small radices like 3,5,7,9,10,...). You tell it to build you an FFT
of a certain size, and it picks the combination of radices that solve
your problem in minimum time (it uses dynamic programming, finding the
fastest combination for small sizes and then using them as building 
blocks for larger sizes). The combination is encoded in a "plan",
and the FFTW executor reads this when you want to do the FFT for real.

On x86, for medium-size transforms FFTW likes radix-8 a lot. It overflows
the pathetic FPU stack quickly but it also cuts the problem size very
quickly, and so turns out to be a net win over smaller radices.

jasonp

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

Reply via email to