Hi,
At 07:45 AM 11/28/00 +0000, Brian J. Beesley wrote:
>Surely the latency is much less relevant than the throughput,
>provided that there are sufficient registers and pipeline entries.
True. The manuals are unclear as to how many XMM registers are
available. There are 8 user visible ones and an unknown number
of internal ones that you use via the P4's register renaming feature.
>The idea is to keep the execution units busy ...
I just finished designing the new code for one of the easier problems -
the rounding and carry propagation step after the inverse FFT.
It looks like the P4 can do this in 6.5 clocks per double vs. about
19 clocks per double on the P3. This doesn't count any benefits
from prefetching data.
It also seems that the multiply units are idle enough in the new
code that I can continue to use the compact two-to-fractional-power
arrays, rather than two full 5MB arrays (640K doubles at 8 bytes each).
Dropping those two arrays saves 10MB, so yesterday's question of
33MB max is probably 23MB max for a 640K FFT. That's good, because
I'd feel guilty using up 33MB. Heck, I feel guilty using 23MB.
I don't expect the above savings (3 times fewer clocks) to hold for
the FFT code. The old FFT code was much better pipelined than the old
carry propagation code.
>Bear in mind that there may be a severe penalty for switching between
>SSE2 and x86 FPU formats.
There is no penalty on the P4. The XMM registers are separate from the
FPU registers, unlike the MMX registers, which overlap them.
Having fun,
George