Hi again,
One correction to my previous post. I said that the latency to
access the L1 data cache was 2 clocks. This is correct for integer
instructions only. For floating point and SSE2 instructions the latency
is 6 clocks! Interestingly, the L2 cache latency is 7 clocks for both
integer and floating point instructions.
At 11:25 PM 11/26/00 +0100, [EMAIL PROTECTED] wrote:
> I understand that the SSE2 instructions operate only on
>64-bit (and 32-bit) floating point data, whereas the
>FPU registers support 80-bit intermediate results.
>How will the loss of precision affect the FFT length?
Brian's post is correct. This will have a minor effect on the maximum
exponent each FFT size can handle. The effect is minor because prime95
already converts FFT data to 64-bit format whenever it stores it in
memory, so the 80-bit precision only helps intermediate results that
stay in registers.
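To give a feel for how little is lost at each store, here is a minimal
C sketch (not prime95 code). It assumes the compiler maps long double
to the 80-bit x87 format, which is compiler and platform dependent;
where long double is only 64 bits it simply reports zero. The name
store_rounding_loss is made up for illustration.

```c
/* Returns what is lost when an extended-precision value is rounded to
   a 64-bit double, as happens when a register result is stored to
   memory in 64-bit format.
   Assumption: long double is wider than double (e.g. x87 80-bit).
   If it is not, the result is always 0. */
double store_rounding_loss(long double x)
{
    double stored = (double)x;                 /* the 64-bit value kept in memory */
    return (double)(x - (long double)stored);  /* the discarded low-order bits */
}
```

For a value such as 1.0L/3.0L the loss is at most half a unit in the
last place of the 64-bit result, and it is zero for anything exactly
representable in 64 bits.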
> Vector processors such as the Cray typically support both
> vector op vector -> vector
> vector op scalar -> vector
>I find it strange that the MMX and XMM and SSE2 instruction sets
>lack vector*scalar operations and also lack a way to make multiple
>copies of the constant 4.0, other than to store multiple copies
>in memory. While data replication in memory may be acceptable here,
>we don't want multiple copies of the table of roots of unity,
>for example.
There may be no alternative to multiple copies of the roots of unity.
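To make the memory cost concrete, here is a minimal sketch in plain C
of what "multiple copies" could look like, assuming a register that
holds two 64-bit values as in SSE2. Plain arrays stand in for the
128-bit loads, and the names replicate_table and mul_lanes are
hypothetical. This is not prime95's actual table layout.

```c
#include <stddef.h>

/* Store every scalar table entry twice, side by side, so that one
   two-lane load of out[2*k] would yield the same value in both lanes.
   The replicated table needs 2*n doubles: twice the memory of src. */
void replicate_table(const double *src, size_t n, double *out)
{
    for (size_t k = 0; k < n; k++) {
        out[2 * k]     = src[k];
        out[2 * k + 1] = src[k];
    }
}

/* Lane-wise multiply of two data values by one replicated entry.
   This is the "vector op vector" form standing in for the missing
   "vector op scalar" form. */
void mul_lanes(double x[2], const double entry[2])
{
    x[0] *= entry[0];
    x[1] *= entry[1];
}
```

A table of n entries grows from n to 2*n doubles under this scheme,
which is where the extra memory would come from.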
A question for readers: Prime95 currently uses about 8MB (for an
exponent around 11 million). How would you feel if the P4-optimized
version used 13MB? 23MB? 33MB?
I would hate to code up three versions of the P4 code (small, medium,
large), but it might be necessary.
Regards,
George
_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.exu.ilstu.edu/mersenne/faq-mers.txt