Hi again,

One correction to my previous post.  I said that the latency to
access the L1 data cache was 2 clocks.  This is correct for integer
instructions only.  For floating point and SSE2 instructions the latency
is 6 clocks!  Interestingly, the L2 cache latency is 7 clocks for both
integer and floating point instructions.

At 11:25 PM 11/26/00 +0100, [EMAIL PROTECTED] wrote:
>      I understand that the SSE2 instructions operate only on
>64-bit (and 32-bit) floating point data, whereas the
>FPU registers support 80-bit intermediate results.
>How will the loss of precision affect the FFT length?

Brian's post is correct.  This will have a minor effect on the maximum
exponent each FFT size can handle.  The effect is minor because
prime95 already stores FFT data in memory, which converts it to 64-bit
format.  The 80-bit precision only ever applied to intermediate results
held in the FPU registers.
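
To make the store-to-memory rounding concrete, here is a minimal C sketch.
It is not prime95 code; the function names are made up for illustration,
and it assumes long double maps to the x87 80-bit format, which is true
for many x86 compilers but not all.  Any extra bits an intermediate
carries are dropped as soon as the value is written to a 64-bit slot.

```c
#include <float.h>

/* Mantissa bits in a 64-bit double: 53 on IEEE 754 systems. */
static int double_mantissa_bits(void) { return DBL_MANT_DIG; }

/* Mantissa bits in long double: 64 where it maps to the x87 80-bit
   format (an assumption; other platforms may differ). */
static int long_double_mantissa_bits(void) { return LDBL_MANT_DIG; }

/* Hypothetical helper: simulate writing an extended-precision
   intermediate result out to a 64-bit slot in memory. */
static double store_to_memory(long double intermediate) {
    volatile double slot = (double)intermediate;  /* rounds to 53 bits */
    return slot;
}
```

For example, store_to_memory(1.0L + 0x1p-60L) comes back as exactly 1.0,
because a 2^-60 contribution is below what a 53-bit mantissa can hold
next to 1.0.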

>     Vector processors such as the Cray typically support both
>               vector op vector -> vector
>               vector op scalar -> vector
>I find it strange that the MMX and XMM and SSE2 instruction sets
>lack vector*scalar operations and also lack a way to make multiple
>copies of the constant 4.0, other than to store multiple copies
>in memory.  While data replication in memory
>may be acceptable here,
>we don't want multiple copies of the table of roots of unity,
>for example.

There may be no alternative to multiple copies of the roots of unity.
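
A rough C sketch of what "multiple copies" means in practice, assuming
the code uses only lane-wise packed multiplies.  The vec2 type and the
function names are hypothetical stand-ins for a 128-bit register holding
two doubles; this is not the actual prime95 table layout.

```c
#include <assert.h>
#include <stddef.h>

/* Two-lane packed value, standing in for one 128-bit register. */
typedef struct { double lane[2]; } vec2;

/* Lane-wise multiply: the only multiply this sketch assumes. */
static vec2 vec2_mul(vec2 a, vec2 b) {
    vec2 r = { { a.lane[0] * b.lane[0], a.lane[1] * b.lane[1] } };
    return r;
}

/* Expand a compact table of n scalars into n packed entries, each
   holding the same scalar in both lanes.  Twice the memory. */
static void duplicate_table(const double *compact, vec2 *packed, size_t n) {
    assert(compact != NULL && packed != NULL);
    for (size_t i = 0; i < n; i++) {
        packed[i].lane[0] = compact[i];
        packed[i].lane[1] = compact[i];
    }
}

/* Scale two data values by table entry i with one lane-wise multiply. */
static vec2 scale_pair(vec2 data, const vec2 *packed, size_t i) {
    return vec2_mul(data, packed[i]);
}
```

With both lanes pre-filled, a vector*scalar product becomes an ordinary
vector*vector multiply, at the cost of a table twice as large.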

A question for readers.  Prime95 currently uses about 8MB of memory
for exponents around 11 million.  How would you feel if the
P4-optimized version used 13MB?   23MB?   33MB?

I hate to code up 3 versions of the P4 code (small, medium, large),
but it might be necessary.

Regards,
George

_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.exu.ilstu.edu/mersenne/faq-mers.txt
