On 26 Nov 00, at 23:25, [EMAIL PROTECTED] wrote:

[... snip ...]
>      I understand that the SSE2 instructions operate only on
> 64-bit (and 32-bit) floating point data, whereas the 
> FPU registers support 80-bit intermediate results.
> How will the loss of precision affect the FFT length?

We have good experimental data from several other processor 
architectures which use only IEEE 64-bit floating point; for example, 
the FFT size breaks in Mlucas are a bit smaller than in Prime95:

256K: Mlucas  5,150,000; Prime95  5,250,000
320K: Mlucas  6,380,000; Prime95  6,515,000
384K: Mlucas  7,580,000; Prime95  7,730,000
448K: Mlucas  8,840,000; Prime95  9,020,000
512K: Mlucas 10,110,000; Prime95 10,320,000

Mlucas also checks for excess rounding error every iteration, so it 
is obvious if an over-aggressive run length is being used.
My experience so far on the Alpha 21164 is that the Mlucas size breaks 
are OK, though I'm aware that other people have run into problems. 
Note that the math libraries supplied externally to Mlucas can vary 
in accuracy, as can the effects of different compilers and of detailed 
differences in FPU hardware implementations.

I suggest most strongly that we do not reduce the size breaks in
Prime95 to suit the reduced accuracy that is inevitable with an SSE2
implementation.

I think the best way to proceed is to keep Prime95 using x86 floating-
point architecture (which _will_ run on a P4, subject to a probable loss
of efficiency) and have a new program with a different name using SSE2
floating-point throughout (which will obviously not run except on a P4).

Obviously the look, feel & PrimeNet interface of the new program 
could & should be the same! This implies that the extra cost of 
producing a new program should be negligible in comparison with the 
enormous effort involved in the creation, debugging & optimization of the
SSE2 FFT code.

This approach also facilitates arranging double-checking so that 
residues computed using x86 could be cross-checked using SSE2 and 
vice versa, should this be thought desirable.

>     Vector processors such as the Cray typically support both
> 
>               vector op vector -> vector
>               vector op scalar -> vector
> 
> opcodes, so one can (for example) form all b[i]^2 - 4.0*a[i]*c[i]
> when solving several quadratic equations.
> [We need two vector*vector multiplications, 
> one vector*scalar multiplication, 
> one vector-vector subtraction.]
> I find it strange that the MMX and XMM and SSE2 instruction sets
> lack vector*scalar operations and also lack a way to make multiple
> copies of the constant 4.0, other than to store multiple copies
> in memory.  While data replication in memory 
> (or adding a[i]*c[i] to itself twice) may be acceptable here, 
> we don't want multiple copies of the table of roots of unity,
> for example.

The omissions are indeed obvious and regrettable. I guess that the 
Intel engineers found them impractical to implement for some reason.
Either that, or they're saving up something for the Pentium 5 ...?


Regards
Brian Beesley
_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.exu.ilstu.edu/mersenne/faq-mers.txt
