On 27 Nov 00, at 23:05, George Woltman wrote:

>          One correction to my previous post.  I said that the latency to
> access the L1 data cache was 2 clocks.  This is correct for integer
> instructions only.  For floating point and SSE2 instructions the latency
> is 6 clocks!  Interestingly, the L2 cache latency is 7 clocks for both
> integer and floating point instructions.

Surely the latency is much less relevant than the throughput, 
provided that there are sufficient registers and pipeline entries.
The idea is to keep the execution units busy ...
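
To illustrate the point about latency and throughput, here is a
minimal C sketch, not anything taken from Prime95. The function names
and the choice of four accumulators are my own assumptions. In the
first loop every add waits on the previous one, so the loop is bound
by latency; in the second the four adds in each iteration are
independent, so a pipelined unit can overlap them, provided there are
registers to hold the partial sums.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: each add depends on the result of the previous
   add, so the loop can go no faster than the add latency allows. */
double sum_single_chain(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators (the count is an assumption for illustration):
   the adds within one iteration are independent of each other, so
   they can be in flight at the same time. */
double sum_four_chains(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)      /* leftover elements, fewer than four */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Floating-point addition is not associative, so on general data the
two functions may differ in the last few bits; that is also why a
compiler will not normally make this transformation by itself.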

The Pentium 4 architecture has a deeper pipeline than the old PPro 
architecture used up to Pentium III. This allows higher clock speeds, 
but it also makes each mispredicted branch more expensive, so good 
branch prediction matters more. It also has a new feature, the 
"Execution Trace cache", which stores up to 12K micro-ops in a 
ready-decoded format, meaning they can be re-used without calling on 
the main decoder. These features should help throughput - though 
whether this is sufficient to offset the smaller L1 cache (especially 
in comparison with Athlon) is arguable, and may well be application 
dependent.

Presumably we get these performance enhancements whether we choose 
to run SSE2 or stick with x87 FPU code.

Bear in mind that there may be a severe penalty for switching between 
SSE2 and x87 FPU formats. To avoid this, 100% conversion to SSE2 is 
indicated. The assembler code is transparent in this respect, but 
there may be some floating-point code generated by the C compiler 
which would need to be attended to.

> A question for readers.  Prime95 currently uses about 8MB (exponent
> around 11 million).  How would you feel if the P4 optimized version
> used 13MB?   23MB?   33MB?
> 
> I hate to code up 3 versions of the P4 code  (small, medium, large),
> but it might be necessary.

I may not be fully up to speed on Intel's marketing policy, but I 
gathered that Pentium 4 was targeted at high-end systems - which we 
could expect to have "adequate" memory installed. The idea was that 
Pentium 3 was to be retained for the lower end of the consumer 
market. Bearing in mind that the existing "small memory footprint" 
PPro code will execute on a Pentium 4 (albeit probably with a 
significant performance penalty), I personally doubt that it's worth 
worrying overmuch about the memory footprint of the P4 code - 
especially if my suggestion is adopted that the P4 version be made 
available as a separate program, rather than being implemented as a 
parallel instruction stream in Prime95.

The problem with this argument is that the 850 chipset currently 
required to support P4 only supports Rambus (RDRAM). Ignoring the 
performance argument, RDRAM is still very expensive compared with 
SDRAM, so it is possible that P4 systems will initially be 
delivered with rather less memory than would be expected in a top-end 
system.

I feel we should also bear in mind that the people who read this list 
are probably the more enthusiastic users - those who run Prime95 as a 
background activity are less likely to put up with a memory hog than 
those of us who run number-crunchers which double as general-purpose 
PCs!
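
If several footprints do turn out to be necessary, one way to keep
background users happy would be to let them cap the memory and pick
the largest variant that fits. The following is only a sketch under
that assumption: the 13, 23 and 33MB figures come from George's
question, while the variant names, the `pick_variant` function and
the idea that a larger footprint is the faster one are hypothetical.

```c
#include <stddef.h>

typedef struct {
    const char *name;          /* hypothetical label */
    unsigned    footprint_mb;  /* approximate memory use in MB */
} variant;

/* Footprints from the question above, ordered smallest to largest. */
static const variant p4_variants[] = {
    { "small",  13 },
    { "medium", 23 },
    { "large",  33 },
};

/* Return the largest variant whose footprint fits within limit_mb,
   or NULL if none fits; the caller would then fall back to the
   existing small-footprint code. */
const variant *pick_variant(unsigned limit_mb)
{
    const variant *best = NULL;
    size_t count = sizeof p4_variants / sizeof p4_variants[0];
    for (size_t i = 0; i < count; i++) {
        if (p4_variants[i].footprint_mb <= limit_mb)
            best = &p4_variants[i];
    }
    return best;
}
```

With a cap of 30MB this would select the 23MB variant; with a cap of
8MB it returns NULL and the existing code would be used instead.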


Regards
Brian Beesley
_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
