On 27 Nov 00, at 23:05, George Woltman wrote:
> One correction to my previous post. I said that the latency to
> access the L1 data cache was 2 clocks. This is correct for integer
> instructions only. For floating point and SSE2 instructions the latency
> is 6 clocks! Interestingly, the L2 cache latency is 7 clocks for both
> integer and floating point instructions.
Surely the latency is much less relevant than the throughput,
provided that there are sufficient registers and pipeline entries.
The idea is to keep the execution units busy ...
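To illustrate the latency/throughput point, here is a minimal C sketch (not taken from Prime95; the function names are hypothetical). It sums an array first with one dependency chain, then with four independent accumulators. In the second form several adds can be in flight at once, so a long latency is overlapped rather than paid on every element, provided there are enough registers.

```c
#include <stddef.h>

/* One dependency chain: each add must wait for the result of the
   previous add, so the loop is limited by the add latency. */
double sum_one_chain(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent chains: the adds into s0..s3 do not depend on each
   other, so a pipelined unit can overlap them and the loop can run
   closer to the throughput limit than to the latency limit. */
double sum_four_chains(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    /* Handle the 0..3 leftover elements. */
    for (; i < n; i++)
        s0 += x[i];

    return (s0 + s1) + (s2 + s3);
}
```

The two functions add the elements in a different order, so for general floating-point data the results may differ in the last few bits. How much the second form gains depends on the processor and the compiler, so it would have to be measured.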
The Pentium 4 architecture has a deeper pipeline than the old PPro
architecture used up to Pentium III, which should allow higher clock
speeds, although each mispredicted branch then costs more clocks, so
good branch prediction matters more. It also has a new feature, the "Execution
Trace cache", which stores up to 12K micro-ops in a ready-decoded
format, meaning they can be re-used without calling on the main
decoder. These features should help throughput - though whether this
is sufficient to offset the smaller L1 cache (especially in
comparison with Athlon) is arguable, and may well be application
dependent.
Presumably we get these performance enhancements whether we choose
to run SSE2 or stick with x87 FPU code.
Bear in mind that there may be a severe penalty for switching between
SSE2 and x87 FPU formats. To avoid this, 100% conversion to SSE2 is
indicated. What the assembler code does is plain to see, but the C
compiler may also generate some x87 floating-point code, which would
need to be attended to.
> A question for readers. Prime95 currently uses about 8MB (exponent
> around 11 million). How would you feel if the P4 optimized version
> used 13MB? 23MB? 33MB?
>
> I hate to code up 3 versions of the P4 code (small, medium, large),
> but it might be necessary.
I may not be fully up to speed on Intel's marketing policy, but I
gathered that Pentium 4 was targeted at high-end systems - which we
could expect to have "adequate" memory installed. The idea was that
Pentium 3 was to be retained for the lower end of the consumer
market. Bearing in mind that the existing "small memory footprint"
PPro code will execute on a Pentium 4 (albeit with a probably
significant performance penalty), I personally doubt that it's worth
worrying overmuch about the memory footprint of the P4 code -
especially if my suggestion is adopted that the P4 version be made
available as a separate program, rather than being implemented as a
parallel instruction stream in Prime95.
The problem with this approach is that the 850 chipset currently
required to support P4 only supports Rambus (RDRAM). Ignoring the
performance argument, RDRAM is still very expensive compared with
SDRAM, so it is possible that P4 systems will initially be
delivered with rather less memory than would be expected in a top-end
system.
I feel we should also bear in mind that the people who read this list
are probably the more enthusiastic users - those who run Prime95 as a
background activity are less likely to put up with a memory hog than
those of us who run number-crunchers which double as general-purpose
PCs!
Regards
Brian Beesley