On 9 Apr 00, at 20:51, Stefan Struiker wrote:
> Working on nearly the same exponent, 9.7 million, I notice that
> our PIII 500 takes only 119.5 million clocks per iteration, whereas
> our 733EB requires 124.8 million to do the same.
The PIII 500/100 MHz FSB system equates to a PIII 666/133 MHz FSB
system, i.e. they _should_ take the same number of clocks. With a
higher multiplier, you will be waiting for extra clocks to access
data when it isn't in one of the caches. Even a PIII 550 in the same
MB as the PIII 500 would take _some_ extra clocks, though not enough
to cancel out the extra CPU speed. Basically the point is that, with
a lower instruction & operand fetch rate relative to the execution
unit performance, the CPU is going to spend more of its time waiting
for the pipeline to furnish new instructions and/or data values to
work on.
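A rough sketch of that arithmetic in Python. The clock counts and
bus speeds are the ones quoted in this thread; the stall model (a
memory access costs a fixed number of bus cycles, with 10 picked
arbitrarily) is purely an illustrative assumption, not a measurement.

```python
# Figures quoted in this thread (exponent near 9.7 million).
PIII_500 = {"core_mhz": 500.0, "fsb_mhz": 100.0, "clocks_per_iter": 119.5e6}
PIII_733EB = {"core_mhz": 733.0, "fsb_mhz": 133.0, "clocks_per_iter": 124.8e6}


def multiplier(cpu):
    """Core clock divided by front-side bus clock."""
    return cpu["core_mhz"] / cpu["fsb_mhz"]


def seconds_per_iteration(cpu):
    """Wall-clock time for one iteration at the given core speed."""
    return cpu["clocks_per_iter"] / (cpu["core_mhz"] * 1e6)


def stall_clocks(cpu, bus_cycles):
    """Core clocks spent waiting on one memory access.

    Assumption for illustration only: the access costs a fixed
    number of bus cycles, so the wait grows with the multiplier.
    """
    return bus_cycles * multiplier(cpu)


if __name__ == "__main__":
    for name, cpu in (("PIII 500", PIII_500), ("PIII 733EB", PIII_733EB)):
        print(
            f"{name}: multiplier {multiplier(cpu):.2f}, "
            f"{seconds_per_iteration(cpu):.4f} s/iteration, "
            f"{stall_clocks(cpu, 10):.1f} core clocks per 10-bus-cycle access"
        )
```

On the quoted figures the 733EB still finishes an iteration in about
0.170 s against 0.239 s for the 500, even though it needs more clocks
to do it.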
The smaller L2 cache probably doesn't help, even though it is running
faster. Memory bandwidth is still wasted whenever there is an L2
cache miss, which is going to happen more often with a
smaller cache. However, this may not be too serious - 128K L2 cache
Celerons perform quite well, and L2 cacheless Celerons are nowhere
near as bad as you might suspect!
The chipset has a big effect, too. All systems using 133 MHz FSB run
the memory asynchronously to the system bus. One would suspect that
the hardware required to resynchronize might have a bad effect on the
throughput. The way my PIII 533B / TMC TI6NBF+ (VIA chipset) system
performs, using 133 MHz FSB & memory, leads me to suspect that
there's a 100 MHz bottleneck somewhere in the memory subsystem!
>
> The 733 has 256K full-speed on-chip cache and runs 256MB of RDRAM
> (RAMBUS), 133MHz bus and was not so cheap.
Wow, 256 MB RDRAM certainly _isn't_ cheap - I could buy a whole PIII
500 system with the money I'd save if I bought 256 MB SDRAM instead!
Seriously, though, systems with more physical memory have bigger page
tables, which can't help causing more system faults. Grossly
excessive amounts of physical memory hurt performance (a bit) rather
than help it. Depending on the OS you're running, how the program
fits into the memory available can make a big difference too. Using
Win 9x or NT WS 4.0, you can get up to 5% variation in iteration time
_on the exact same hardware, on the exact same exponent, without even
rebooting the system_. Starting Prime95 using my ReCache program
helps the OS to load Prime95 efficiently, especially on systems with
more than adequate physical memory. Win 2000 seems to manage well
enough without ReCache. Linux systems don't seem to be bothered;
their run speed is consistent to within 1%, but adding extra memory
does still cause a modest slowdown.
>
> An Athlon of my acquaintance does all of this in 114.6 million clocks, but
> that's another story, because the Athlon is supposed to decode more
> instructions per cycle, and generally ace it in the floating-point
> department..
I have an Athlon 650 and yes, it flies - it's turning in performance
which equates to a PIII 733 extrapolated from George's benchmarks.
There are other factors than the "improved" CPU: the L1 cache is 32
KB instead of 16 KB, & I have a sneaking suspicion that the LL testing
code manages to run with at least the vast majority of instruction
fetches being satisfied from L1, which reduces the cycles lost to L1
cache misses. And the internal bus is 128 bits wide instead of 64
bits, so
you get more data shifted internally between CPU/L1 cache and (512
KB) L2 cache. This more than compensates for the lower L2 cache
speed.
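For anyone who wants to turn clocks per iteration into an
"equivalent MHz" figure, here is a minimal Python sketch. The two
clock counts are the ones quoted in this thread. The 650 MHz clock
speed is an assumption on my part: the thread doesn't say what speed
the Athlon that produced the 114.6 million figure runs at, so treat
the result as illustrative only.

```python
def equivalent_mhz(cpu_mhz, cpu_clocks_per_iter, ref_clocks_per_iter):
    """Clock speed a reference CPU would need to match cpu's iteration rate.

    Both CPUs do one iteration in the same wall-clock time when
    cpu_clocks_per_iter / cpu_mhz == ref_clocks_per_iter / ref_mhz.
    """
    return cpu_mhz * ref_clocks_per_iter / cpu_clocks_per_iter


# Clock counts quoted in this thread.
ATHLON_CLOCKS_PER_ITER = 114.6e6
PIII_733EB_CLOCKS_PER_ITER = 124.8e6

# ASSUMPTION: the quoted Athlon figure is applied to a 650 MHz part.
ASSUMED_ATHLON_MHZ = 650.0

if __name__ == "__main__":
    mhz = equivalent_mhz(
        ASSUMED_ATHLON_MHZ, ATHLON_CLOCKS_PER_ITER, PIII_733EB_CLOCKS_PER_ITER
    )
    print(f"Equivalent PIII (at 733EB clocks/iteration): {mhz:.0f} MHz")
```

Under that assumption the answer comes out a little over 700 MHz,
using the 733EB's clock count as the reference.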
>
> Can anyone explain the Pentium discrepancy?
Not entirely, but I believe the points I've outlined above go some
way towards explaining it.
Just out of interest, which MB are you using for the system with
RDRAM in it, and what's the spec of the RDRAM itself?
Regards
Brian Beesley