Jeff,

You never stated which motherboard chipset/northbridge is in your system.  I
have an essentially identical system (Nvidia 680i), but with the CPU at x12
(3200MHz), air cooled, and I get substantially better results.  The Corsair
Dominator modules I have run well at FSB 1066/linked, with 2.2V for the
memory banks.  Unlinked memory speeds are measurably slower unless they hit
a "sweet" timing ratio with the FSB.

What will help a lot, if you have not done it already (the mailing list is
silent on this issue), is to set "CPU affinity" for each of your P95 threads
(you can do this from Task Manager with a right click).

anyhow,
Paul

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Jeff Woods
Sent: Friday, December 21, 2007 9:38 AM
To: The Great Internet Mersenne Prime Search list
Subject: [Prime] Benchmark question re: multiple cores

Hello All,

My second (and hopefully last) question after returning to the GIMPS 
fold. I've never unsub'd from the mailing list, though, and haven't seen 
this question come across....

I have a Quad Core QX6700, which is my first multi-core system, and my 
first overclocked system. It has just finished up P-1 stage 2 factoring 
and begun LL tests on each of its cores.

http://www.mersenne.org/bench.htm

The benchmark page above says that a Core 2 Quad QX6700 should be 
running about 0.0569 per iteration on a 2560K FFT. I assume that's at 
default clocks.

My system seems to be running at less than half that speed, with iteration 
times between 0.105 and 0.131 depending on the core, and I can't figure 
out why. What should take 24 days is going to take 60 days at this rate.
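
As a rough check on those figures, here is a minimal Python sketch that turns the iteration times above into a slowdown factor and a projected run time. It assumes total run time scales linearly with per-iteration time; the function names are made up for illustration, and the 24-day baseline is simply the estimate quoted above.

```python
# Figures quoted in the message above (same units as the benchmark page).
BENCHMARK_TIME = 0.0569          # benchmark page, QX6700, 2560K FFT
OBSERVED_TIMES = (0.105, 0.131)  # fastest and slowest core observed
BASELINE_DAYS = 24               # expected duration at the benchmark rate


def slowdown(observed, benchmark=BENCHMARK_TIME):
    """Return how many times slower the observed iteration time is."""
    return observed / benchmark


def projected_days(observed, baseline_days=BASELINE_DAYS):
    """Projected duration, assuming run time scales with iteration time."""
    return baseline_days * slowdown(observed)


for t in OBSERVED_TIMES:
    print(f"{t:.3f}: {slowdown(t):.2f}x slower, ~{projected_days(t):.0f} days")
```

On these numbers the cores come out roughly 1.8x to 2.3x slower than the benchmark figure.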

The following may be irrelevant, but I'm wondering....

The native 2.66 GHz CPU is overclocked to 3.47 GHz, with Vista 64 (and 
Prime95's 64-bit version) so I can have full access to all 4 GB of 
very fast RAM. The RAM is Corsair Dominator PC2-8500 (800 MHz native, 
5-5-5-18-2T) and for extra memory bandwidth I've overclocked the RAM to 
1000 MHz. The CPU overclocking is all done via multiplier (10X -> 13X) 
and voltage, and the RAM overclock is done by configuring the 
motherboard's RAM speed to be unlinked from the FSB, then manually 
increasing the RAM speed to its max stable level of 1000 MHz. I did not 
alter the FSB speed at all, since my RAID controller doesn't like FSB 
speed tweaks. Nor did I alter the memory timings, leaving them at the 
native 5-5-5-18-2T. The system is sufficiently cooled (liquid, 
sustaining 59C-63C at 100% load on all 4 cores) and passed overnight 
torture testing without throwing up errors.

With the above tweaks, the system throws up 12500 3DMarks, and scores an 
overall 5.9 on Vista's Experience Rating (the best one can score). It 
was scoring 5.8 prior to my increase of the memory speed from 800 MHz to 
1000 MHz, so my bump should have boosted memory bandwidth. Alas, 
matching 1:1 and running the memory at 1066 MHz (the native FSB speed of 
the system) is not stable. I can't quite push the memory that fast -- 
1000 MHz is where stability tops out. At 1066 MHz memory speed, I start 
getting rounding errors after 3-4 hours of torture testing.

Now, here's what I'm wondering. Is it possible that the source of these 
slower benchmarks is that tiny discrepancy between the FSB speed and the 
RAM speed? Would there be timing delays in running the memory just 
SLIGHTLY slower than the FSB that P95 doesn't much like due to its use 
of RAM for the lookup tables?

Or, is it simpler than that? Am I perhaps bumping up against L2 cache 
thrashing, something that might be common on Intel multi-core machines 
that share a single L2 cache like this? (Nehalem, where art thou?) Is 
the benchmark listed simply what one core would do if it had exclusive 
use of the L2 cache while other cores were idle, and the slower iteration 
times are to be expected when all four cores are each working on their 
respective exponents, contending for a single L2 cache?

This last theory seems to be supported by the fact that I paused all but 
one core and let it iterate on that one core with near-exclusive use of 
the otherwise idle system, and the iteration times fell dramatically, to 
0.050 sustained and "best time" from the Options->Benchmark menu of 
0.048. That's more in line with what I'd expect, given the benchmark 
pages showing a stock clocked QX6700 cranking at 0.0569, combined with 
my overclock.
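
To put a number on "what I'd expect", a small Python sketch follows. It assumes iteration time scales inversely with CPU clock, which is an idealisation: a multiplier-only overclock does nothing for memory bandwidth. The function name is hypothetical and the inputs are the figures quoted in this message.

```python
STOCK_GHZ = 2.66              # native clock
OVERCLOCK_GHZ = 3.47          # overclocked clock
STOCK_TIME = 0.0569           # benchmark page figure at stock clock
SINGLE_CORE_OBSERVED = 0.050  # sustained time with the other cores paused


def ideal_time(stock_time, stock_ghz, new_ghz):
    """Iteration time if speed scaled perfectly with clock (an assumption)."""
    return stock_time * stock_ghz / new_ghz


expected = ideal_time(STOCK_TIME, STOCK_GHZ, OVERCLOCK_GHZ)
# Fraction by which the observed time exceeds the idealised one.
shortfall = SINGLE_CORE_OBSERVED / expected - 1

print(f"ideal: {expected:.4f}, observed: {SINGLE_CORE_OBSERVED:.4f}, "
      f"shortfall: {shortfall:.0%}")
```

Under that assumption the ideal figure is about 0.0436, so the 0.050 sustained single-core time is within roughly 15% of it.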

So, to cut to the chase, the benchmarks seem to be geared to exclusive 
use of the L2 cache, but in the real world that's not how I'd imagine 
most GIMPS users run P95 on a multi-core system. If my hypothesis is 
correct, wouldn't it be better to post separate benchmarks for "one core 
in use" versus "all cores in use", so that people's expectations aren't 
skewed by a benchmark table that doesn't represent typical use?

Any guidance or experience would be welcome.

Thanks!

Jeff Woods
Reading, PA



_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime
