On Friday 15 September 2006 00:07, Jeff Gilchrist wrote:
>
> A couple of observations.  Right now I am running the second
> verification of M44 on a shared machine (now 92% complete).  The
> software greatly slows down if someone else launches a process that
> starts using up CPU time on the same processor as a Glucas thread.

Makes sense to me. Nothing can execute without consuming CPU cycles, so if you're 
trying to fully utilise a system, you're bound to lose some cycles whenever 
another process (or a system interrupt) intervenes. The problem is made worse 
because the other activity tends to evict useful data from the caches.

> The wait for all threads to finish can sometimes bring the overall
> software to a crawl because the OS has scheduled multiple
> threads/processes on the same CPU.

Yes, that seems logical, though the more advanced scheduling algorithms in 
e.g. Linux 2.6 should at least help prevent a severe imbalance from 
persisting.
>
> So for a dedicated Prime95 system, this shouldn't be a problem, but on
> a typical system if the user is running multiple processes that use a
> non-trivial amount of CPU time it could cause major slowdowns in the
> software.  Tony's idea of leaving 1 or 2 cores for the OS to schedule
> other jobs on with a larger system could make a big difference.

On a multipurpose system, yes. It's clearly a waste on a system that's rarely 
(if ever) doing anything else. A first approach is not to even try to use the 
"virtual" processors in a CPU supporting hyperthreading; leave those for 
background activity.
>
> The other thing you have to be careful with when doing multi-threaded
> programming is dynamic memory allocation.  I have no idea what Prime95
> currently uses or plans on using but I found with my thesis work that
> some algorithms were actually slower when parallelized because of
> memory allocation blocking.  The standard malloc/new libraries that
> most people have/use on their systems treat memory allocation as an
> atomic operation so that no two threads try to allocate the same range in
> memory simultaneously.  If you do a lot of memory allocation in
> threads, the overall process can be greatly slowed down by threads
> blocking on memory allocation calls.  Just something to keep in mind
> that could affect performance.

There's another effect here which might get overlooked - and which appears to 
differ between Intel and AMD systems. Intel multicore processors seem to have 
an L1 cache dedicated to each core but share L2 between them. By contrast, 
AMD multicore processors' caches seem to be dedicated at both L1 and L2; the 
sharing is done in the memory controller.

Either way there's at least a potential bottleneck - between L1 and L2 on 
Intel systems, and between L2 and RAM on AMD systems. I believe this 
bottleneck may dominate threaded LL test performance, just as memory 
bandwidth does on single-CPU systems when it is less than the CPU can demand.

Running independent monothreaded streams on a multi-CPU system has the 
disadvantage of increasing the memory load, so we should always be able to 
gain something in total system throughput from a multithreaded approach - but 
whether this gain will ever be worthwhile is (IMO) dubious.

Despite which, the ability to complete a single test rapidly (at the expense 
of wasting some CPU cycles in the long term) is sometimes valuable in itself. 
In particular, when verifying suspected primes!

Regards
Brian Beesley
_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime
