Hi Jorg

If you have single quad-core nodes as you said,
then top shows that you are oversubscribing the cores:
there are five nwchem processes running.

In my experience, oversubscription only works in relatively
light MPI programs (say the example programs that come with OpenMPI or
MPICH).
Real-world applications tend to be very inefficient,
and can even hang, on oversubscribed CPUs.

What happens when you launch four or fewer processes
per node instead of five?
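For example, something like this (a sketch only: I am assuming
OpenMPI's mpirun here, and the hostfile and input file names are
placeholders; check the mpirun man page for your version):

```shell
# Launch 4 ranks per quad-core node instead of 5.
# "hosts" and "input.nw" are placeholders for your own files.
mpirun -np 8 -npernode 4 --hostfile hosts nwchem input.nw
```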

My $0.02.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------




Jörg Saßmannshausen wrote:
Dear all,

I am scratching my head, but apart from getting splinters into my fingers I cannot find a good answer to the following problem. I am running a DFT program (NWChem) in parallel on our cluster (AMD Opterons, one quad-core CPU per node, 12 GB of RAM, Gigabit network), and at certain stages of the run top presents me with this:

top - 15:10:48 up 13 days, 22:20,  1 user,  load average: 0.26, 0.24, 0.19
Tasks: 106 total,   1 running, 105 sleeping,   0 stopped,   0 zombie
Cpu0  :  8.0% us,  2.7% sy,  0.0% ni, 82.7% id,  0.0% wa,  1.3% hi,  5.3% si
Cpu1  :  4.1% us,  1.4% sy,  0.0% ni, 94.6% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu2  :  2.7% us,  0.0% sy,  0.0% ni, 97.3% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu3  :  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:  12250540k total,  5581756k used,  6668784k free,   273396k buffers
Swap: 16779884k total,        0k used, 16779884k free,  3841688k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16885 sassy     15   0 3928m 1.7g 1.4g S    4 14.4 312:19.92 nwchem
16886 sassy     15   0 3928m 1.7g 1.4g S    4 14.5 313:08.77 nwchem
16887 sassy     15   0 3920m 1.7g 1.4g S    3 14.4 316:18.24 nwchem
16888 sassy     15   0 3923m 1.6g 1.3g S    3 13.3 316:13.55 nwchem
16890 sassy     15   0 2943m 1.7g 1.7g S    3 14.8 104:32.33 nwchem

This is not just for a few seconds; it goes on for a prolonged period of time. I checked randomly for about a minute at a time, and CPU utilization is well below 50% (most of the time around 20%). I have not noticed this when running the job within one node.

I suspect that the Gigabit network is the problem, but I would really like to pinpoint it, so I can get my boss to upgrade to a better network for parallel computing (hence my previous question about Open-MX). Since I am not an admin of that cluster, how would I be able to do that?

Thanks for your comments.

Best wishes from Glasgow!

Jörg


_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf