Hi Jorg
If you have single quad core nodes as you said,
then top shows that you are oversubscribing the cores.
There are five nwchem processes running.
In my experience, oversubscription only works in relatively
light MPI programs (say the example programs that come with OpenMPI or
MPICH).
Real-world applications tend to run very inefficiently,
and can even hang, on oversubscribed CPUs.
What happens when you launch four or fewer processes
on a node instead of five?
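If you want to check this on each node without admin rights, something like the rough Python sketch below might help. It is only an illustration: it assumes a Linux node where `ps -e -o comm=` is available, and the function names (`count_processes`, `oversubscription`, `check_node`) are made up for this example, not part of NWChem or any MPI package.

```python
import os
import subprocess


def count_processes(ps_output, name):
    """Count lines of `ps -e -o comm=` output that equal the command name."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == name)


def oversubscription(n_procs, n_cores):
    """Return how many processes exceed the available cores (0 if none)."""
    return max(0, n_procs - n_cores)


def check_node(name="nwchem"):
    """Compare running processes called `name` with the cores on this node.

    Assumption: `ps` from procps is on the PATH (typical on Linux clusters).
    """
    out = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    n_procs = count_processes(out, name)
    n_cores = os.cpu_count() or 1  # cpu_count() may return None
    extra = oversubscription(n_procs, n_cores)
    print(f"{n_procs} {name} processes on {n_cores} cores, {extra} too many")
    return extra
```

Calling `check_node()` on a compute node while the job runs prints the process and core counts for that node. With the numbers from your top output (five processes, four cores), `oversubscription(5, 4)` gives 1.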
My $0.02.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
Jörg Saßmannshausen wrote:
Dear all,
I am scratching my head but apart from getting splinters into my fingers I
cannot find a good answer for the following problem:
I am running a DFT program (NWChem) in parallel on our cluster (AMD Opterons,
a single quad-core CPU per node, 12 GB of RAM, Gigabit network) and at certain
stages of the run top is presenting me with this:
top - 15:10:48 up 13 days, 22:20, 1 user, load average: 0.26, 0.24, 0.19
Tasks: 106 total, 1 running, 105 sleeping, 0 stopped, 0 zombie
Cpu0 : 8.0% us, 2.7% sy, 0.0% ni, 82.7% id, 0.0% wa, 1.3% hi, 5.3% si
Cpu1 : 4.1% us, 1.4% sy, 0.0% ni, 94.6% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu2 : 2.7% us, 0.0% sy, 0.0% ni, 97.3% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu3 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 12250540k total, 5581756k used, 6668784k free, 273396k buffers
Swap: 16779884k total, 0k used, 16779884k free, 3841688k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
16885 sassy     15   0 3928m 1.7g 1.4g S    4 14.4 312:19.92 nwchem
16886 sassy     15   0 3928m 1.7g 1.4g S    4 14.5 313:08.77 nwchem
16887 sassy     15   0 3920m 1.7g 1.4g S    3 14.4 316:18.24 nwchem
16888 sassy     15   0 3923m 1.6g 1.3g S    3 13.3 316:13.55 nwchem
16890 sassy     15   0 2943m 1.7g 1.7g S    3 14.8 104:32.33 nwchem
This does not last for just a few seconds; it appears to go on for a prolonged
period of time. I checked it at random for, say, 1 min and the performance is
well below 50 % (most of the time around 20 %). I have not noticed this when
running the job within one node.
I have the suspicion that the Gigabit network is the problem, but I really
would like to pinpoint that so I can get my boss to upgrade to a better
network for parallel computing (hence my previous question about Open-MX).
But since I am not an admin of that cluster, how would I be able to do that?
Thanks for your comments.
Best wishes from Glasgow!
Jörg
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing