On 09/28/2015 01:58 PM, Luis Ogando wrote:
> The problem is solved ! The solution was one suggested by Lyudmila
> Dobysheva : reboot the nodes. We will never know the origin of the
> problem, but, honestly, I do not care !
Good to hear that! So,
Hi Lyudmila,
Thanks again !
I will ask them.
All the best,
Luis
2015-09-29 10:37 GMT-03:00 Lyudmila Dobysheva :
> 29.09.2015 14:57, Laurence Marks wrote:
>
>> If it happens again, one thing to ask them to check is swap usage and
>> how much memory is
29.09.2015 14:57, Laurence Marks wrote:
If it happens again, one thing to ask them to check is swap usage and
how much memory is cached.
...
Alternatively it was something else, a zombie, big log files or other
things. Rebooting gets rid of a lot of system caches and helps
I stand for losing
From the tops sent before, it looks like the administrators might have
configured the system with no swap:
r1i1n2
Swap: 0M total, 0M used, 0M free, 10563M cached
r1i1n3
Swap: 0M total, 0M used, 0M free, 23089M cached
Keep in mind that having
Hi Lyudmila,
Unfortunately, they do not have "top mode 1" output corresponding to the
problem period.
Thanks again.
All the best,
Luis
2015-09-29 10:37 GMT-03:00 Lyudmila Dobysheva :
> 29.09.2015 14:57, Laurence Marks wrote:
>
>> If it happens again, one
Dear Prof. Marks,
Thanks !
I will send your message to the administrators !
All the best,
Luis
2015-09-29 8:57 GMT-03:00 Laurence Marks :
> If it happens again, one thing to ask them to check is swap usage and how
> much memory is cached. On
Hi Elias,
There were no other jobs in the specific queue I was using, and the nodes
are dedicated to that queue, so it was an opportunity to reboot them
without furious reactions from other users.
After trying everything suggested by the Wien2k community, the
administrators resignedly
If it happens again, one thing to ask them to check is swap usage and how
much memory is cached. On some of my nodes I have noticed that they do not
always release cached memory, and can start swapping. If this happens the
job will get very slow. The commands to use to clear the cache can be found
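For the record, on Linux the swap and cache figures can be checked like this (a generic sketch; I do not know which commands the page I had in mind actually lists):

```shell
# Overall memory picture: the "cached" figure is memory the kernel will
# normally release under pressure; growing "swap used" is the red flag.
free -m

# The same numbers straight from the kernel:
grep -E '^(SwapTotal|SwapFree|Cached)' /proc/meminfo

# With root, the usual way to flush the page cache is:
#   sync && echo 3 > /proc/sys/vm/drop_caches
```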
Dear Wien2k community,
I would like to thank you for so many hints !
The problem is solved ! The solution was one suggested by Lyudmila
Dobysheva: reboot the nodes. We will never know the origin of the problem,
but, honestly, I do not care !
"There are more things in heaven and earth, Horatio,
Sounds like a nasty problem … In terms of strategy, I think the first
thing should be to find out if the node is really to blame. If so,
you have to convince the admins and/or find a way to avoid it. If
not, you can turn to figuring out whatever
Hello,
I'd suggest trying three things.
First of all - does your cluster allow running interactive jobs? If yes,
then you should create an interactive job to run /bin/bash. I'm not
familiar with PBS, but in SGE/OGE, if you print the cluster queues with
"qstat -f" you'll see "I" in the qtype column
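For illustration, here is roughly what to look for in the "qstat -f" listing. The output below is mocked up (queue names and numbers are invented); only the column layout follows SGE:

```shell
# Mocked "qstat -f" output (SGE layout; values invented for illustration).
sample='queuename                      qtype resv/used/tot. load_avg arch
---------------------------------------------------------------
all.q@r1i1n1                   BIP   0/12/12        12.01    lx26-amd64
inter.q@r1i1n2                 I     0/0/12         0.02     lx26-amd64'

# In SGE the qtype letters mean B=batch, I=interactive, P=parallel,
# so any queue whose qtype contains "I" accepts interactive jobs:
echo "$sample" | awk 'NR > 2 && $2 ~ /I/ {print $1}'

# On such a queue an interactive shell is typically requested with
#   qrsh /bin/bash
```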
Luis,
First of all, I wonder: To what extent is this problem reproducible?
E.g., does your job always run on the same 4 nodes? Is it always the
same node(s) that are slow? Does the problem also show up in other
calculations (maybe just changing the
Dear Prof. Marks,
As I suspected, users cannot use ganglia. Our administrators are very
jealous !!
Dear Elias Assmann,
Many thanks for your comments. I will try to comment on some of them.
First of all, I wonder: To what extent is this problem reproducible?
> E.g., does your job always
Ganglia is web based, you don't need ssh. Please read the link I sent.
---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
http://www.numis.northwestern.edu
Corrosion in 4D http://MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is
Nooo!
You should use ganglia yourself.
OK ! In this case, I will try it !
Many thanks,
Luis
2015-09-23 9:23 GMT-03:00 Laurence Marks :
> Ganglia is web based, you don't need ssh. Please read the link I sent.
Hi,
I cannot access the nodes. SSH among them is forbidden ! We have to ask
the administrators for everything !! It is hell !!
Of course, only the PBS jobs can "travel" among the nodes.
All the best,
Luis
2015-09-23 9:14 GMT-03:00 Laurence Marks
Dear Prof. Marks,
Thank you for your comment.
I sent your suggestions to the administrators.
All the best,
Luis
2015-09-23 8:56 GMT-03:00 Laurence Marks :
> It is hard to work this out remotely, particularly with unfriendly
> sys_admin.
>
> I
Dear Prof. Blaha and Lyudmila Dobysheva,
Many thanks for your comments !
Unfortunately, users have no privileges in the cluster. I will send your
comments to the administrators and let's see what happens.
Many thanks again,
Luis
It is hard to work this out remotely, particularly with unfriendly
sys_admin.
I would find out if you have ganglia available, see
http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Admin/books/ICEX_Admin_Guide/sgi_html/ch05.html#Z1190844523tls
. This is much more useful than
Of course, at the "same time" ONLY lapw0_mpi OR lapw1_mpi should be
running.
However, I assume you did these "tops" sequentially one after the
other ??? and of course, in an scf-cycle, after a few minutes running
lapw0, lapw1 will start
Do these tests in several windows in parallel.
22.09.2015 23:08, Luis Ogando wrote:
r1i1n1 -
top - 17:40:46 up 12 days, 9 min, 2 users, load average: 10.55, 4.34, 1.74
Cpu(s): 100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si,
r1i1n2 -
top - 17:42:30 up 221 days, 6:29, 1 user, load average: 10.76,
23.09.2015 12:22, Lyudmila Dobysheva wrote:
the jobs are all at one processor of the node
To be sure, try this:
In top at n2, type "1" to show individual CPU usage.
It is better to do this after some time, once the starting phase has passed.
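If interactive top is not possible, the same per-CPU information can be read non-interactively (a generic Linux sketch, nothing Wien2k-specific):

```shell
# One "cpuN" line per core; the columns are cumulative jiffies
# (user nice system idle ...). If only one cpuN line accumulates user
# time while the others stay idle, all the MPI ranks share one core.
grep '^cpu[0-9]' /proc/stat

# Newer procps top can print the per-CPU view in batch mode:
#   top -b -n1 -1 | head -n 20
```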
23.09.2015 11:25, Peter Blaha wrote:
> With only a few
22.09.2015 23:08, Luis Ogando wrote:
r1i1n2
 PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
2096 ogando   20  0  927m 642m  20m R    9  1.8  0:09.30 lapw1c_mpi
2109 ogando   20  0  926m 633m  17m R    9  1.8  0:14.58 lapw1c_mpi
2122 ogando   20  0  924m 633m
23.09.2015 12:20, Lyudmila Dobysheva wrote:
the jobs are all at one node
at one processor of the node, of course
Lyudmila Dobysheva
--
Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
426001 Izhevsk, ul.Kirova 132
With only a few seconds cpu time, the job is just in the starting phase
(allocating memory, reading files, distributing data) and thus cpu-load
is very low.
A few seconds later, this should reach about 100 % for each lapw1_mpi.
On 09/23/2015 11:20 AM, Lyudmila Dobysheva wrote:
22.09.2015
Trying to decrease the size of a previous message !!!
--
Dear Prof. Blaha and Marks,
Please, find
Dear Prof. Marks,
Many thanks for your help.
The administrators said that everything is OK and the software is the
problem (the easy answer): no zombies, no other jobs in the node, ... !!
Let me give you more information to see if you can imagine other
possibilities:
1) Intel Xeon Six
a) Check your .machines file. Does it meet your expectations, or does
this node have too large a load?
b) Can you interactively login into these nodes while your job is running ?
If yes, login on 2 nodes (in two windows) and run top
c) If nothing obvious is wrong so far, test the network by doing
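For comparison, a balanced .machines file for 4 nodes running MPI-parallel lapw1/lapw2 looks roughly like this (following the Wien2k user's guide conventions; the node names are the ones from this thread, and 12 cores per node is only a guess):

```
lapw0: r1i1n1:12 r1i1n2:12 r1i1n3:12 r1i1n4:12
1: r1i1n1:12
1: r1i1n2:12
1: r1i1n3:12
1: r1i1n4:12
granularity:1
extrafine:1
```

If one node appears on more lines than the others, it will carry a correspondingly larger load.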
Dear Professor Blaha,
Thank you !
My .machines file is OK.
I will ask the administrator to follow your other suggestions (users do
not have privileges).
All the best,
Luis
2015-09-21 10:22 GMT-03:00 Peter Blaha :
> a) Check your .machines
Almost certainly one or more of:
* Other jobs on the node
* Zombie process(es)
* Too many mpi
* Bad memory
* Full disc
* Too hot
If you have it, use ganglia; if not, ssh in and use top/ps or whatever SGI
has. If you cannot sudo, get help from someone who can.
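Most of the items on that list can be checked from an ordinary shell on the node, e.g. (generic Linux commands, nothing SGI-specific):

```shell
uptime                                        # load average vs. core count
ps -eo stat,pid,user,comm | awk '$1 ~ /^Z/'   # list zombie processes
df -h                                         # any filesystem at 100% ?
grep -E '^(MemFree|SwapFree)' /proc/meminfo   # memory pressure / swapping
# Temperature usually needs lm-sensors, if the admins installed it:
#   sensors
```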
On Sep 18, 2015 8:58 PM, "Luis Ogando"
Dear Wien2k community,
I am using Wien2k in a SGI cluster with 32 nodes. My calculation is
running in 4 nodes that have the same characteristics and only my job is
running in these 4 nodes.
I noticed that one of these 4 nodes is spending more than 20 times the
time spent by the other 3