That's a bug with TORQUE. However manually deleting the job works.
BTW, I think there's also a --force option with qdel in newer versions of TORQUE, not sure if it's available in the version we have (or maybe I got confused with SGE).
Regarding the RPM, it should be "openmpi-switcher-modulefile", not "openmpi-modulefile". Sorry for the confusion.
Cheers,
Bernard
From: Brad Aisa [mailto:[EMAIL PROTECTED]
Sent: Sun 23/07/2006 17:18
To: oscar devel
Cc: Bernard Li
Subject: Re: errors during cluster test
<big sigh>
ok, there seems to be a bug in the MPITEST (or maybe pbs) whereby if the test fails, it leaves the job in the pbs queue and it doesn't seem to be deletable
here is my qstat after the failed test:
[EMAIL PROTECTED] ~]# qstat
Job id              Name             User             Time Use S Queue
------------------- ---------------- ---------------- -------- - -----
5.janus             openmpitest      oscartst                0 R workq
[EMAIL PROTECTED] ~]# qdel 5.janus
[EMAIL PROTECTED] ~]# qstat
Job id              Name             User             Time Use S Queue
------------------- ---------------- ---------------- -------- - -----
5.janus             openmpitest      oscartst                0 R workq
These instructions I found somewhere for manually deleting:
If the node crashed and reinstalled itself you need to remove the jobs from the queue manually by removing the JB and SC files belonging to the jobs in question from /var/spool/pbs/server_priv/jobs. I recommend that you do "service pbs_server stop" before and "service pbs_server start" after you do this.
also, package "openmpi-modulefile" is not installed on any node (i'll try to install that)
Brad Aisa
baisa at brad-aisa dot com
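
The manual cleanup quoted above could be sketched roughly as follows. This is only an illustration, not a tested procedure for this cluster: the helper names (stuck_job_files, purge_stuck_job) and the JOBS_DIR and DRY_RUN variables are hypothetical, and the assumption that the files are named "<jobid>.JB" and "<jobid>.SC" is not confirmed anywhere in this thread. It defaults to a dry run that only prints what it would remove.

```shell
#!/bin/sh
# Sketch: list (and optionally remove) the JB/SC files of one stuck job.
# The default directory comes from the quoted instructions; the
# "<jobid>.JB" / "<jobid>.SC" file naming is an assumption.
JOBS_DIR="${JOBS_DIR:-/var/spool/pbs/server_priv/jobs}"

# Print the job files that exist for the given job id, one per line.
stuck_job_files() {
    for ext in JB SC; do
        f="$JOBS_DIR/$1.$ext"
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Remove those files. With DRY_RUN=1 (the default) only report them.
purge_stuck_job() {
    stuck_job_files "$1" | while IFS= read -r f; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "would remove $f"
        else
            rm -f -- "$f"
        fi
    done
}
```

Following the quoted advice, one would run "service pbs_server stop" first, then something like "DRY_RUN=0 purge_stuck_job 5.janus", then "service pbs_server start".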
----- Original Message ----
From: Bernard Li <[EMAIL PROTECTED]>
To: Brad Aisa <[EMAIL PROTECTED]>; oscar devel <[email protected]>
Cc: Erich Focht <[EMAIL PROTECTED]>
Sent: Sunday, July 23, 2006 5:38:29 PM
Subject: RE: errors during cluster test
If you have jobs running in your cluster, the tests won't work because they need 15 nodes to run (i.e. they use up all your nodes).
Check to see if you have jobs running:
# qstat
If there are, remove them:
# qdel <jobid>
Also, you might want to check the output of pbsnodes -a, to see if you have nodes which are down (according to TORQUE).
Cheers,
Bernard
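
The checks suggested above (any jobs still queued, any nodes down) could be scripted along these lines. This is a rough sketch: the function names job_ids and down_nodes are hypothetical, and it assumes the usual layouts, namely qstat printing a header line and a dashed rule before the job rows, and pbsnodes -a printing each node name flush left with indented "state = ..." attribute lines beneath it.

```shell
#!/bin/sh
# Sketch: pre-flight helpers that read scheduler output on stdin.

# Print job ids from `qstat` output (skips the header and dashed rule).
job_ids() {
    awk 'NR > 2 && NF > 0 { print $1 }'
}

# Print the names of nodes whose state contains "down" or "offline",
# reading `pbsnodes -a` output (node name flush left, attributes indented).
down_nodes() {
    awk '
        /^[^ \t]/ { node = $1 }
        /^[ \t]+state = / && /down|offline/ { print node }
    '
}
```

Usage would be "qstat | job_ids" and "pbsnodes -a | down_nodes"; empty output from both suggests the queue is clear and TORQUE sees every node as up.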
From: Brad Aisa [mailto:[EMAIL PROTECTED]
Sent: Sun 23/07/2006 16:22
To: Bernard Li; oscar devel
Cc: Erich Focht
Subject: Re: errors during cluster test
no .err or .out files -- i looked at all files in all subdirectories, and none were older than the installation, none dated the day/time of the tests
btw, it was not the same mpi test failure, didn't seem to even get that far -- complained about not enough nodes -- i've attached the png
as for the node commands, i'll have to run those next time i fire everything up, but the yume update of the nodes to the new openmpi did work, so my repo, my headnode, my image, and my clients are all updated on that front
thanks for any help!
Brad Aisa
baisa at brad-aisa dot com
----- Original Message ----
From: Bernard Li <[EMAIL PROTECTED]>
To: Brad Aisa <[EMAIL PROTECTED]>; oscar devel <[email protected]>
Cc: Erich Focht <[EMAIL PROTECTED]>
Sent: Sunday, July 23, 2006 2:24:22 PM
Subject: RE: errors during cluster test
There are no .err and .out files in the package directory (like /home/oscartst/openmpi)? (BTW I will change the text about the log files to be more clear...)
Anyways, can you show the output of testing again? And also the output of:
# cexec rpm -q openmpi
# cexec rpm -q openmpi-modulefile
_______________________________________________
Oscar-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-devel
