That's a bug in TORQUE.  However, manually deleting the job works.
 
BTW, I think newer versions of TORQUE also have a --force option for qdel, but I'm not sure if it's available in the version we have (or maybe I'm confusing it with SGE).
 
Regarding the RPM, it should be "openmpi-switcher-modulefile", not "openmpi-modulefile".  Sorry for the confusion.
 
Cheers,
 
Bernard


From: Brad Aisa [mailto:[EMAIL PROTECTED]]
Sent: Sun 23/07/2006 17:18
To: oscar devel
Cc: Bernard Li
Subject: Re: errors during cluster test

<big sigh>

ok, there seems to be a bug in the MPITEST (or maybe pbs) whereby if the test fails, it leaves the job in the pbs queue, and the job doesn't seem to be deletable

here is my qstat after the failed test:

[EMAIL PROTECTED] ~]# qstat
Job id              Name             User             Time Use S Queue
------------------- ---------------- ---------------- -------- - -----
5.janus             openmpitest      oscartst                0 R workq
[EMAIL PROTECTED] ~]# qdel 5.janus
[EMAIL PROTECTED] ~]# qstat
Job id              Name             User             Time Use S Queue
------------------- ---------------- ---------------- -------- - -----
5.janus             openmpitest      oscartst                0 R workq

I found these instructions somewhere for manually deleting the job:

If the node crashed and reinstalled itself, you need to remove the jobs from
the queue manually by removing the JB and SC files belonging to the jobs in
question from /var/spool/pbs/server_priv/jobs. I recommend that you run "service
pbs_server stop" before and "service pbs_server start" after you do this.
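
The quoted procedure could be sketched as a small shell function. This is only a sketch under assumptions: the spool path is the one quoted above, the job files are assumed to be named <jobid>.JB and <jobid>.SC, and purge_job_files is a hypothetical helper name, not part of TORQUE.

```shell
# Sketch: remove the JB and SC files belonging to one stuck job.
# Assumptions: files are named <jobid>.JB and <jobid>.SC, and the spool
# directory is the one quoted above unless a second argument overrides it.
# purge_job_files is a hypothetical helper, not a TORQUE command.
purge_job_files() {
    jobid=$1
    jobs_dir=${2:-/var/spool/pbs/server_priv/jobs}
    for ext in JB SC; do
        f="$jobs_dir/$jobid.$ext"
        if [ -e "$f" ]; then
            rm -f -- "$f" && echo "removed $f"
        fi
    done
}

# Usage as root on the head node, with the server stopped around the
# deletion as the instructions recommend (uncomment to run):
# service pbs_server stop
# purge_job_files 5.janus
# service pbs_server start
```

Passing a different directory as the second argument makes it possible to try the function on a scratch copy of the jobs directory before touching the real spool.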


also, package "openmpi-modulefile" is not installed on any node (i'll try to install that)
 
Brad Aisa
baisa at brad-aisa dot com


----- Original Message ----
From: Bernard Li <[EMAIL PROTECTED]>
To: Brad Aisa <[EMAIL PROTECTED]>; oscar devel <[email protected]>
Cc: Erich Focht <[EMAIL PROTECTED]>
Sent: Sunday, July 23, 2006 5:38:29 PM
Subject: RE: errors during cluster test

If you have jobs running in your cluster, the tests won't work because they need 15 nodes to run (i.e. they use up all your nodes).
 
Check to see if you have jobs running:
 
# qstat
 
If there are, remove them:
 
# qdel <jobid>
 
Also, you might want to check the output of pbsnodes -a, to see if you have nodes which are down (according to TORQUE).
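
To pick out the problem nodes from that output, something like the following sketch might help. It assumes the usual pbsnodes -a layout (an unindented node name followed by indented "key = value" lines, including a "state = ..." line); list_down_nodes is a hypothetical helper name.

```shell
# Sketch: print the names of nodes whose state includes "down" or "offline".
# Reads `pbsnodes -a` style output on stdin. The layout (unindented node
# name, indented "state = ..." line) is an assumption about the output
# format; list_down_nodes is a hypothetical helper, not a TORQUE command.
list_down_nodes() {
    awk '
        /^[^ \t]/ { node = $1 }    # unindented line: remember the node name
        /^[ \t]+state = / && $3 ~ /down|offline/ { print node }
    '
}

# Usage on the head node:
# pbsnodes -a | list_down_nodes
```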
 
Cheers,
 
Bernard


From: Brad Aisa [mailto:[EMAIL PROTECTED]]
Sent: Sun 23/07/2006 16:22
To: Bernard Li; oscar devel
Cc: Erich Focht
Subject: Re: errors during cluster test

no .err or .out files -- i looked at all files in all subdirectories, and none were older than the installation, none dated the day/time of the tests

btw, it was not the same mpi test failure, didn't seem to even get that far -- complained about not enough nodes -- i've attached the png

as for the node commands, i'll have to run those next time i fire everything up, but the yume update of the nodes to the new openmpi did work, so my repo, my headnode, my image, and my clients are all updated on that front

thanks for any help!

Brad Aisa
baisa at brad-aisa dot com


----- Original Message ----
From: Bernard Li <[EMAIL PROTECTED]>
To: Brad Aisa <[EMAIL PROTECTED]>; oscar devel <[email protected]>
Cc: Erich Focht <[EMAIL PROTECTED]>
Sent: Sunday, July 23, 2006 2:24:22 PM
Subject: RE: errors during cluster test

There are no .err and .out files in the package directory (like /home/oscartst/openmpi)? (BTW, I will change the text about the log files to be clearer...)
 
Anyways, can you show the output of testing again?  And also the output of:
 
# cexec rpm -q openmpi
# cexec rpm -q openmpi-modulefile
 


_______________________________________________
Oscar-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-devel
