Is that node down?  If not, perhaps try restarting pbs_mom on that node:
 
# /etc/init.d/pbs_mom restart
 
Cheers,
 
Bernard


From: Brad Aisa [mailto:[EMAIL PROTECTED]
Sent: Sun 23/07/2006 17:00
To: Bernard Li; oscar devel
Cc: Erich Focht
Subject: Re: errors during cluster test

the pbs stat is fine

here is the qstat thingy -- it gives an error when i try to delete the job:

[EMAIL PROTECTED] oscar]# qstat
Job id              Name             User             Time Use S Queue
------------------- ---------------- ---------------- -------- - -----
4.janus             openmpitest      oscartst                0 R workq
[EMAIL PROTECTED] oscar]# qdel 4
qdel: Server could not connect to MOM 4.janus.androticus
[EMAIL PROTECTED] oscar]#            
 
Brad Aisa
baisa at brad-aisa dot com


----- Original Message ----
From: Bernard Li <[EMAIL PROTECTED]>
To: Brad Aisa <[EMAIL PROTECTED]>; oscar devel <[email protected]>
Cc: Erich Focht <[EMAIL PROTECTED]>
Sent: Sunday, July 23, 2006 5:38:29 PM
Subject: RE: errors during cluster test

If you have jobs running in your cluster, the tests won't work, because they need 15 nodes to run (i.e. they use up all your nodes).
 
Check to see if you have jobs running:
 
# qstat
 
If there are, remove them:
 
# qdel <jobid>
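
If several jobs are listed, the IDs can be pulled out of the qstat listing rather than typed one by one. The helper below is only a sketch: the name job_ids is made up for illustration, and it assumes the default qstat layout shown elsewhere in this thread (two header lines, job ID in the first column).

```shell
# Print the job ID column from `qstat` output read on stdin.
# Assumes the default layout: a title line, a dashed rule, then one job per line.
job_ids() {
    awk 'NR > 2 && NF > 0 { print $1 }'
}

# Possible use on a live cluster (assumption, not run here):
#   for id in $(qstat | job_ids); do qdel "$id"; done
```

If qstat prints nothing (no jobs), the helper prints nothing and the loop does not run.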
 
Also, you might want to check the output of pbsnodes -a, to see if you have nodes which are down (according to TORQUE).
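
To pick the down nodes out of a long pbsnodes -a listing, something like the following may help. It is a rough sketch: the function name down_nodes is hypothetical, and it assumes the usual TORQUE layout in which each node name starts in the first column and its attributes follow as indented "key = value" lines.

```shell
# Print the names of nodes whose state includes "down" or "offline",
# reading `pbsnodes -a` style output on stdin.
down_nodes() {
    awk '
        /^[^ \t]/ { node = $1 }    # an unindented line is taken to be a node name
        $1 == "state" && $3 ~ /down|offline/ { print node }
    '
}

# Possible use on the head node (assumption, not run here):
#   pbsnodes -a | down_nodes
```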
 
Cheers,
 
Bernard


From: Brad Aisa [mailto:[EMAIL PROTECTED]
Sent: Sun 23/07/2006 16:22
To: Bernard Li; oscar devel
Cc: Erich Focht
Subject: Re: errors during cluster test

no .err or .out files -- i looked at all files in all subdirectories, and none were older than the installation, none dated the day/time of the tests

btw, it was not the same mpi test failure, didn't seem to even get that far -- complained about not enough nodes -- i've attached the png

as for the node commands, i'll have to run those next time i fire everything up, but the yume update of the nodes to the new openmpi did work, so my repo, my headnode, my image, and my clients are all updated on that front

thanks for any help!

Brad Aisa
baisa at brad-aisa dot com


----- Original Message ----
From: Bernard Li <[EMAIL PROTECTED]>
To: Brad Aisa <[EMAIL PROTECTED]>; oscar devel <[email protected]>
Cc: Erich Focht <[EMAIL PROTECTED]>
Sent: Sunday, July 23, 2006 2:24:22 PM
Subject: RE: errors during cluster test

There are no .err and .out files in the package directory (like /home/oscartst/openmpi)? (BTW I will change the text about the log files to be more clear...)
 
Anyways, can you show the output of testing again?  And also the output of:
 
# cexec rpm -q openmpi
# cexec rpm -q openmpi-modulefile
 


_______________________________________________
Oscar-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-devel
