I've managed to solve the issue with the PBS nodes.
The client nodes had (in /var/spool/pbs/mom_priv/config) a name for the
head node which didn't match the hosts file.
My best guess is that this happened because my head node's public
interface gets its IP/name from DHCP. When I swapped out the faulty
network card, the new MAC on the new card would have acquired a
new DHCP lease and updated the head node's name/hosts file.
I corrected it manually and it works now.
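
For anyone hitting the same mismatch, here is a rough sketch of the check I mean. It assumes the head node is named on a $pbsserver or $clienthost line of the mom config; the function name check_mom_server is just something made up for illustration, and both file paths are passed in, so nothing here is specific to my install.

```shell
# check_mom_server CONFIG HOSTS   (hypothetical helper, not part of TORQUE)
# Prints each server name found in a pbs_mom config file and whether a
# matching name or alias exists in the given hosts file.
check_mom_server() {
    config=$1
    hosts=$2
    # Assumption: the head node is named on $pbsserver or $clienthost lines.
    awk '$1 == "$pbsserver" || $1 == "$clienthost" { print $2 }' "$config" |
    while read -r name; do
        # Drop comments from the hosts file, then look for the name in any
        # column after the IP address.
        if awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts"
        then
            echo "$name: found in $hosts"
        else
            echo "$name: NOT in $hosts"
        fi
    done
}
```

On a client node this would be run as: check_mom_server /var/spool/pbs/mom_priv/config /etc/hosts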
My LAM and MPICH tests via Torque still fail.
The shelltest output files have:
::::::::::::::
/home/oscartst/torque/shelltest.err
::::::::::::::
error 15010 on spawn
error 15010 on spawn
error 15010 on spawn
error 15010 on spawn
::::::::::::::
/home/oscartst/torque/shelltest.out
::::::::::::::
cc002.pg-207.computing.dcu.ie
cc001.pg-207.computing.dcu.ie
Hello, date is 04/05/06, time is 15:58:34
Hello, date is 04/05/06, time is 15:58:46
The PBS log on the head node gives (edited down to the relevant time):
04/05/2006 15:43:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:44:09;0040;PBS_Server;Req;is_stat_get;node cc002.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:16;0040;PBS_Server;Req;is_stat_get;node cc003.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:19;0040;PBS_Server;Req;is_stat_get;node cc004.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:26;0040;PBS_Server;Req;is_stat_get;node cc001.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:45:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:46:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:47:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:48:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:49:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:50:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:51:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:52:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:53:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:54:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:55:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:56:44;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:56:51;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
04/05/2006 15:57:15;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:57:32;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
04/05/2006 15:57:56;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:58:16;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
04/05/2006 15:58:19;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:58:46;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
Given that the shelltest has two 'commands' (hostname and date/time),
and, from Google, that Torque error 15010 is 'System error occurred',
it looks to me that the original 2 nodes work but the added nodes do
not (logic: 2 nodes * 2 commands = 4 errors).
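
The arithmetic can be sketched as a small helper. This assumes every failed spawn writes exactly one "error 15010 on spawn" line and that each node runs the same number of commands; failed_node_estimate is a made-up name, and the result is only consistent with my guess, not proof of it.

```shell
# failed_node_estimate ERRFILE COMMANDS_PER_NODE   (hypothetical helper)
# Estimates how many nodes failed, assuming one error line per failed
# command and the same number of commands on every node.
failed_node_estimate() {
    errfile=$1
    commands_per_node=$2
    # grep -c prints the number of matching lines (0 if there are none).
    errors=$(grep -c 'error 15010 on spawn' "$errfile")
    echo $(( errors / commands_per_node ))
}
```

For the job above this would be: failed_node_estimate /home/oscartst/torque/shelltest.err 2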
/nc
Neil Costigan wrote:
Bernard Li wrote:
What's the output of 'pbsnodes -a'?
pbsnodes -a reports that all nodes are unknown or down:
[EMAIL PROTECTED] oscar]# pbsnodes -a
cc001.pg-207.computing.dcu.ie
     state = state-unknown,down
     np = 1
     properties = all
     ntype = cluster

cc002.pg-207.computing.dcu.ie
     state = state-unknown,down
     np = 1
     properties = all
     ntype = cluster

cc003.pg-207.computing.dcu.ie
     state = state-unknown,down
     np = 1
     properties = all
     ntype = cluster

cc004.pg-207.computing.dcu.ie
     state = state-unknown,down
     np = 1
     properties = all
     ntype = cluster
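
With more nodes this listing gets long, so here is a small sketch that picks out just the nodes reported as down. It assumes the layout shown above (a node name alone on a line, followed by "state = ..." lines); down_nodes is a made-up name, not a TORQUE command.

```shell
# down_nodes   (hypothetical helper)
# Reads "pbsnodes -a" style output on stdin and prints the name of every
# node whose state field contains "down".
down_nodes() {
    awk '
        # A line holding a single word is taken to be a node name.
        NF == 1 { node = $1 }
        # Attribute lines look like: state = state-unknown,down
        $1 == "state" && $2 == "=" && $3 ~ /down/ { print node }
    '
}
```

It would be used as: pbsnodes -a | down_nodes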
Is pbs_mom running on all your client nodes?
Running ps aux | grep pbs_mom on all nodes shows it is.
I have tried moving the pbs_oscar alias from the private to the public
address in /etc/hosts, with no success.
To recap:
* OSCAR version 4.2.1b5
* Fedora Core 3
* x86
- Successfully passed test_cluster after initial setup with the head node
and two compute nodes. Happy days.
- The test fails after adding two new nodes, which are up and alive: they
can mount /home and pass ssh pings, pvm etc., but fail pbs with
/opt/pbs/bin/pbsnodes: cannot connect to server pbs_oscar, error=111
and the test then fails with not enough free nodes.
/nc
Cheers,
Bernard
Well, it was going well.
I added two more nodes and now it fails:
[EMAIL PROTECTED] oscar]# testing/test_cluster
Performing root tests...
Maui service check:maui [PASSED]
Shutting down TORQUE Server: [ OK ]
Connection refused
/opt/pbs/bin/pbsnodes: cannot connect to server pbs_oscar, error=111
Torque node check [PASSED]
Starting TORQUE Server: [ OK ]
Torque service check:pbs_server [PASSED]
/home mounts [PASSED]
Preparing user tests...
Performing user tests...
SSH ping test [PASSED]
SSH server->node [PASSED]
SSH node->server [PASSED]
Checking for 4 free nodes: [FAILED]
Not enough free nodes. Tests incomplete.
Checking for 4 free nodes: [FAILED]
Not enough free nodes. Tests incomplete.
Checking for 4 free nodes: [FAILED]
Not enough free nodes. Tests incomplete.
Torque default queue definition [PASSED]
Checking for 4 free nodes: [FAILED]
Not enough free nodes. Tests incomplete.
Ganglia setup test [PASSED]
Ganglia node count test [PASSED]
_______________________________________________
Oscar-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-users