I've managed to solve the issue with pbsnodes.
The client nodes had (in /var/spool/pbs/mom_priv/config) a name for the head node which didn't match the hosts file. My best guess is that this happened because the IP/name of my head node's public interface is assigned by DHCP and I swapped out a faulty network card. The new MAC address on the new card would have acquired a new DHCP lease and updated the head node's name in the hosts file.
I corrected it manually and it works now.
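
For anyone wanting to check for the same mismatch, here is a minimal sketch in Python. It assumes the mom config names the server on a `$pbsserver` (or older `$clienthost`) line and that the hosts file has the usual "IP name alias..." layout. The function names are made up for illustration and are not part of Torque or OSCAR.

```python
def server_names_from_mom_config(config_text):
    """Return the host names given on $pbsserver / $clienthost lines."""
    names = []
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in ("$pbsserver", "$clienthost"):
            names.append(parts[1])
    return names


def names_in_hosts(hosts_text):
    """Return every host name and alias listed in hosts-file text."""
    known = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        fields = line.split()
        if len(fields) >= 2:
            known.update(fields[1:])  # fields[0] is the IP address
    return known


def unmatched_server_names(config_text, hosts_text):
    """Server names the mom config refers to that the hosts file lacks."""
    known = names_in_hosts(hosts_text)
    return [name for name in server_names_from_mom_config(config_text)
            if name not in known]
```

Feeding it the contents of a node's /var/spool/pbs/mom_priv/config and /etc/hosts should return an empty list when the names agree; anything it returns is a name the node cannot look up in its hosts file.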

My LAM and MPICH tests via Torque still fail.
The shelltest output files contain:
::::::::::::::
/home/oscartst/torque/shelltest.err
::::::::::::::
error 15010 on spawn
error 15010 on spawn
error 15010 on spawn
error 15010 on spawn
::::::::::::::
/home/oscartst/torque/shelltest.out
::::::::::::::
cc002.pg-207.computing.dcu.ie
cc001.pg-207.computing.dcu.ie
Hello, date is 04/05/06, time is 15:58:34
Hello, date is 04/05/06, time is 15:58:46


The PBS log on the head node gives (edited down to the relevant time):
04/05/2006 15:43:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:44:09;0040;PBS_Server;Req;is_stat_get;node cc002.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:16;0040;PBS_Server;Req;is_stat_get;node cc003.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:19;0040;PBS_Server;Req;is_stat_get;node cc004.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:26;0040;PBS_Server;Req;is_stat_get;node cc001.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:45:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:46:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:47:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:48:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:49:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:50:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:51:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:52:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:53:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:54:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:55:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:56:44;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:56:51;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
04/05/2006 15:57:15;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:57:32;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
04/05/2006 15:57:56;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:58:16;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
04/05/2006 15:58:19;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:58:46;0040;PBS_Server;Svr;pg-207;Scheduler sent command term



Given that the shelltest has two 'commands' (hostname and date/time), and that, according to Google, Torque error 15010 is 'system error occurred', it looks to me like the original 2 nodes work but the added nodes do not (logic: 2 nodes * 2 commands = 4 errors).
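
That arithmetic can be checked against the two files. A minimal sketch in Python, assuming each node that ran the job prints its host name on a line of its own in shelltest.out, in the same form as pbsnodes lists it; the function name summarize_shelltest is made up for illustration:

```python
ERROR_LINE = "error 15010 on spawn"


def summarize_shelltest(err_text, out_text, expected_nodes):
    """Count spawn errors and list expected nodes missing from the output."""
    errors = sum(1 for line in err_text.splitlines()
                 if line.strip() == ERROR_LINE)
    reported = {line.strip() for line in out_text.splitlines()}
    missing = [node for node in expected_nodes if node not in reported]
    return errors, missing


# Contents of shelltest.err and shelltest.out as posted above.
err_text = "\n".join([ERROR_LINE] * 4)
out_text = "\n".join([
    "cc002.pg-207.computing.dcu.ie",
    "cc001.pg-207.computing.dcu.ie",
    "Hello, date is 04/05/06, time is 15:58:34",
    "Hello, date is 04/05/06, time is 15:58:46",
])
nodes = ["cc%03d.pg-207.computing.dcu.ie" % i for i in range(1, 5)]

errors, missing = summarize_shelltest(err_text, out_text, nodes)
print(errors, missing)
# prints: 4 ['cc003.pg-207.computing.dcu.ie', 'cc004.pg-207.computing.dcu.ie']
```

Only cc001 and cc002 show up in the output, so the four errors would line up with two commands failing on each of cc003 and cc004. This is consistent with the guess rather than proof of it.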


/nc




Neil Costigan wrote:

Bernard Li wrote:

What's the output of 'pbsnodes -a'?


pbsnodes -a returns that all are unknown or down

[EMAIL PROTECTED] oscar]# pbsnodes -a
cc001.pg-207.computing.dcu.ie
    state = state-unknown,down
    np = 1
    properties = all
    ntype = cluster

cc002.pg-207.computing.dcu.ie
    state = state-unknown,down
    np = 1
    properties = all
    ntype = cluster

cc003.pg-207.computing.dcu.ie
    state = state-unknown,down
    np = 1
    properties = all
    ntype = cluster

cc004.pg-207.computing.dcu.ie
    state = state-unknown,down
    np = 1
    properties = all
    ntype = cluster


Is pbs_mom running on all your client nodes?


A 'ps aux | grep pbs_mom' on all nodes shows it is.


I have tried moving the pbs_oscar alias from the private to the public address in /etc/hosts, with no success.

To recap:

   * OSCAR version 4.2.1b5
   * Fedora Core 3
   * x86

- Successfully passed test_cluster after initial setup with the head node and two compute nodes. Happy days.
- Test fails after adding two new nodes, which are up and alive: they can mount /home and pass SSH pings, PVM etc., but fail PBS.

/opt/pbs/bin/pbsnodes: cannot connect to server pbs_oscar, error=111
then fails with not enough free nodes.


/nc

Cheers,

Bernard

well it was going well

i added two more nodes
and now it fails

[EMAIL PROTECTED] oscar]# testing/test_cluster
Performing root tests...
Maui service check:maui [PASSED]
Shutting down TORQUE Server:                               [  OK  ]
Connection refused
/opt/pbs/bin/pbsnodes: cannot connect to server pbs_oscar, error=111
Torque node check [PASSED]
Starting TORQUE Server:                                    [  OK  ]
Torque service check:pbs_server [PASSED]
/home mounts [PASSED]

Preparing user tests...
Performing user tests...
SSH ping test [PASSED]
SSH server->node [PASSED]
SSH node->server [PASSED]
Checking for 4 free nodes: [FAILED]
Not enough free nodes. Tests incomplete.
Checking for 4 free nodes: [FAILED]
Not enough free nodes. Tests incomplete.
Checking for 4 free nodes: [FAILED]
Not enough free nodes. Tests incomplete.
Torque default queue definition [PASSED]
Checking for 4 free nodes: [FAILED]
Not enough free nodes. Tests incomplete.
Ganglia setup test [PASSED]
Ganglia node count test [PASSED]




-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language that extends applications into web and mobile media. Attend the live webcast and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
Oscar-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-users



