From: Neil Costigan [mailto:[EMAIL PROTECTED]
Sent: Wed 05/04/2006 08:41
To: Neil Costigan
Cc: Bernard Li; [email protected]
Subject: Re: [Oscar-users] help!. building client image (scientific linux 305)
I've managed to solve the issue with the PBS nodes. The client nodes had
(in /var/spool/pbs/mom_priv/config) a name for the head node which didn't
match the hosts file.

My best guess is that this happened because my head node's public
interface gets its IP/name via DHCP. When I swapped out the faulty network
card, the new MAC address on the new card would have acquired a new DHCP
lease, which updated the head node's name in the hosts file.

I corrected it manually and it works now.
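The mismatch described above can be checked mechanically. The following is a minimal sketch in Python, not something taken from the thread. It assumes the MOM config names the server on a `$pbsserver` or `$clienthost` line (which directive is present depends on the PBS/Torque version, and the thread does not show the file), and that the hosts file has the usual "IP name alias..." layout. The function names and the sample data are hypothetical.

```python
def server_names_in_mom_config(config_text):
    """Return host names found on $pbsserver / $clienthost lines."""
    names = []
    for line in config_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(fields) >= 2 and fields[0] in ("$pbsserver", "$clienthost"):
            names.append(fields[1])
    return names


def names_in_hosts_file(hosts_text):
    """Return every host name and alias listed in hosts-file text."""
    known = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        # fields[0] is the IP address; the rest are names and aliases.
        known.update(fields[1:])
    return known


def unresolved_server_names(config_text, hosts_text):
    """Names the MOM config refers to that the hosts file does not list."""
    known = names_in_hosts_file(hosts_text)
    return [n for n in server_names_in_mom_config(config_text) if n not in known]


# Hypothetical sample data: the config still carries a stale head-node name.
sample_config = "$clienthost old-head.example.org\n"
sample_hosts = "127.0.0.1 localhost\n10.0.0.1 new-head.example.org pbs_oscar\n"
print(unresolved_server_names(sample_config, sample_hosts))
```

On a real client node the two strings would be read from /var/spool/pbs/mom_priv/config and /etc/hosts. An empty result only means the names are listed, not that they resolve to the right address.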

My LAM and MPICH tests via Torque still fail. The shelltest output files
contain:

::::::::::::::
/home/oscartst/torque/shelltest.err
::::::::::::::
error 15010 on spawn
error 15010 on spawn
error 15010 on spawn
error 15010 on spawn
::::::::::::::
/home/oscartst/torque/shelltest.out
::::::::::::::
cc002.pg-207.computing.dcu.ie
cc001.pg-207.computing.dcu.ie
Hello, date is 04/05/06, time is 15:58:34
Hello, date is 04/05/06, time is 15:58:46

The PBS log on the head node gives (edited down to the relevant time):

04/05/2006 15:43:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:44:09;0040;PBS_Server;Req;is_stat_get;node cc002.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:16;0040;PBS_Server;Req;is_stat_get;node cc003.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:19;0040;PBS_Server;Req;is_stat_get;node cc004.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:26;0040;PBS_Server;Req;is_stat_get;node cc001.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:45:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:46:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:47:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:48:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:49:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:50:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:51:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:52:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:53:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:54:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:55:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:56:44;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:56:51;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
04/05/2006 15:57:15;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:57:32;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
04/05/2006 15:57:56;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:58:16;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
04/05/2006 15:58:19;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:58:46;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
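
A log excerpt like the one above can be summarized with a few lines of Python. This is only a sketch: it assumes each record is a single line with six semicolon-separated fields, as in the excerpt, and the field names in the code are descriptive guesses rather than official Torque terminology. It lists which nodes the server marked available, which is a quick way to confirm that all four compute nodes reported in.

```python
def parse_log_line(line):
    """Split one server log record into its six ';'-separated fields."""
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None  # not a complete record (blank or wrapped line)
    keys = ("timestamp", "event", "daemon", "obj_class", "obj_name", "message")
    return dict(zip(keys, parts))


def nodes_marked_available(log_text):
    """Return node names from 'node <name> marked available' messages, in order."""
    nodes = []
    for line in log_text.splitlines():
        record = parse_log_line(line)
        if record is None:
            continue
        words = record["message"].split()
        if len(words) == 4 and words[0] == "node" and words[2:] == ["marked", "available"]:
            if words[1] not in nodes:
                nodes.append(words[1])
    return nodes


# Three records copied from the excerpt in this message.
sample_log = (
    "04/05/2006 15:43:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time\n"
    "04/05/2006 15:44:09;0040;PBS_Server;Req;is_stat_get;node cc002.pg-207.computing.dcu.ie marked available\n"
    "04/05/2006 15:44:16;0040;PBS_Server;Req;is_stat_get;node cc003.pg-207.computing.dcu.ie marked available\n"
)
print(nodes_marked_available(sample_log))
```

A node being marked available by the server says nothing about whether jobs can be spawned on it, so this narrows the problem down rather than explaining the spawn errors.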

Given that shelltest has two 'commands' (hostname and date/time), and
that, according to Google, Torque error 15010 is 'system error occurred',
it looks to me as if the original 2 nodes work but the added nodes do not
(logic: 2 nodes * 2 commands = 4 errors).

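The arithmetic behind that guess can be written out as a short Python sketch. It assumes, as the reasoning above does, that each failing node produces exactly one "error 15010 on spawn" line per command; that is an inference from the output, not documented Torque behavior. The function names are hypothetical.

```python
def count_spawn_errors(err_text, code=15010):
    """Count lines of the form 'error <code> on spawn'."""
    wanted = ["error", str(code), "on", "spawn"]
    return sum(1 for line in err_text.splitlines() if line.split() == wanted)


def implied_failing_nodes(error_count, commands_per_node):
    """Failing nodes, assuming one error per command per failing node."""
    nodes, leftover = divmod(error_count, commands_per_node)
    if leftover:
        raise ValueError("error count is not a multiple of commands per node")
    return nodes


# shelltest.err as shown in this message: four identical error lines.
shelltest_err = "error 15010 on spawn\n" * 4
errors = count_spawn_errors(shelltest_err)
print(errors, "errors;", implied_failing_nodes(errors, 2), "nodes implied")
```

With four errors and two commands this gives two failing nodes, which matches the two hostnames missing from shelltest.out.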
/nc
Neil Costigan wrote:
> Bernard Li wrote:
>
>> What's the output of 'pbsnodes -a'?
>>
>
> pbsnodes -a returns that all are unknown or down
>
> [EMAIL PROTECTED] oscar]# pbsnodes -a
> cc001.pg-207.computing.dcu.ie
>      state = state-unknown,down
>      np = 1
>      properties = all
>      ntype = cluster
>
> cc002.pg-207.computing.dcu.ie
>      state = state-unknown,down
>      np = 1
>      properties = all
>      ntype = cluster
>
> cc003.pg-207.computing.dcu.ie
>      state = state-unknown,down
>      np = 1
>      properties = all
>      ntype = cluster
>
> cc004.pg-207.computing.dcu.ie
>      state = state-unknown,down
>      np = 1
>      properties = all
>      ntype = cluster
>
>> Is pbs_mom running on all your client nodes?
>>
>
> Running 'ps aux | grep pbs_mom' on all nodes shows that it is.
>
> I have tried moving the pbs_oscar alias from the private to the public
> address in /etc/hosts, with no success.
>
> To recap:
>
> * OSCAR version 4.2.1b5
> * Fedora Core 3
> * x86
>
> - Successfully passed test_cluster after initial setup with head node
>   and two compute nodes. Happy days.
> - The test fails after adding two new nodes, which are up and alive. They
>   can mount /home and pass the SSH pings, PVM, etc., but fail PBS:
>
> /opt/pbs/bin/pbsnodes: cannot connect to server pbs_oscar, error=111
>
> It then fails with "Not enough free nodes".
>
>
> /nc
>
>> Cheers,
>>
>> Bernard
>>
>>> well it was going well
>>>
>>> i added two more nodes
>>> and now it fails
>>>
>>> [EMAIL PROTECTED] oscar]# testing/test_cluster
>>> Performing root tests...
>>> Maui service check:maui                                   [PASSED]
>>> Shutting down TORQUE Server:                              [  OK  ]
>>> Connection refused
>>> /opt/pbs/bin/pbsnodes: cannot connect to server pbs_oscar, error=111
>>> Torque node check                                         [PASSED]
>>> Starting TORQUE Server:                                   [  OK  ]
>>> Torque service check:pbs_server                           [PASSED]
>>> /home mounts                                              [PASSED]
>>>
>>> Preparing user tests...
>>> Performing user tests...
>>> SSH ping test                                             [PASSED]
>>> SSH server->node                                          [PASSED]
>>> SSH node->server                                          [PASSED]
>>> Checking for 4 free nodes:                                [FAILED]
>>> Not enough free nodes. Tests incomplete.
>>> Checking for 4 free nodes:                                [FAILED]
>>> Not enough free nodes. Tests incomplete.
>>> Checking for 4 free nodes:                                [FAILED]
>>> Not enough free nodes. Tests incomplete.
>>> Torque default queue definition                           [PASSED]
>>> Checking for 4 free nodes:                                [FAILED]
>>> Not enough free nodes. Tests incomplete.
>>> Ganglia setup test                                        [PASSED]
>>> Ganglia node count test                                   [PASSED]
>>>
>>
>
>
>
