Title: Re: [Oscar-users] help!. building client image (scientific linux 305)
Can you show us the output of pbsnodes -a now that you have resolved the problem with /etc/hosts?
 
I think your guess is right, and that the two new nodes cannot communicate with your pbs_server - did you check their error logs located in /var/spool/pbs?
 
Also, please post lamtest.err and mpichtest.err; they may help us figure out what's wrong.
 
You did re-run "complete cluster setup" after your 2 new nodes were added, right?
 
Cheers,
 
Bernard


From: Neil Costigan [mailto:[EMAIL PROTECTED]
Sent: Wed 05/04/2006 08:41
To: Neil Costigan
Cc: Bernard Li; [email protected]
Subject: Re: [Oscar-users] help!. building client image (scientific linux 305)



I've managed to solve the issue with the PBS nodes.
The client nodes had (in /var/spool/pbs/mom_priv/config) a name for the
head node which didn't match the hosts file.
My best guess is that this happened because my head node's public
interface gets its IP/name from DHCP. When I swapped out the faulty
network card, the new MAC address would have acquired a
new DHCP lease and updated the head node's name in the hosts file.
Manually corrected, and it works now.
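
A mismatch like this can be spotted without reading each file by eye. The following is a minimal sketch, assuming the server name appears in mom_priv/config on a $pbsserver or $clienthost line and that the hosts file uses the usual "address name alias..." layout. The function names and the sample file contents are hypothetical and not taken from this cluster.

```python
def server_names_from_mom_config(config_text):
    """Return the host names given on $pbsserver / $clienthost lines."""
    names = []
    for line in config_text.splitlines():
        parts = line.split("#", 1)[0].split()  # drop comments, split fields
        if len(parts) >= 2 and parts[0] in ("$pbsserver", "$clienthost"):
            names.append(parts[1])
    return names


def names_in_hosts(hosts_text):
    """Return every host name and alias listed in a hosts-style file."""
    known = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        known.update(fields[1:])  # fields[0] is the IP address
    return known


def unresolved_servers(config_text, hosts_text):
    """List server names from the MOM config that the hosts file lacks."""
    known = names_in_hosts(hosts_text)
    return [n for n in server_names_from_mom_config(config_text)
            if n not in known]


if __name__ == "__main__":
    # Hypothetical sample contents; on a node you would read the real
    # /var/spool/pbs/mom_priv/config and /etc/hosts instead.
    config = "$clienthost old-head-name\n$logevent 255\n"
    hosts = "127.0.0.1 localhost\n10.0.0.1 new-head-name pbs_oscar\n"
    print(unresolved_servers(config, hosts))
```

An empty list means every server name in the MOM config is present in the hosts file; any name printed is one the node cannot look up there.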

My LAM and MPICH tests via Torque still fail.
The shelltest output files contain:
::::::::::::::
/home/oscartst/torque/shelltest.err
::::::::::::::
error 15010 on spawn
error 15010 on spawn
error 15010 on spawn
error 15010 on spawn
::::::::::::::
/home/oscartst/torque/shelltest.out
::::::::::::::
cc002.pg-207.computing.dcu.ie
cc001.pg-207.computing.dcu.ie
Hello, date is 04/05/06, time is 15:58:34
Hello, date is 04/05/06, time is 15:58:46


The PBS log on the head node gives (edited down to the relevant time):
04/05/2006 15:43:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:44:09;0040;PBS_Server;Req;is_stat_get;node cc002.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:16;0040;PBS_Server;Req;is_stat_get;node cc003.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:19;0040;PBS_Server;Req;is_stat_get;node cc004.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:26;0040;PBS_Server;Req;is_stat_get;node cc001.pg-207.computing.dcu.ie marked available
04/05/2006 15:44:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:45:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:46:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:47:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:48:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:49:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:50:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:51:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:52:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:53:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:54:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:55:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time
04/05/2006 15:56:44;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:56:51;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
04/05/2006 15:57:15;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:57:32;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
04/05/2006 15:57:56;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:58:16;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
04/05/2006 15:58:19;0040;PBS_Server;Svr;pg-207;Scheduler sent command new
04/05/2006 15:58:46;0040;PBS_Server;Svr;pg-207;Scheduler sent command term
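
To pull just the node state changes out of a log like the one above, a short script is enough. This is a sketch only, assuming the "is_stat_get;node <name> marked available" wording shown in the excerpt; the function name is made up for illustration. The pattern tolerates a line break between "node" and the host name, since mail clients tend to wrap these records.

```python
import re

# Matches the "marked available" records; \s+ also spans a wrapped line.
AVAILABLE = re.compile(r"is_stat_get;node\s+(\S+)\s+marked available")


def nodes_marked_available(log_text):
    """Return the sorted, de-duplicated node names marked available."""
    return sorted(set(AVAILABLE.findall(log_text)))


if __name__ == "__main__":
    # Two records copied from the excerpt, one of them wrapped.
    sample = (
        "04/05/2006 15:43:48;0040;PBS_Server;Svr;pg-207;Scheduler sent command time\n"
        "04/05/2006 15:44:09;0040;PBS_Server;Req;is_stat_get;node\n"
        "cc002.pg-207.computing.dcu.ie marked available\n"
        "04/05/2006 15:44:16;0040;PBS_Server;Req;is_stat_get;node "
        "cc003.pg-207.computing.dcu.ie marked available\n"
    )
    for node in nodes_marked_available(sample):
        print(node)
```

Run over the full excerpt, this lists which nodes the server reported as available during that window, which can then be compared with the nodes that actually produced output.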



Given that the shelltest has two 'commands'
(hostname and date/time),
and, from Google, that Torque error 15010 is 'System error occurred',
it looks to me like the original 2 nodes work but the added nodes do
not (logic: 2 nodes * 2 commands = 4 errors).
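
The arithmetic above can be checked against the two files directly. A rough sketch, assuming one "error 15010 on spawn" line per failed command and one hostname line per node that ran. The function name is hypothetical, and the expected node list is supplied by hand rather than obtained from Torque.

```python
def summarize_shelltest(err_text, out_text, expected_nodes):
    """Count spawn errors and work out which expected nodes never reported."""
    spawn_errors = sum(
        1 for line in err_text.splitlines() if "error 15010 on spawn" in line
    )
    reported = [
        line.strip() for line in out_text.splitlines()
        if line.strip() in expected_nodes
    ]
    missing = [node for node in expected_nodes if node not in reported]
    return spawn_errors, reported, missing


if __name__ == "__main__":
    domain = "pg-207.computing.dcu.ie"
    expected = [f"cc00{i}.{domain}" for i in range(1, 5)]
    # Contents of shelltest.err and shelltest.out as posted above.
    err = "error 15010 on spawn\n" * 4
    out = (
        f"cc002.{domain}\n"
        f"cc001.{domain}\n"
        "Hello, date is 04/05/06, time is 15:58:34\n"
        "Hello, date is 04/05/06, time is 15:58:46\n"
    )
    errors, reported, missing = summarize_shelltest(err, out, expected)
    commands_per_node = 2  # hostname and date/time
    print("spawn errors:", errors)
    print("nodes missing from output:", missing)
    print("matches guess:", errors == len(missing) * commands_per_node)
```

If the error count equals the number of missing nodes times the commands per node, that is consistent with the guess that only the newly added nodes are failing, though it does not prove it.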


/nc




Neil Costigan wrote:

> Bernard Li wrote:
>
>> What's the output of 'pbsnodes -a'?
>>
>> 
>>
>
> pbsnodes -a returns that all are unknown or down
>
> [EMAIL PROTECTED] oscar]# pbsnodes -a
> cc001.pg-207.computing.dcu.ie
>     state = state-unknown,down
>     np = 1
>     properties = all
>     ntype = cluster
>
> cc002.pg-207.computing.dcu.ie
>     state = state-unknown,down
>     np = 1
>     properties = all
>     ntype = cluster
>
> cc003.pg-207.computing.dcu.ie
>     state = state-unknown,down
>     np = 1
>     properties = all
>     ntype = cluster
>
> cc004.pg-207.computing.dcu.ie
>     state = state-unknown,down
>     np = 1
>     properties = all
>     ntype = cluster
>
>
>> Is pbs_mom running on all your client nodes?
>>
>> 
>>
>
> Running ps aux | grep pbs_mom on all nodes shows it is.
>
>
> i have tried moving the pbs_oscar alias from the private to the public
> address in /etc/hosts
> with no success
>
> to recap.
>
>    * OSCAR version 4.2.1b5
>    * Fedora Core 3
>    * x86
>
> - successfully passed test_cluster after initial set up with head node
> and two compute nodes. happy days.
> - test fails after adding two new nodes which are up and alive. can
> mount /home and pass ssh pings, pvm etc.
> but fail pbs
>
> /opt/pbs/bin/pbsnodes: cannot connect to server pbs_oscar, error=111
> then fails with not enough free nodes.
>
>
> /nc
>
>> Cheers,
>>
>> Bernard
>>
>> 
>>
>>> well it was going well
>>>
>>> i added two more nodes
>>> and now it fails
>>>
>>> [EMAIL PROTECTED] oscar]# testing/test_cluster
>>> Performing root tests...
>>> Maui service check:maui                                    [PASSED]
>>> Shutting down TORQUE Server:                               [  OK  ]
>>> Connection refused
>>> /opt/pbs/bin/pbsnodes: cannot connect to server pbs_oscar, error=111
>>> Torque node check                                          [PASSED]
>>> Starting TORQUE Server:                                    [  OK  ]
>>> Torque service check:pbs_server                            [PASSED]
>>> /home mounts                                               [PASSED]
>>>
>>> Preparing user tests...
>>> Performing user tests...
>>> SSH ping test                                              [PASSED]
>>> SSH server->node                                           [PASSED]
>>> SSH node->server                                           [PASSED]
>>> Checking for 4 free nodes:                                 [FAILED]
>>> Not enough free nodes. Tests incomplete.
>>> Checking for 4 free nodes:                                 [FAILED]
>>> Not enough free nodes. Tests incomplete.
>>> Checking for 4 free nodes:                                 [FAILED]
>>> Not enough free nodes. Tests incomplete.
>>> Torque default queue definition                            [PASSED]
>>> Checking for 4 free nodes:                                 [FAILED]
>>> Not enough free nodes. Tests incomplete.
>>> Ganglia setup test                                         [PASSED]
>>> Ganglia node count test                                    [PASSED]
>>>
>>>  
>>
>
>

