DongInn
I had checked these before, but I have re-checked them as follows:
1.      /etc/hosts is the same across the cluster.
2.      I can ssh to a node and back without any problems or password prompt.
The known_hosts file has been updated and copied across the cluster.
3.      checked nagios/nrpe and it is set up to allow the admin node to collect
details.
4.      ganglia/gmond is set up to talk to the admin node.
5.      pbs_server and maui on the admin have been restarted with no reported 
errors in the log files.
6.      pbs_mom on the nodes has been restarted with no reported errors in the 
log files.
7.      a search through /etc and /var/lib/torque for the old IP address of the
server doesn't find anything other than old log entries.
8.      /etc/dhcp/dhcpd.conf has been updated.
9.      /etc/ntp.conf has been updated across the cluster.
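
For anyone wanting to repeat the search in item 7, here is a minimal Python
sketch of one way to do it. It is not the command that was actually run for
this check: the function name find_ip_references is hypothetical, and it
assumes the files are plain text and readable by the user running it.

```python
import os
import re


def find_ip_references(root, ip):
    """Return (path, line_number, line) for every line under root that mentions ip."""
    # Match the address only as a whole token, so 10.0.0.1 does not match
    # 10.0.0.10 or 110.0.0.1.
    pattern = re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?!\.?\d)")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only regular files are read; sockets, FIFOs and broken links are skipped.
            if not os.path.isfile(path):
                continue
            try:
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                # Files that cannot be opened (for example, permissions) are skipped.
                continue
    return hits
```

Calling find_ip_references("/etc", old_ip) and then
find_ip_references("/var/lib/torque", old_ip), with old_ip set to the
server's previous address, lists each file and line that still mentions it.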

Thanks 

---------------------------------------------------------------------
Richard A. Young
ICT Services
Email: richard.yo...@usq.edu.au   Phone: (07) 46315557   
Mob:   0437544370          Fax:   (07) 46312798 
---------------------------------------------------------------------

-----Original Message-----
From: Kim, DongInn [mailto:di...@indiana.edu] 
Sent: Tuesday, 31 May 2016 12:05 PM
To: Users OSCAR
Subject: Re: [Oscar-users] Jobs not running on reconfigured cluster

Hi Richard,

I would double-check the following items if I were you.

1. /etc/hosts, ssh keys, nagios/nrpe, gmetad/gmond are all synced across all
the nodes.
2. Make sure that the root user can ssh into all the nodes, back and forth,
without a password.
3. All the job-submission daemons are running on all the nodes:
    (torque-server and torque-mom on the head node, torque-mom on the client
nodes, and maui on the head node)
    I assume that you are using torque as the RM and maui as the scheduler.
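
Items 2 and 3 can be checked from the head node in one pass. The following
is a minimal Python sketch, not an OSCAR or torque tool: the function names
are hypothetical, and it assumes ssh and pgrep are available on the nodes
and that the daemon process is named pbs_mom, as in this thread.

```python
import subprocess


def ssh_command(host, remote_command):
    """Build an ssh argv that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",     # never fall back to an interactive password prompt
        "-o", "ConnectTimeout=5",  # give up quickly on unreachable nodes
        host,
        remote_command,
    ]


def check_daemon(host, daemon, run=subprocess.run):
    """Return True if passwordless ssh to host works and daemon is running there."""
    try:
        result = run(
            ssh_command(host, f"pgrep -x {daemon}"),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=15,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # ssh passes back pgrep's exit status: 0 means at least one process matched.
    # A failed login gives a non-zero status as well, so both problems show up here.
    return result.returncode == 0


def failing_hosts(hosts, daemon, run=subprocess.run):
    """Return the hosts where either the login or the daemon check failed."""
    return [host for host in hosts if not check_daemon(host, daemon, run)]
```

For example, failing_hosts(["node1", "node2"], "pbs_mom") returns the nodes
that need a closer look; node1 and node2 are placeholder host names.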

Regards,

--
- DongInn



> On May 30, 2016, at 7:25 PM, Richard Young <richard.yo...@usq.edu.au> wrote:
> 
> I was hoping somebody would be able to help me with the following problem.
> 
> Recently I applied updates and did some reconfiguration on a RHEL 6.8
> cluster running Oscar. The major change was to the IP address of the
> oscar_server, which was required because of changes to the network
> structure. The new IP address has been applied to /etc/hosts, ssh keys,
> nagios/nrpe, gmetad/gmond, etc. However, I have missed something, because
> no jobs will now run on the cluster. The jobs just sit in the queue and
> then get cancelled because they have hit their walltime.
> 
> Has anybody come across this problem before, and can anyone supply some
> insight into how to fix it?
> 
> Thanks
> 
> ---------------------------------------------------------------------
> Richard A. Young
> ICT Services
> HPC Systems Engineer
> University of Southern Queensland
> Toowoomba, Queensland 4350
> Australia
> Email: richard.yo...@usq.edu.au   Phone: (07) 46315557
> Mob:   0437544370          Fax:   (07) 46312798
> ---------------------------------------------------------------------
> 
> 
> 
> _____________________________________________________________
> This email (including any attached files) is confidential and is for the 
> intended recipient(s) only. If you received this email by mistake, please, as 
> a courtesy, tell the sender, then delete this email.
> 
> The views and opinions are the originator's and do not necessarily reflect 
> those of the University of Southern Queensland. Although all reasonable 
> precautions were taken to ensure that this email contained no viruses at the 
> time it was sent we accept no liability for any losses arising from its 
> receipt.
> 
> The University of Southern Queensland is a registered provider of education 
> with the Australian Government.
> (CRICOS Institution Code QLD 00244B / NSW 02225M, TEQSA PRV12081 )
> 
> 
> _______________________________________________
> Oscar-users mailing list
> Oscar-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/oscar-users




