Hardware : Dual-Opteron 246 Motherboard Tyan S2885 + gigabit switch 3-com
OS : Fedora Core 2 i386 + OSCAR 4.0
Cluster : 4 Nodes + 1 Front-end
www.ird.nc/UR65/ROMS
 
Hi,
 
I would like confirmation of this: http://sourceforge.net/mailarchive/message.php?msg_id=8851042, because it may explain my trouble.
" In PBS, qstat will show all jobs and their state.  Keep in mind, that in
typical OSCAR clusters, it is Maui (the job scheduler) which reads PBS's
information about nodes and queues, and instructs PBS on when to run a
given job.  If a job isn't running, Maui may be the place to look.
#1  Make sure Maui is running.
#2  Make sure pbs_sched (PBS's included dumbed-down FIFO scheduler) isn't
running and locking the pbs_server port
#3  Use Maui utilities (checkjob,showq?) to investigate and find out why
Maui is or isn't running a given job"
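
The three checks quoted above can be sketched as a short shell script. This is only a sketch: it assumes the daemons show up in the process table under the names maui and pbs_sched, that pgrep is installed, and that the Maui client commands are on the PATH. The job ID is a placeholder passed as the first argument.

```shell
#!/bin/sh
# Report whether a daemon with exactly this process name is running.
# Assumption: the process names are "maui" and "pbs_sched".
daemon_state() {
    if pgrep -x "$1" >/dev/null 2>&1; then
        echo "$1: running"
    else
        echo "$1: not running"
    fi
}

daemon_state maui        # check 1: Maui should be running
daemon_state pbs_sched   # check 2: pbs_sched should NOT be running

# Check 3: ask Maui about the queue and about one job, if its tools exist.
# JOBID is a placeholder; pass a real job ID as the first argument.
JOBID="${1:-}"
if command -v showq >/dev/null 2>&1; then
    showq
    if [ -n "$JOBID" ]; then
        checkjob "$JOBID"
    fi
fi
```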
 
 
I am having trouble with Maui. If I start Maui, pbs_server complains that it cannot reach the scheduler, with a message like:
"Connection refused (111) in contact_sched, Could not contact Scheduler"
 
I suppose the cause is Maui itself, which stops running immediately after it starts. I don't know why yet.
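
Since error 111 is ECONNREFUSED, one way to narrow this down is to test whether anything accepts TCP connections on the scheduler port. A minimal sketch, assuming bash (for its /dev/tcp redirection) and assuming the scheduler port is the usual PBS default of 15004; the helper name port_state is made up for illustration, and the port number should be checked against the local configuration.

```shell
#!/bin/bash
# Print "open" if a TCP connection to host:port succeeds, "closed" otherwise.
# Relies on bash's /dev/tcp pseudo-device, so it must run under bash.
port_state() {
    local host="$1" port="$2"
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}

# Assumption: 15004 is the scheduler port on this cluster.
echo "scheduler port on localhost: $(port_state localhost 15004)"
```

If the port reports closed right after Maui is started, that is consistent with Maui having exited, though it does not say why.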
 
If I start pbs_sched instead, all is fine:
06/27/2005 16:02:09;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command time
06/27/2005 16:03:06;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command new
06/27/2005 16:03:06;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command recyc
 
I can run PBS commands such as qstat, qsub, qrun, etc. With lamboot -v or pbsnodes -a I can see all my nodes, and they are free.
But, strangely, PBS does not seem to get the correct node map (Resource_List always shows ncpus = 1), and I cannot find pbs.conf.
 
For example:
[EMAIL PROTECTED] umr65]$  qsub -lnodes=2 -I
qsub: waiting for job 47.editr.cluster.ird.nc to start
Do you wish to terminate the job and exit (y|[n])? y
Job 47.editr.cluster.ird.nc is being deleted
[EMAIL PROTECTED] umr65]$ qstat -f
Job Id: 47.editr.cluster.ird.nc
    Job_Name = STDIN
    Job_Owner = [EMAIL PROTECTED]
    job_state = Q
    queue = workq
    server = editr.cluster.ird.nc
    Checkpoint = u
    ctime = Thu Jun 30 17:04:02 2005
    Error_Path = editr.cluster.ird.nc:/home/umr65/STDIN.e47
    exec_host = node1.cluster.ird.nc/0+editr.cluster.ird.nc/0
    Hold_Types = n
    interactive = True
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Thu Jun 30 17:04:15 2005
    Output_Path = editr.cluster.ird.nc:/home/umr65/STDIN.o47
    Priority = 0
    qtime = Thu Jun 30 17:04:02 2005
    Rerunable = True
    Resource_List.cput = 10000:00:00
    Resource_List.ncpus = 1
    Resource_List.nodect = 2
    Resource_List.nodes = 2
    Resource_List.walltime = 10000:00:00
    Variable_List = PBS_O_HOME=/home/umr65,PBS_O_LANG=fr_FR.UTF-8,
        PBS_O_LOGNAME=umr65,
        PBS_O_PATH=/opt/intel_fc_81/bin:/usr/pgi/linux86/5.2/bin:
        /opt/intel_fc_81/bin:/usr/pgi/linux86/5.2/bin:/usr/kerberos/bin:
        /opt/lam-7.1_pgi/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:
        /opt/env-switcher/bin:/opt/kernel_picker/bin:/opt/pvm3/lib:
        /opt/pvm3/lib/LINUX:/opt/pvm3/bin/LINUX:/opt/c3-4/:/opt/pbs/bin:
        /opt/pbs/lib/xpbs/bin:/opt/ferret_V58/bin:/opt/netcdf-3.6_pgi/bin:
        /opt/NCO_300/bin:/home/umr65/bin:./:/opt/ferret_V58/bin:
        /opt/netcdf-3.6_pgi/bin:/opt/NCO_300/bin,
        PBS_O_MAIL=/var/spool/mail/umr65,PBS_O_SHELL=/bin/bash,
        PBS_O_HOST=editr.cluster.ird.nc,PBS_O_WORKDIR=/home/umr65,
        PBS_O_QUEUE=workq
    comment = Job started on Thu Jun 30 at 17:04
    etime = Thu Jun 30 17:04:02 2005
 [EMAIL PROTECTED] umr65]$
 ********************************************************************
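
To compare what a job requested with what each node offers, the Resource_List fields can be pulled out of the qstat -f output. A small sketch using awk; the helper name resource_list is made up for illustration and is not part of PBS, and it only reads text on standard input.

```shell
#!/bin/sh
# Print "name value" for every Resource_List entry found on standard input.
# resource_list is a hypothetical helper, not a PBS command.
resource_list() {
    awk -F' = ' '/^[ \t]*Resource_List\./ {
        name = $1
        sub(/^[ \t]*Resource_List\./, "", name)
        print name, $2
    }'
}

# Typical use on the front-end, if qstat is available:
if command -v qstat >/dev/null 2>&1; then
    qstat -f | resource_list
fi
```

For the job shown above, the result would include ncpus 1 and nodes 2. When two processors per node are wanted, the request is normally written as qsub -l nodes=2:ppn=2 -I, and the np value that pbsnodes -a reports for each node shows how many processors the server believes that node has.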
 
So, my questions:
Do you think this trouble is entirely due to Maui being stopped?
Is Maui the default scheduler, and can PBS have trouble without it even when pbs_sched is running?
In a default OSCAR cluster, can PBS manage a job without information from pbs.conf, relying only on the information Maui probes?
 
Many thanks for your help and confirmation.
 
Cheers
 
Jerome Lefevre
 
 
 
 
 
 
 
 
