|
Hardware: Dual-Opteron 246, Tyan S2885 motherboard
+ 3Com gigabit switch
OS: Fedora Core 2 i386 + OSCAR 4.0
Cluster: 4 nodes + 1 front-end
www.ird.nc/UR65/ROMS

Hi,
I would like confirmation of what is said in this message:
http://sourceforge.net/mailarchive/message.php?msg_id=8851042
because it may explain my trouble.
"In PBS, qstat will show all jobs and their state. Keep in mind that in
typical OSCAR clusters, it is Maui (the job scheduler) which reads PBS's
information about nodes and queues, and instructs PBS on when to run a given
job. If a job isn't running, Maui may be the place to look.
#1 Make sure Maui is running.
#2 Make sure pbs_sched (PBS's included dumbed-down FIFO scheduler) isn't
   running and locking the pbs_server port.
#3 Use Maui utilities (checkjob, showq?) to investigate and find out why
   Maui is or isn't running a given job."

I am having trouble with MAUI. If I start MAUI, pbs_server complains that it
cannot reach the scheduler, with a message like:
"Connection refused (111) in contact_sched, Could not contact Scheduler"
I suppose the cause is MAUI itself, which stops running immediately after
it starts. I don't know why yet.
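
For reference, the three checks from the quoted message could be scripted roughly as follows. This is only a sketch: it assumes the Maui daemon runs as a process named "maui", that the PBS FIFO scheduler runs as "pbs_sched", and that pgrep is installed. The function names are mine, not part of PBS or Maui.

```shell
#!/bin/sh
# Sketch of the three checks from the quoted advice.
# Assumptions: process names "maui" and "pbs_sched"; pgrep is available.

is_running() {
    # Succeeds if a process with exactly this name exists.
    pgrep -x "$1" >/dev/null 2>&1
}

check_schedulers() {
    # Check #1: Maui should be running.
    if is_running maui; then
        echo "maui: running"
    else
        echo "maui: NOT running"
    fi
    # Check #2: pbs_sched should not be running alongside Maui.
    if is_running pbs_sched; then
        echo "pbs_sched: running (may be holding the scheduler port)"
    else
        echo "pbs_sched: not running"
    fi
}

inspect_job() {
    # Check #3: ask Maui about one job. $1 is a job id, e.g. 47.
    if command -v checkjob >/dev/null 2>&1; then
        checkjob "$1"
    else
        echo "checkjob not found in PATH"
    fi
}

check_schedulers
```

Calling inspect_job with a job id afterwards (for example inspect_job 47) only makes sense once the first line reports that maui is running.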
If I start pbs_sched instead, all is fine:
06/27/2005 16:02:09;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command time
06/27/2005 16:03:06;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command new
06/27/2005 16:03:06;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command recyc

I can run PBS commands such as qstat, qsub, qrun, etc. With lamboot -v or
pbsnodes -a I can see all my nodes, and they are free.
But, strangely, PBS does not seem to get the correct node map (ncpus is
always 1 in Resource_List), and I cannot find pbs.conf.
For example:
[EMAIL PROTECTED] umr65]$ qsub -lnodes=2 -I
qsub: waiting for job 47.editr.cluster.ird.nc to start
Do you wish to terminate the job and exit (y|[n])? y
Job 47.editr.cluster.ird.nc is being deleted
[EMAIL PROTECTED] umr65]$ qstat -f
Job Id: 47.editr.cluster.ird.nc
    Job_Name = STDIN
    Job_Owner = [EMAIL PROTECTED]
    job_state = Q
    queue = workq
    server = editr.cluster.ird.nc
    Checkpoint = u
    ctime = Thu Jun 30 17:04:02 2005
    Error_Path = editr.cluster.ird.nc:/home/umr65/STDIN.e47
    exec_host = node1.cluster.ird.nc/0+editr.cluster.ird.nc/0
    Hold_Types = n
    interactive = True
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Thu Jun 30 17:04:15 2005
    Output_Path = editr.cluster.ird.nc:/home/umr65/STDIN.o47
    Priority = 0
    qtime = Thu Jun 30 17:04:02 2005
    Rerunable = True
    Resource_List.cput = 10000:00:00
    Resource_List.ncpus = 1
    Resource_List.nodect = 2
    Resource_List.nodes = 2
    Resource_List.walltime = 10000:00:00
    Variable_List = PBS_O_HOME=/home/umr65,PBS_O_LANG=fr_FR.UTF-8,
        PBS_O_LOGNAME=umr65,
        PBS_O_PATH=/opt/intel_fc_81/bin:/usr/pgi/linux86/5.2/bin:/opt/intel_fc
        _81/bin:/usr/pgi/linux86/5.2/bin:/usr/kerberos/bin:/opt/lam-7.1_pgi/bin
        :/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/opt/env-switcher/bin:/opt
        /kernel_picker/bin:/opt/pvm3/lib:/opt/pvm3/lib/LINUX:/opt/pvm3/bin/LINU
        X:/opt/c3-4/:/opt/pbs/bin:/opt/pbs/lib/xpbs/bin:/opt/ferret_V58/bin:/op
        t/netcdf-3.6_pgi/bin:/opt/NCO_300/bin:/home/umr65/bin:./:/opt/ferret_V5
        8/bin:/opt/netcdf-3.6_pgi/bin:/opt/NCO_300/bin,
        PBS_O_MAIL=/var/spool/mail/umr65,PBS_O_SHELL=/bin/bash,
        PBS_O_HOST=editr.cluster.ird.nc,PBS_O_WORKDIR=/home/umr65,
        PBS_O_QUEUE=workq
    comment = Job started on Thu Jun 30 at 17:04
    etime = Thu Jun 30 17:04:02 2005
[EMAIL PROTECTED] umr65]$

********************************************************************

So, my question:
Do you think this trouble is 100% related to MAUI being stopped?
Is MAUI the default scheduler, and can PBS have trouble without it, even
when pbs_sched is running?
In a default OSCAR cluster, is PBS able to manage a job without information
from pbs.conf, relying only on the information gathered by MAUI?
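
In case it helps when comparing several jobs, the Resource_List fields can be pulled out of the qstat -f output with a small filter. This is only a sketch: resource_list is a name I made up, and it assumes the "name = value" layout shown in the output above.

```shell
#!/bin/sh
# Sketch: print only the Resource_List.* attributes from "qstat -f" output
# read on stdin, one "name=value" pair per line.
# Assumption: attributes appear as "    Resource_List.<name> = <value>".

resource_list() {
    awk -F' = ' '/^[ \t]*Resource_List\./ {
        key = $1
        # Strip the indentation and the "Resource_List." prefix.
        sub(/^[ \t]*Resource_List\./, "", key)
        print key "=" $2
    }'
}
```

It would be used as: qstat -f 47 | resource_list. For the job above, this makes it easy to see ncpus = 1 next to nodect = 2.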
Many thanks for your help and
confirmation.
Cheers
Jerome Lefevre
|
