Ronny,
I'm not sure what could be happening on your side, but in case it is useful,
I'll tell you how I got it partially working.
On my cluster I have:
OS: CentOS 4.3
MPI: Open MPI 1.0.2
Torque: Torque 2.0.0p8
Maui: Maui 3.2.6p14
The torque configuration is not very exciting and the maui configuration is
minimal (I include them below).
With this I am able to submit a preemptable job with QOS:low (the default) by
doing:
[EMAIL PROTECTED] qsub -l nodes=4:ppn=4 submit-greetings
Submit-greetings is just:
#!/bin/bash
cd $PBS_O_WORKDIR
NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
cat $PBS_NODEFILE
/usr/local/openmpi/openmpi-1.0.2/bin/mpiexec -machinefile $PBS_NODEFILE -n $NP ./greetings
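In case it helps to see what the script is counting: with nodes=4:ppn=4,
Torque lists each node once per processor slot in $PBS_NODEFILE, so the file
should have 16 lines and NP ends up as 16. A small sketch to check that from
inside a job (the function names slots_per_node and total_slots are mine, not
part of Torque):

```shell
#!/bin/bash
# Hypothetical helpers for inspecting a Torque-style node file,
# i.e. a file with one hostname per allocated processor slot.

# Print "<node> <slots>" for each distinct host in the file.
slots_per_node() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Print the total number of slots (what the script above stores in NP).
total_slots() {
    wc -l < "$1" | tr -d ' '
}
```

Calling slots_per_node "$PBS_NODEFILE" in the job script should print four
lines, one per node, each with a count of 4.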
Then I can submit a preemptor job with:
[EMAIL PROTECTED] qsub -l nodes=4:ppn=4 submit-greetings -W x="QOS:hi"
With this Maui "suspends" the job successfully (at least that is what the
maui.log says):
04/20 11:29:10 INFO: 16 feasible tasks found for job 305:0 in partition
DEFAULT (16 Needed)
04/20 11:29:10 INFO: inadequate feasible tasks found for job 305:0 (0 < 16)
04/20 11:29:10 INFO: inadequate nodes found for job 305:0 (0 < 4)
04/20 11:29:10 MJobSelectPJobList(305,16,4,FJobList,PJList,PTCList,PNCList,PTL)
04/20 11:29:10 MRMJobSuspend(304,Msg,SC)
04/20 11:29:10 MPBSJobSuspend(304,BOLDO,Msg,SC)
04/20 11:29:10 INFO: job '304' successfully suspended
04/20 11:29:10 MResDestroy(304)
04/20 11:29:10 MResChargeAllocation(304,2)
04/20 11:29:10 INFO: attribute 'PREEMPTEE' set for job 304
04/20 11:29:10 ERROR: invalid nodelist for job 305:0 (inadequate taskcount,
0 < 16)
04/20 11:29:10 ERROR: cannot allocate nodes to job '305' in partition DEFAULT
04/20 11:29:10 MJobPReserve(305,DEFAULT,ResCount,ResCountRej)
04/20 11:29:10 MJobReserve(305,Priority)
04/20 11:29:10 INFO: 16 feasible tasks found for job 305:0 in partition
DEFAULT (16 Needed)
04/20 11:29:10 INFO: 16 feasible tasks found for job 305:0 in partition
DEFAULT (16 Needed)
04/20 11:29:10 INFO: located resources for 16 tasks (16) in best partition
DEFAULT for job 305 at time 00:00:01
04/20 11:29:10 INFO: tasks located for job 305: 16 of 16 required (16
feasible)
04/20 11:29:10 MJobDistributeTasks(305,BOLDO,NodeList,TaskMap)
04/20 11:29:10 MResJCreate(305,MNodeList,00:00:01,Priority,Res)
04/20 11:29:10 INFO: job '305' reserved 16 tasks (partition DEFAULT) to
start in 00:00:01 on Thu Apr 20 11:29:11
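To follow a single job through the log I find it easier to filter maui.log by
job id. This is only a rough sketch: the function name job_events is mine, and
the pattern is an assumption based on the log lines above (ids appear as
"(304,", "(304)", "job '304'" or "job 304"), so it may miss other formats.

```shell
#!/bin/bash
# Hypothetical helper: print the maui.log lines that mention a given job id.
# The pattern only covers the forms seen in the excerpt above and
# deliberately avoids matching digits inside the timestamps.
job_events() {
    local id="$1" logfile="$2"
    grep -E "(job '?|\\()${id}([,:') ]|\$)" "$logfile"
}
```

For example, job_events 304 maui.log should show the MRMJobSuspend,
MPBSJobSuspend and "successfully suspended" lines for the preempted job.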
But the suspension is not perfect. Looking at the load on the different nodes, I
can see that on the node where the job started everything is fine (I have 4
greetings processes stopped, state T, and 4 running), but on the other nodes
8 greetings processes are running...
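For anyone who wants to repeat the check, this is roughly how the process
states can be counted per node. It is a sketch under a few assumptions: the
function name count_states is mine, ps is the procps version (as on CentOS),
and the nodes accept passwordless ssh.

```shell
#!/bin/bash
# Hypothetical helper: read "<state> <command>" lines on stdin (the format
# printed by `ps -e -o state= -o comm=`) and print how many processes of
# the given command are in each state, e.g. "R=4" and "T=4".
count_states() {
    local name="$1"
    awk -v name="$name" '
        $2 == name { n[substr($1, 1, 1)]++ }
        END { for (s in n) printf "%s=%d\n", s, n[s] }
    ' | sort
}

# Possible usage from inside a job (assumes passwordless ssh to the nodes):
#   for node in $(sort -u "$PBS_NODEFILE"); do
#       echo "== $node"
#       ssh "$node" ps -e -o state= -o comm= | count_states greetings
#   done
```

On a correctly suspended job I would expect every node to report T for the
preempted processes, not only the node where the job started.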
Also the REMAINING time reported by Maui keeps decreasing for the Suspended job,
which is not ideal.
Does anyone know whether these problems can be solved somehow?
Thanks,
Angel de Vicente
===============================
Torque configuration
--------------------
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server pbs_version = 2.0.0p8
Qmgr: print queue batch
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
Maui configuration
-------------------
# maui.cfg 3.2.6p14
SERVERHOST boldo
ADMIN1 root
RMCFG[BOLDO] TYPE=PBS
AMCFG[bank] TYPE=NONE
RMPOLLINTERVAL 00:00:30
SERVERPORT 42559
SERVERMODE NORMAL
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
# Job Priority: http://clusterresources.com/mauidocs/5.1jobprioritization.html
QUEUETIMEWEIGHT 1
QOSWEIGHT 10
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY MINRESOURCE
# QOS: http://clusterresources.com/mauidocs/7.3qos.html
QOSCFG[hi] PRIORITY=100 QFLAGS=PREEMPTOR
QOSCFG[low] PRIORITY=-1000 QFLAGS=PREEMPTEE
CLASSCFG[batch] QDEF=low QLIST=hi:low
PREEMPTPOLICY SUSPEND
--
----------------------------------
http://www.iac.es/galeria/angelv/
PostDoc Software Support
Instituto de Astrofisica de Canarias