Dear list,
I am trying to set up a very basic Torque+Maui system. I have been running
a Torque cluster for a year now and want to improve the scheduling with
Maui. To this end, I installed a fresh test system, with server and node
on a single computer.
Torque version: 2.4.16
Maui version: 3.3.1
uname: Linux testing 3.2.0-20-generic #33-Ubuntu SMP Tue Mar 27 16:42:26 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
I was able to run (simple) jobs with the Torque scheduler. Since I
replaced the scheduler with Maui, jobs stay queued. Jobs are submitted with:
$ qsub -q batch test-script.sh
where test-script.sh is nothing more than a 'sleep 1m' script. Checking
the job:
# checkjob -v 55
checking job 55 (RM job '55.testing.azr.nl')
State: Idle EState: Deferred
Creds: user:sebastiaan group:sebastiaan class:batch qos:DEFAULT
WallTime: 00:00:00 of 6:00:00
SubmitTime: Thu Apr 5 13:21:33
(Time Queued Total: 00:00:32 Eligible: 00:00:01)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 15G
Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1 MEM: 2000M SWAP: 15G
NodeAccess: SHARED
NodeCount: 1
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
job is deferred. Reason: NoResources (cannot create reservation for
job '55' (intital reservation attempt)
)
Holds: Defer (hold reason: NoResources)
PE: 16.03 StartPriority: 1
cannot select job 55 for partition DEFAULT (job hold active)
This shows that no resources are available, although the node is free and
unloaded:
# checknode testing
checking node testing.azr.nl
State: Idle (in current state for 2:23:54)
Configured Resources: PROCS: 2 MEM: 984M SWAP: 1996M DISK: 1M
Utilized Resources: SWAP: 149M
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.050
Network: [DEFAULT]
Features: [NONE]
Attributes: [Batch]
Classes: [batch 2:2]
Total Time: 16:11:49 Up: 16:11:49 (100.00%) Active: 00:01:00 (0.10%)
Reservations:
NOTE: no reservations on node
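One way to read the two outputs side by side is to compare the "Dedicated Resources Per Task" line from checkjob with the "Configured Resources" line from checknode. The sketch below does only that, with the values copied by hand from the output above. The helper names (`to_mb`, `shortfalls`) are hypothetical, 1G is assumed to be 1024M, and whether Maui applies exactly this comparison when counting feasible tasks is an assumption, not something confirmed here.

```python
# Values copied by hand from the checkjob and checknode output above.
# Assumption: a task is feasible only if every dedicated per-task
# resource fits within the node's configured resources.

UNITS_MB = {"M": 1, "G": 1024}  # assumption: 1G = 1024M


def to_mb(value: str) -> int:
    """Convert strings such as '2000M' or '15G' to megabytes."""
    return int(value[:-1]) * UNITS_MB[value[-1]]


def shortfalls(requested: dict, configured: dict) -> dict:
    """Return the resources where the request exceeds the configured amount."""
    return {
        name: (need, configured.get(name, 0))
        for name, need in requested.items()
        if need > configured.get(name, 0)
    }


job_per_task = {"PROCS": 1, "MEM": to_mb("2000M"), "SWAP": to_mb("15G")}
node_configured = {"PROCS": 2, "MEM": to_mb("984M"), "SWAP": to_mb("1996M")}

for name, (need, have) in shortfalls(job_per_task, node_configured).items():
    print(f"{name}: requested {need}, configured {have}")
```

If this comparison is close to what Maui does, both the MEM and the SWAP figures in the checkjob output would exceed what the node reports, while PROCS would fit.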
When the job is added, maui.log shows this:
04/05 13:21:34 MPBSJobLoad(55,55.testing.azr.nl,J,TaskList,0)
04/05 13:21:34 MReqCreate(55,SrcRQ,DstRQ,DoCreate)
04/05 13:21:34 INFO: processing node request line '1'
04/05 13:21:34 MJobSetCreds(55,sebastiaan,sebastiaan,)
04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
04/05 13:21:34 INFO: job '55' loaded: 1 sebastiaan sebastiaan
21600 Idle 0 1333624893 [NONE] [NONE] [NONE] >= 0 >=
0 [1][ppn=1] 1333624894
04/05 13:21:34 INFO: 12 PBS jobs detected on RM TESTING
04/05 13:21:34 INFO: jobs detected: 12
04/05 13:21:34 MStatClearUsage(node,Active)
04/05 13:21:34 MClusterUpdateNodeState()
04/05 13:21:34 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
04/05 13:21:34 INFO: job '40' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '41' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '42' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '44' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '45' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '47' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '48' Priority: 16
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '49' Priority: 12
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '52' Priority: 8
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '53' Priority: 1
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '54' Priority: 60
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '55' Priority: 1
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 MStatClearUsage([NONE],Active)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 [EState: 11]
04/05 13:21:34 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
04/05 13:21:34 INFO: job '40' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '41' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '42' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '44' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '45' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '47' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '48' Priority: 16
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '49' Priority: 12
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '52' Priority: 8
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '53' Priority: 1
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '54' Priority: 60
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 INFO: job '55' Priority: 1
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
04/05 13:21:34 MStatClearUsage([NONE],Idle)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 [EState: 11]
04/05 13:21:34
MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1
04/05 13:21:34 MQueueScheduleRJobs(Q)
04/05 13:21:34
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1
04/05 13:21:34
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 1/1
04/05 13:21:34 MQueueScheduleIJobs(Q,DEFAULT)
04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in
partition DEFAULT (1 Needed)
04/05 13:21:34 MJobPReserve(55,DEFAULT,ResCount,ResCountRej)
04/05 13:21:34 MJobReserve(55,Priority)
04/05 13:21:34 ALERT: job 55 cannot run in any partition
04/05 13:21:34 ALERT: cannot create new reservation for job 55
(shape[1] 1)
04/05 13:21:34 ALERT: cannot create new reservation for job 55
04/05 13:21:34 MJobSetHold(55,16,1:00:00,NoResources,cannot create
reservation for job '55' (intital reservation attempt)
)
04/05 13:21:34 ALERT: job '55' cannot run (deferring job for 3600
seconds)
04/05 13:21:34 WARNING: cannot reserve priority job '55'
Active Jobs------
------------------
04/05 13:21:34 INFO: resources available after scheduling: N: 1 P: 2
04/05 13:21:34
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 0/1
[EState: 1]
04/05 13:21:34
MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1
[EState: 1]
04/05 13:21:34
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1
[EState: 1]
04/05 13:21:34 MSchedUpdateStats()
04/05 13:21:34 INFO: iteration: 288 scheduling time: 0.002 seconds
04/05 13:21:34 MResUpdateStats()
04/05 13:21:34 INFO: current util[288]: 0/1 (0.00%) PH: 0.00%
active jobs: 0 of 2 (completed: 1)
04/05 13:21:34 MQueueCheckStatus()
04/05 13:21:34 MNodeCheckStatus()
04/05 13:21:34 MUClearChild(PID)
04/05 13:21:34 INFO: scheduling complete. sleeping 30 seconds
I think the relevant line is:
04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in
partition DEFAULT (1 Needed)
but I have no idea how to make a task feasible for the job. I have tried
submitting with -l nodes=1:ppn=1 -l walltime=2:00:00, etc., but none of
these seems to have had any effect.
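To pick lines like this out of a longer maui.log, a small filter along the following lines could be used. It is only a sketch: the function name `interesting_lines` is hypothetical, the substrings it matches are taken from the log excerpt above rather than from the Maui documentation, and the sample lines are copies of log lines with the mail wrapping joined back together.

```python
import re

# Assumption: these substrings mark the lines of interest; they come from
# the log excerpt above, not from any Maui documentation.
PATTERN = re.compile(r"feasible tasks|ALERT:|WARNING:")


def interesting_lines(lines, job_id):
    """Yield log lines that match PATTERN and mention the given job."""
    job = re.compile(rf"\bjob '?{re.escape(job_id)}\b")
    for line in lines:
        if PATTERN.search(line) and job.search(line):
            yield line.rstrip("\n")


# Sample lines copied from the log above, with wrapped lines rejoined.
sample = [
    "04/05 13:21:34 INFO: job '54' Priority: 60",
    "04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in partition DEFAULT (1 Needed)",
    "04/05 13:21:34 ALERT: job 55 cannot run in any partition",
    "04/05 13:21:34 WARNING: cannot reserve priority job '55'",
]

for line in interesting_lines(sample, "55"):
    print(line)
```

For a real log, the `sample` list could be replaced by an open file object, since the function only iterates over lines.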
Torque configuration. I have tried setting various queue attributes,
hoping that this would have some effect:
# qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch Priority = 20
set queue batch max_running = 8
set queue batch resources_max.ncpus = 8
set queue batch resources_max.nodect = 10
set queue batch resources_max.nodes = 2
set queue batch resources_min.ncpus = 0
set queue batch resources_default.mem = 2000mb
set queue batch resources_default.ncpus = 1
set queue batch resources_default.neednodes = 1:ppn=1
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.pvmem = 16000mb
set queue batch resources_default.walltime = 06:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = testing.azr.nl
set server log_events = 511
set server mail_from = adm
set server resources_available.nodect = 10
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 56
Maui configuration, untouched:
# maui.cfg 3.3.1
SERVERHOST testing
# primary admin must be first in list
ADMIN1 root
# Resource Manager Definition
RMCFG[TESTING] TYPE=PBS
# Allocation Manager Definition
AMCFG[bank] TYPE=NONE
# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration
RMPOLLINTERVAL 00:00:30
SERVERPORT 42559
SERVERMODE NORMAL
# Admin: http://supercluster.org/mauidocs/a.esecurity.html
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html
QUEUETIMEWEIGHT 1
# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
#FSPOLICY PSDEDICATED
#FSDEPTH 7
#FSINTERVAL 86400
#FSDECAY 0.80
# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html
# NONE SPECIFIED
# Backfill: http://supercluster.org/mauidocs/8.2backfill.html
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html
NODEALLOCATIONPOLICY MINRESOURCE
# QOS: http://supercluster.org/mauidocs/7.3qos.html
# QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html
# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test] 17:00:00
# SRDAYS[test] MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test] 0:30:00
# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
# USERCFG[DEFAULT] FSTARGET=25.0
# USERCFG[john] PRIORITY=100 FSTARGET=10.0-
# GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch] FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR
Any ideas?
Thanks in advance,
Sebastiaan
--
Sebastiaan Breedveld, MSc.
Ph.D. student
Erasmus MC - Daniel den Hoed Cancer Center
Department of Radiation Oncology
Groene Hilledijk 301
3075 EA Rotterdam
The Netherlands
Phone: +31 10 7042693
Room: Gs-20