Hello,
I could use some help figuring out why a reservation prevented jobs from
running when they should have run. I'm using torque 2.3.6 and maui 3.2.6p21.
Our cluster is an SGI ICE system, so the node names look like rXiYnZZ, where X
is the rack number, Y is the IRU number (0-3), and ZZ is the node number within
the IRU (0-15). We have 2 fully populated racks, so 8 IRUs and 128 nodes in
total. I wanted to give a few users dedicated time on 75% of the machine for
7 hours, which I did with the following command:
service0:~ # setres -n DAC -s 11:00_09/25 -d 0:07:00:00 -u lumpkin:lebeau:kboyles:bstewart 'r1i[0-3]n[0-9]|r2i[0-1]n[0-9]'
reservation created
reservation 'testing.0' created on 96 nodes (1152 tasks)
r1i0n0:1
r1i0n1:1
r1i0n2:1
r1i0n3:1
r1i0n4:1
<clip>
All seemed well at this point. The DAC reservation should have reserved 96
nodes, leaving 32 for other users. However, when the reservation took effect
yesterday, several jobs that should have run did not. No other reservations
or policies were in effect that should have prevented jobs from running. We
aren't using any QOS either; it's pretty much a FIFO queue with some soft and
hard limits on the number of jobs and number of procs for each user.
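As a sanity check on the counts above, here is a minimal Python sketch that enumerates the rXiYnZZ node names and applies the host expression from the setres command. It assumes the expression is matched unanchored (so "n[0-9]" also matches the leading digit of n10 through n15), which is consistent with the 96 nodes setres reported; the variable names and the 12 procs per node (1152 tasks / 96 nodes) are only for illustration.

```python
import re

# Node names follow rXiYnZZ: 2 racks, 4 IRUs per rack, 16 nodes per IRU.
nodes = [f"r{rack}i{iru}n{node}"
         for rack in (1, 2)
         for iru in range(4)
         for node in range(16)]

# Host expression passed to setres. Assumption: it is matched unanchored,
# so "n[0-9]" also matches the first digit of n10..n15.
host_expr = re.compile(r"r1i[0-3]n[0-9]|r2i[0-1]n[0-9]")

reserved = [n for n in nodes if host_expr.search(n)]
unreserved = [n for n in nodes if not host_expr.search(n)]

PROCS_PER_NODE = 12  # assumption: 1152 tasks / 96 nodes from the setres output

print(len(nodes))                      # total nodes in the cluster
print(len(reserved))                   # nodes matched by the expression
print(len(reserved) * PROCS_PER_NODE)  # procs covered by the reservation
print(len(unreserved))                 # nodes left outside the reservation
```

Under these assumptions the nodes left outside the reservation are all of r2i2 and r2i3. Note that this counts nodes outside the reservation, not idle nodes; anything already running there reduces what is actually free.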
All of the commands below were run just after the DAC reservation took
effect. The first job in the queue (83219) should have run, since there should
have been 32 nodes free, but for some reason maui did not start it. Even so,
all 3 of the 8-node jobs should have run. Looking at the "checkjob -v 83224"
output, maui thinks that essentially all the nodes are reserved (except for
the 8 nodes in use by 83223).
Any idea what might be going on here?
Thanks,
Darby
service0:~ # qstat -a
                                                              Req'd   Elap
Job ID               Username Queue    Jobname          NDS   Time  S Time
-------------------- -------- -------- ---------------- ----- ----- - -----
83219                aschwing huge     m0.40a0.00_SAES     32 04:00 Q    --
83223                stuart   medium   m0.27a30.0b20.0      8 04:00 R 01:55
83224                stuart   medium   m0.27a0.0b20.0       8 04:00 Q    --
83225                stuart   medium   m0.27-30.0b20.0      8 04:00 Q    --
service0:~ # checkjob -v 83224
checking job 83224 (RM job '83224.service0')
State: Idle
Creds: user:stuart group:eg3 class:medium qos:DEFAULT
WallTime: 00:00:00 of 4:00:00
SubmitTime: Sun Sep 25 10:51:08
(Time Queued Total: 00:17:40 Eligible: 00:17:32)
Total Tasks: 96
Req[0] TaskCount: 96 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SINGLEUSER
TasksPerNode: 12 NodeCount: 8
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
SystemQueueTime: Sun Sep 25 10:51:16
PE: 96.00 StartPriority: 17
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 96 procs found)
idle procs: 1440 feasible procs: 0
Rejection Reasons: [State : 8][ReserveTime : 120]
Detailed Node Availability Information:
r1i0n0 rejected : ReserveTime
r1i0n1 rejected : ReserveTime
r1i0n2 rejected : ReserveTime
r1i0n3 rejected : ReserveTime
r1i0n4 rejected : ReserveTime
r1i0n5 rejected : ReserveTime
r1i0n6 rejected : ReserveTime
r1i0n7 rejected : ReserveTime
r1i0n8 rejected : ReserveTime
r1i0n9 rejected : ReserveTime
r1i0n10 rejected : ReserveTime
r1i0n11 rejected : ReserveTime
r1i0n12 rejected : ReserveTime
r1i0n13 rejected : ReserveTime
r1i0n14 rejected : ReserveTime
r1i0n15 rejected : ReserveTime
r1i1n0 rejected : ReserveTime
r1i1n1 rejected : ReserveTime
r1i1n2 rejected : ReserveTime
r1i1n3 rejected : ReserveTime
r1i1n4 rejected : ReserveTime
r1i1n5 rejected : ReserveTime
r1i1n6 rejected : ReserveTime
r1i1n7 rejected : ReserveTime
r1i1n8 rejected : ReserveTime
r1i1n9 rejected : ReserveTime
r1i1n10 rejected : ReserveTime
r1i1n11 rejected : ReserveTime
r1i1n12 rejected : ReserveTime
r1i1n13 rejected : ReserveTime
r1i1n14 rejected : ReserveTime
r1i1n15 rejected : ReserveTime
r1i2n0 rejected : ReserveTime
r1i2n1 rejected : ReserveTime
r1i2n2 rejected : ReserveTime
r1i2n3 rejected : ReserveTime
r1i2n4 rejected : ReserveTime
r1i2n5 rejected : ReserveTime
r1i2n6 rejected : ReserveTime
r1i2n7 rejected : ReserveTime
r1i2n8 rejected : ReserveTime
r1i2n9 rejected : ReserveTime
r1i2n10 rejected : ReserveTime
r1i2n11 rejected : ReserveTime
r1i2n12 rejected : ReserveTime
r1i2n13 rejected : ReserveTime
r1i2n14 rejected : ReserveTime
r1i2n15 rejected : ReserveTime
r1i3n0 rejected : ReserveTime
r1i3n1 rejected : ReserveTime
r1i3n2 rejected : ReserveTime
r1i3n3 rejected : ReserveTime
r1i3n4 rejected : ReserveTime
r1i3n5 rejected : ReserveTime
r1i3n6 rejected : ReserveTime
r1i3n7 rejected : ReserveTime
r1i3n8 rejected : ReserveTime
r1i3n9 rejected : ReserveTime
r1i3n10 rejected : ReserveTime
r1i3n11 rejected : ReserveTime
r1i3n12 rejected : ReserveTime
r1i3n13 rejected : ReserveTime
r1i3n14 rejected : ReserveTime
r1i3n15 rejected : ReserveTime
r2i0n0 rejected : ReserveTime
r2i0n1 rejected : ReserveTime
r2i0n2 rejected : ReserveTime
r2i0n3 rejected : ReserveTime
r2i0n4 rejected : ReserveTime
r2i0n5 rejected : ReserveTime
r2i0n6 rejected : ReserveTime
r2i0n7 rejected : ReserveTime
r2i0n8 rejected : ReserveTime
r2i0n9 rejected : ReserveTime
r2i0n10 rejected : ReserveTime
r2i0n11 rejected : ReserveTime
r2i0n12 rejected : ReserveTime
r2i0n13 rejected : ReserveTime
r2i0n14 rejected : ReserveTime
r2i0n15 rejected : ReserveTime
r2i1n0 rejected : ReserveTime
r2i1n1 rejected : ReserveTime
r2i1n2 rejected : ReserveTime
r2i1n3 rejected : ReserveTime
r2i1n4 rejected : ReserveTime
r2i1n5 rejected : ReserveTime
r2i1n6 rejected : ReserveTime
r2i1n7 rejected : ReserveTime
r2i1n8 rejected : ReserveTime
r2i1n9 rejected : ReserveTime
r2i1n10 rejected : ReserveTime
r2i1n11 rejected : ReserveTime
r2i1n12 rejected : ReserveTime
r2i1n13 rejected : ReserveTime
r2i1n14 rejected : ReserveTime
r2i1n15 rejected : ReserveTime
r2i2n0 rejected : ReserveTime
r2i2n1 rejected : ReserveTime
r2i2n2 rejected : ReserveTime
r2i2n3 rejected : ReserveTime
r2i2n4 rejected : ReserveTime
r2i2n5 rejected : ReserveTime
r2i2n6 rejected : ReserveTime
r2i2n7 rejected : ReserveTime
r2i2n8 rejected : State
r2i2n9 rejected : State
r2i2n10 rejected : State
r2i2n11 rejected : State
r2i2n12 rejected : State
r2i2n13 rejected : State
r2i2n14 rejected : State
r2i2n15 rejected : State
r2i3n0 rejected : ReserveTime
r2i3n1 rejected : ReserveTime
r2i3n2 rejected : ReserveTime
r2i3n3 rejected : ReserveTime
r2i3n4 rejected : ReserveTime
r2i3n5 rejected : ReserveTime
r2i3n6 rejected : ReserveTime
r2i3n7 rejected : ReserveTime
r2i3n8 rejected : ReserveTime
r2i3n9 rejected : ReserveTime
r2i3n10 rejected : ReserveTime
r2i3n11 rejected : ReserveTime
r2i3n12 rejected : ReserveTime
r2i3n13 rejected : ReserveTime
r2i3n14 rejected : ReserveTime
r2i3n15 rejected : ReserveTime
service0:~ # checknode r2i3n0
checking node r2i3n0
State: Idle (in current state for 00:01:02)
Configured Resources: PROCS: 12 MEM: 23G SWAP: 23G DISK: 1M
Utilized Resources: [NONE]
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.000
Network: [DEFAULT]
Features: [NONE]
Attributes: [Batch]
Classes: [ginormous 12:12][debug 12:12][large 12:12][huge 12:12][medium 12:12][route 12:12][small 12:12][super 12:12][tiny 12:12]
Total Time: INFINITY Up: INFINITY (99.93%) Active: INFINITY (82.06%)
Reservations:
Job '83219'(x12) 2:03:58 -> 6:03:58 (4:00:00)
service0:~ # checknode r1i0n0
checking node r1i0n0
State: Idle (in current state for 00:01:33)
Configured Resources: PROCS: 12 MEM: 23G SWAP: 23G DISK: 1M
Utilized Resources: [NONE]
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.000
Network: [DEFAULT]
Features: [NONE]
Attributes: [Batch]
Classes: [ginormous 12:12][debug 12:12][large 12:12][huge 12:12][medium 12:12][route 12:12][small 12:12][super 12:12][tiny 12:12]
Total Time: INFINITY Up: INFINITY (99.79%) Active: 77:16:44:11 (19.63%)
Reservations:
User 'DAC.0'(x1) -00:09:50 -> 6:50:10 (7:00:00)
Blocked Resources@-00:09:50 Procs: 12/12 (100.00%)
service0:~ # diagnose -r
Diagnosing Reservations
ResID      Type Par   StartTime     EndTime  Duration Node Task Proc
-----      ---- ---   ---------     -------  -------- ---- ---- ----
DAC.0      User DEF   -00:10:21     6:49:39   7:00:00   96   96 1152
    Flags: PREEMPTEE
    ACL: RES==DAC.0= USER==lumpkin+:==lebeau+:==kboyles+:==bstewart+
    CL: RES==DAC.0
    Task Resources: PROCS: [ALL]
    Attributes (HostList='r1i[0-3]n[0-9]|r2i[0-1]n[0-9]')
    Active PH: 0.00/202.16 (0.00%)
83223      Job  DEF    -1:57:04     2:02:56   4:00:00    8   96   96
    ACL: JOB==83223=
    CL: JOB==83223 USER==stuart GROUP==eg3 CLASS==medium QOS==DEFAULT DURATION==4:00:00 PROC==96
debug.1.0  User DEF    20:49:39  1:05:49:39   9:00:00    8    8   96
    Flags: STANDINGRES SHARED
    ACL: RES==debug.1= CLASS==debug+
    CL: RES==debug.1
    Task Resources: PROCS: [ALL]
    Attributes (HostList='r2i3n8 r2i3n9 r2i3n10 r2i3n11 r2i3n12 r2i3n13 r2i3n14 r2i3n15')
    SRAttributes (TaskCount: 8 StartTime: 8:00:00 EndTime: 17:00:00 Days: Mon,Tue,Wed,Thu,Fri)
83219      Job  DEF     2:02:56     6:02:56   4:00:00   32  384  384
    Flags: PREEMPTEE
    ACL: JOB==83219=
    CL: JOB==83219 USER==aschwing GROUP==eg3 CLASS==huge QOS==DEFAULT DURATION==4:00:00 PROC==384
    Attributes (Priority=56)
Active Reserved Processors: 96
service0:~ #
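The per-node rejection list in the "checkjob -v 83224" output above is long, so a small script can help tally it. The following is a minimal Python sketch, assuming each line has the form "<node> rejected : <reason>" as pasted above; the function name tally_rejections and the four-line sample are hypothetical and only for illustration.

```python
import re
from collections import Counter


def tally_rejections(checkjob_output: str) -> Counter:
    """Count nodes per rejection reason in pasted `checkjob -v` text."""
    # Assumed line format: "<node> rejected : <reason>"
    pattern = re.compile(r"^\s*(\S+)\s+rejected\s*:\s*(\S+)\s*$", re.MULTILINE)
    return Counter(reason for _node, reason in pattern.findall(checkjob_output))


# Short excerpt of the node list from the checkjob output.
sample = """\
r2i2n7 rejected : ReserveTime
r2i2n8 rejected : State
r2i2n9 rejected : State
r2i3n0 rejected : ReserveTime
"""

counts = tally_rejections(sample)
for reason, n in sorted(counts.items()):
    print(f"{reason}: {n}")
```

Fed the full node list instead of the excerpt, the tally should agree with the summary line "Rejection Reasons: [State : 8][ReserveTime : 120]".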