Environment
-----------
maui-3.2.6p13
torque-1.1.0p6
linux cluster
I had this working, or so I thought. But after a pbs-server reboot and
a maui reboot, jobs submitted to the queue just defer.
I have two quad-processor nodes, and I want only jobs that specify
#PBS -q quad to be able to access and run on them.
-------------------------------------------------
In Torque:
create queue quad
set queue quad queue_type = Execution
set queue quad acl_hosts = node076+node077
set queue quad resources_max.nodect = 2
set queue quad enabled = True
set queue quad started = True
---------------------------------------------------
In maui:
SRCFG[quad] HOSTLIST=node076,node077
SRCFG[quad] FLAGS=BYNAME
SRCFG[quad] PERIOD=INFINITY
SRCFG[quad] CLASSLIST=quad
CLASSCFG[quad] PRIORITY=0
CLASSCFG[quad] FLAGS=ADVRES:quad.0.0
------------------------------------------------------
diagnose -r
quad.0.0 User DEF -00:13:26 INFINITY INFINITY 2 2 8
Flags: STANDINGRES BYNAME
ACL: RES==quad.0= CLASS==quad+
CL: RES==quad.0
Task Resources: PROCS: [ALL]
Attributes (HostList='node076 node077')
Active PH: 0.00/1.79 (0.00%)
SRAttributes (TaskCount: 0 StartTime: 00:00:00 EndTime: 1:00:00:00 Days: ALL)
---------------------------------------------------------------------------
So the reservation is there and appears active. But when I run
"checknode node077", the Reservations section shows something that
doesn't seem correct.
----------------------------------------------------------------------------
checking node node077
State: Idle (in current state for 00:13:26)
Configured Resources: PROCS: 4 MEM: 15G SWAP: 16G DISK: 1M
Utilized Resources: [NONE]
Dedicated Resources: [NONE]
Opsys: DEFAULT Arch: linux
Speed: 1.00 Load: 0.000
Network: [DEFAULT]
Features: [quad]
Attributes: [Batch]
Classes: [short 4:4][long 4:4][verylong 4:4][quad 4:4][default 4:4][single 4:4]
Total Time: INFINITY Up: INFINITY (81.54%) Active: INFINITY (37.11%)
Reservations:
User 'quad.0.0'(x1) -00:13:26 -> INFINITY ( INFINITY)
Blocked [EMAIL PROTECTED]:13:26 Procs: 4/4 (100.00%)
------------------------------------------------------------------------
It is that "Blocked" resources line that looks wrong to me.
When I submit a job specifying the quad queue, it is immediately
placed into a deferred state in the blocked list.
--------------------------------------------------------------------------
checking job 24640
State: Idle EState: Deferred
Creds: user:bill group:bill class:quad qos:DEFAULT
WallTime: 00:00:00 of 1:12:00:00
SubmitTime: Fri Mar 10 09:32:35
(Time Queued Total: 3:08:51 Eligible: 00:05:27)
Total Tasks: 4
Req[0] TaskCount: 4 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
job is deferred. Reason: NoResources (cannot create reservation for job '24640' (intital reservation attempt))
Holds: Defer (hold reason: NoResources)
PE: 4.00 StartPriority: 3087
cannot select job 24640 for partition DEFAULT (job hold active)
-----------------------------------------------------------------------
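For reference, the wall-time limit in the checkjob output (1:12:00:00) can be turned into seconds for comparison with the numbers Maui prints in its logs. This is only a minimal Python sketch: it assumes the value uses the [[[DD:]HH:]MM:]SS layout, and the helper name walltime_to_seconds is made up for illustration, not part of Maui or Torque.

```python
def walltime_to_seconds(spec: str) -> int:
    """Convert a [[[DD:]HH:]MM:]SS string to a number of seconds.

    Assumption: fields are colon-separated and right-aligned, so "05:00"
    means five minutes and "1:12:00:00" means one day and twelve hours.
    """
    parts = [int(p) for p in spec.split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"unexpected wall-time format: {spec!r}")
    # Right-align the fields against (days, hours, minutes, seconds).
    multipliers = [86400, 3600, 60, 1][-len(parts):]
    return sum(value * mult for value, mult in zip(parts, multipliers))


print(walltime_to_seconds("1:12:00:00"))  # 129600
```

The result, 129600, is the same number that shows up as the third argument of MResCheckJAccess and as the W: value in the log excerpt, so those appear to be the job's requested wall time in seconds.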
And the Maui logs show:
03/10 12:46:06 INFO: node node077 can provide resources for job 24640:0
03/10 12:46:06 MLocalJobCheckNRes(24640,node077,2140000000)
03/10 12:46:06 INFO: 8 feasible tasks found for job 24640:0 in partition DEFAULT (4 Needed)
03/10 12:46:06 MJobGetSNRange(24640,0,node076,([EMAIL PROTECTED]:00:00),256,Affinity,Type,ARange,BRes)
03/10 12:46:06 INFO: attempting to get resources for 24640 4 * (P: 1 M: 0 S: 0 D: 0)
03/10 12:46:06 MResCheckJAccess(24612,24640,129600,Same,Affinity)
03/10 12:46:06 MResCheckJAccess(quad.0.0,24640,129600,Same,Affinity)
03/10 12:46:06 MResCheckJAccess(24612,24640,129600,Same,Affinity)
03/10 12:46:06 MResCheckJAccess(quad.0.0,24640,129600,Same,Affinity)
03/10 12:46:06 INFO: ARange[0] too short for job 24640 (MR: 1 < W: 129600): removing range
03/10 12:46:06 INFO: node node076 unavailable for job 24640 at 00:00:00
03/10 12:46:06 INFO: no reservation time found for job 24640 on node node076 at 00:00:00
03/10 12:46:06 MJobGetSNRange(24640,0,node077,([EMAIL PROTECTED]:00:00),256,Affinity,Type,ARange,BRes)
03/10 12:46:06 INFO: attempting to get resources for 24640 4 * (P: 1 M: 0 S: 0 D: 0)
03/10 12:46:06 MResCheckJAccess(quad.0.0,24640,129600,Same,Affinity)
03/10 12:46:06 MResCheckJAccess(quad.0.0,24640,129600,Same,Affinity)
03/10 12:46:06 INFO: ARange[0] too short for job 24640 (MR: 1 < W: 129600): removing range
03/10 12:46:06 INFO: node node077 unavailable for job 24640 at 00:00:00
03/10 12:46:06 INFO: no reservation time found for job 24640 on node node077 at 00:00:00
03/10 12:46:06 MJobSelectFRL(24640,G,1,RCount)
03/10 12:46:06 ALERT: job 24640 cannot run in any partition
03/10 12:46:06 ALERT: cannot create new reservation for job 24640 (shape[1] 4)
03/10 12:46:06 ALERT: cannot create new reservation for job 24640
03/10 12:46:06 MJobSetHold(24640,16,00:05:00,NoResources,cannot create reservation for job '24640' (intital reservation attempt)
03/10 12:46:06 ALERT: job '24640' cannot run (deferring job for 300 seconds)
----------------------------------------------------------------------------
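The lines that seem to decide the outcome are the "ARange[0] too short" messages, one per node. To pull them out of a larger log, something like the following Python sketch might help. The regular expression is inferred only from the excerpt above (it is not a documented Maui log format), and the function name short_ranges is made up for illustration. The log does not say what MR stands for, so the sketch just reports both numbers as printed.

```python
import re

# Pattern inferred from the excerpt above; \s+ and \s* tolerate the line
# wrapping that a mail client may introduce in pasted log lines.
ARANGE_RE = re.compile(
    r"ARange\[(?P<index>\d+)\]\s+too\s+short\s+for\s+job\s+(?P<job>\S+)\s+"
    r"\(MR:\s*(?P<mr>\d+)\s*<\s*W:\s*(?P<w>\d+)\)"
)


def short_ranges(log_text):
    """Return (job, MR, W) tuples for every 'too short' message found."""
    return [
        (m["job"], int(m["mr"]), int(m["w"]))
        for m in ARANGE_RE.finditer(log_text)
    ]


sample = (
    "03/10 12:46:06 INFO: ARange[0] too short for job 24640 "
    "(MR: 1 < W: 129600): removing range\n"
)
for job, mr, w in short_ranges(sample):
    print(f"job {job}: MR={mr}, W={w}")
```

For both node076 and node077 the reported values are MR: 1 against W: 129600, so each node's only candidate range is rejected before a reservation can be created for the job.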
I must be missing something here, but I've reread the documentation and
found nothing. I'm not sure how to debug this further. Can anyone give me
a clue as to what might be missing?
Thanks,
Bill