Environment
-----------
maui-3.2.6p13
torque-1.1.0p6
linux cluster

I had this working, or so I thought. After a pbs-server reboot and a maui reboot, however, jobs just defer.

I have two quad-processor nodes, and I want only jobs that specify #PBS -q quad to be able to access them and run on them.

-------------------------------------------------
In Torque:
create queue quad
set queue quad queue_type = Execution
set queue quad acl_hosts = node076+node077
set queue quad resources_max.nodect = 2
set queue quad enabled = True
set queue quad started = True
---------------------------------------------------
In maui:
SRCFG[quad] HOSTLIST=node076,node077
SRCFG[quad] FLAGS=BYNAME
SRCFG[quad] PERIOD=INFINITY
SRCFG[quad] CLASSLIST=quad

CLASSCFG[quad]          PRIORITY=0
CLASSCFG[quad]          FLAGS=ADVRES:quad.0.0
------------------------------------------------------

diagnose -r

quad.0.0 User DEF -00:13:26 INFINITY INFINITY 2 2 8
    Flags: STANDINGRES BYNAME
    ACL: RES==quad.0= CLASS==quad+
    CL:  RES==quad.0
    Task Resources: PROCS: [ALL]
    Attributes (HostList='node076 node077')
    Active PH: 0.00/1.79 (0.00%)
SRAttributes (TaskCount: 0 StartTime: 00:00:00 EndTime: 1:00:00:00 Days: ALL)
---------------------------------------------------------------------------

So the reservation is there and appears active. But when I run "checknode node077", the Reservations section shows something that doesn't seem correct.

----------------------------------------------------------------------------
checking node node077

State:      Idle  (in current state for 00:13:26)
Configured Resources: PROCS: 4  MEM: 15G  SWAP: 16G  DISK: 1M
Utilized   Resources: [NONE]
Dedicated  Resources: [NONE]
Opsys:       DEFAULT  Arch:       linux
Speed:      1.00  Load:       0.000
Network:    [DEFAULT]
Features:   [quad]
Attributes: [Batch]
Classes: [short 4:4][long 4:4][verylong 4:4][quad 4:4][default 4:4][single 4:4]

Total Time:   INFINITY  Up:   INFINITY (81.54%)  Active:   INFINITY (37.11%)

Reservations:
  User 'quad.0.0'(x1)  -00:13:26 ->   INFINITY (  INFINITY)
    Blocked [EMAIL PROTECTED]:13:26   Procs: 4/4 (100.00%)
------------------------------------------------------------------------
It is that "Blocked" resources line that looks wrong to me.
When I submit a job specifying the quad queue, it is immediately placed into a deferred state in the blocked list.

--------------------------------------------------------------------------
checking job 24640

State: Idle  EState: Deferred
Creds:  user:bill  group:bill  class:quad  qos:DEFAULT
WallTime: 00:00:00 of 1:12:00:00
SubmitTime: Fri Mar 10 09:32:35
  (Time Queued  Total: 3:08:51  Eligible: 00:05:27)

Total Tasks: 4

Req[0]  TaskCount: 4  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

job is deferred. Reason: NoResources (cannot create reservation for job '24640' (intital reservation attempt)
)
Holds:    Defer  (hold reason:  NoResources)
PE:  4.00  StartPriority:  3087
cannot select job 24640 for partition DEFAULT (job hold active)
-----------------------------------------------------------------------

And the Maui logs show:

03/10 12:46:06 INFO:     node node077 can provide resources for job 24640:0
03/10 12:46:06 MLocalJobCheckNRes(24640,node077,2140000000)
03/10 12:46:06 INFO: 8 feasible tasks found for job 24640:0 in partition DEFAULT (4 Needed)
03/10 12:46:06 MJobGetSNRange(24640,0,node076,([EMAIL PROTECTED]:00:00),256,Affinity,Type,ARange,BRes)
03/10 12:46:06 INFO: attempting to get resources for 24640 4 * (P: 1 M: 0 S: 0 D: 0)
03/10 12:46:06 MResCheckJAccess(24612,24640,129600,Same,Affinity)
03/10 12:46:06 MResCheckJAccess(quad.0.0,24640,129600,Same,Affinity)
03/10 12:46:06 MResCheckJAccess(24612,24640,129600,Same,Affinity)
03/10 12:46:06 MResCheckJAccess(quad.0.0,24640,129600,Same,Affinity)
03/10 12:46:06 INFO: ARange[0] too short for job 24640 (MR: 1 < W: 129600): removing range
03/10 12:46:06 INFO:     node node076 unavailable for job 24640 at 00:00:00
03/10 12:46:06 INFO: no reservation time found for job 24640 on node node076 at 00:00:00
03/10 12:46:06 MJobGetSNRange(24640,0,node077,([EMAIL PROTECTED]:00:00),256,Affinity,Type,ARange,BRes)
03/10 12:46:06 INFO: attempting to get resources for 24640 4 * (P: 1 M: 0 S: 0 D: 0)
03/10 12:46:06 MResCheckJAccess(quad.0.0,24640,129600,Same,Affinity)
03/10 12:46:06 MResCheckJAccess(quad.0.0,24640,129600,Same,Affinity)
03/10 12:46:06 INFO: ARange[0] too short for job 24640 (MR: 1 < W: 129600): removing range
03/10 12:46:06 INFO:     node node077 unavailable for job 24640 at 00:00:00
03/10 12:46:06 INFO: no reservation time found for job 24640 on node node077 at 00:00:00
03/10 12:46:06 MJobSelectFRL(24640,G,1,RCount)
03/10 12:46:06 ALERT:    job 24640 cannot run in any partition
03/10 12:46:06 ALERT: cannot create new reservation for job 24640 (shape[1] 4)
03/10 12:46:06 ALERT:    cannot create new reservation for job 24640
03/10 12:46:06 MJobSetHold(24640,16,00:05:00,NoResources,cannot create reservation for job '24640' (intital reservation attempt)
03/10 12:46:06 ALERT: job '24640' cannot run (deferring job for 300 seconds)
----------------------------------------------------------------------------
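
As a sanity check on the log above, the W: 129600 value in the "ARange[0] too short" lines matches the job's wallclock limit of 1:12:00:00 from checkjob once that limit is converted to seconds. The short Python sketch below does the conversion. The function name is made up for illustration, and reading MR as the longest available range in seconds is an assumption, not something the log states.

```python
def maui_time_to_seconds(text):
    """Convert a [[DD:]HH:]MM:SS time string to a number of seconds.

    Hypothetical helper for illustration; it is not part of Maui or Torque.
    """
    parts = [int(p) for p in text.split(":")]
    # Pad on the left so there are always four fields:
    # days, hours, minutes, seconds.
    parts = [0] * (4 - len(parts)) + parts
    days, hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Wallclock limit reported by checkjob for job 24640.
walltime_limit = maui_time_to_seconds("1:12:00:00")
print(walltime_limit)

# Hold duration passed to MJobSetHold in the log.
defer_time = maui_time_to_seconds("00:05:00")
print(defer_time)
```

The second value lines up with the "deferring job for 300 seconds" message. If the assumption about MR is right, the scheduler found only a 1-second window on each node where the job needs 129600 seconds.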

I must be missing something here but I've reread the documentation and find nothing. I'm not sure how to further debug. Can anyone provide me with a further clue as to what might be missing?

Thanks,
Bill
