Hello,
we are running Torque-2.3.6 and Maui-3.2.6p21 on our cluster. Sometimes
the following happens:
A job requesting several worker nodes (for example nodes=8:ppn=2) is
queued and starts, on average, only after about an hour, even though the
requested resources are available. To be exact, there are no fully free
worker nodes, but there are enough partially free worker nodes.
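To make clear what I mean by "enough partially free worker nodes", here is a minimal Python sketch of the check I have in mind. The node names and free-processor counts are hypothetical, not taken from our cluster, and the function is only my reading of the request nodes=8:ppn=2; how Maui actually matches nodes under JOBNODEMATCHPOLICY EXACTNODE may differ.

```python
def request_fits(free_procs, nodes, ppn):
    """Return True if at least `nodes` hosts each have `ppn` or more free processors."""
    usable = [name for name, free in free_procs.items() if free >= ppn]
    return len(usable) >= nodes


# Hypothetical snapshot (illustrative values only): no node is fully free,
# but most nodes still have two or more idle processors.
free_procs = {f"wn{i:02d}": 2 for i in range(1, 9)}
free_procs.update({"wn09": 1, "wn10": 3})

# The request from the example above: nodes=8:ppn=2
print(request_fits(free_procs, nodes=8, ppn=2))
```

By this simple count the request could be placed immediately, which is why the delay surprises me.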
The checkjob output gives no clear reason for the delay.
From checkjob:
checking job 8037
State: Idle
Creds: user:black group:users class:batch qos:DEFAULT
WallTime: 00:00:00 of 41:16:00:00
SubmitTime: Mon May 18 15:45:34
(Time Queued Total: 00:31:51 Eligible: 00:31:51)
StartDate: 00:00:01 Mon May 18 16:17:26
Total Tasks: 16
Req[0] TaskCount: 16 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [batch]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
Reservation '8037' (00:00:01 -> 41:16:00:01 Duration: 41:16:00:00)
PE: 16.00 StartPriority: 31
cannot select job 8037 for partition DEFAULT (startdate in '00:00:01')
From Maui configuration:
RMPOLLINTERVAL 00:00:30
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY CPULOAD
ENABLEMULTIREQJOBS TRUE
JOBNODEMATCHPOLICY EXACTNODE
From tracejob:
05/18/2009 15:45:34 S enqueuing into batch, state 1 hop 1
05/18/2009 *15:45:34* S Job Queued at request ...
05/18/2009 15:45:34 A queue=batch
05/18/2009 16:56:25 S Job Modified at request ...
05/18/2009 *16:56:25* S Job Run at request ...
05/18/2009 16:56:25 S Job Modified at request ...
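For reference, the actual wait of this job can be computed from the two highlighted tracejob timestamps. This is just a small Python sketch; the helper name wait_time is my own.

```python
from datetime import datetime

# Timestamp layout used in the tracejob lines above
FMT = "%m/%d/%Y %H:%M:%S"


def wait_time(queued, run):
    """Return the time between the 'Job Queued' and 'Job Run' events."""
    return datetime.strptime(run, FMT) - datetime.strptime(queued, FMT)


delay = wait_time("05/18/2009 15:45:34", "05/18/2009 16:56:25")
print(delay)  # 1:10:51
```

So the job waited 1 hour 10 minutes 51 seconds, although checkjob, run about 32 minutes after submission, reported a start time only one second ahead.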
Can anybody explain this strange behaviour?
What does the message "cannot select job 8037 for partition DEFAULT"
mean when no further reason is given?
Thank you.
Best regards
Jana Uhlirova