You might try "checkjob -v".  For a job that won't run, this sometimes lists 
detailed information for each node in the system, which can be very helpful in 
determining why the job won't run.  See example output below.  But in your 
case, it looks like the job did try to start:

> StartDate: -00:18:25  Thu May 19 09:06:51

I'm guessing your job tried to run but there was a problem with one of the 
nodes.  In this case Maui usually defers the job and places a hold on it.  The 
Torque command 'tracejob' is very helpful in confirming this.  Run tracejob on 
both the head node and the mother superior node to get the full story on the 
job.

Our cluster was once in a state where 'pbsnodes' reported all the nodes as 
healthy, yet many jobs would fail to start and go into a deferred state.  The 
tracejob on the mother superior node would report something like 
'send_sisters: sister #6 (r1i0n14) is not ok (1099)', but when the suspect 
node was inspected everything seemed fine.  Simply releasing the hold (using 
releasehold) would allow the job to start successfully, often on the same set 
of nodes that failed the first time.  Attempts to clear the problem by 
rebooting the affected nodes did not help.  In my case, rebooting the entire 
cluster (including the InfiniBand switches) did clear up these problems.  
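
If you end up with a lot of tracejob output to wade through, a small script 
can pull out the suspect sister nodes.  This is only a rough sketch: the 
pattern is based solely on the one message quoted above, the function name is 
made up, and the exact wording may differ between Torque versions.

```python
import re

# Pattern based on the single message quoted in this email; other wording
# that a MOM log might use is not covered (assumption).
SISTER_ERROR = re.compile(
    r"send_sisters: sister #(\d+) \((\S+?)\) is not ok \((\d+)\)"
)


def suspect_nodes(tracejob_text):
    """Return (sister_index, node_name, error_code) tuples found in the text."""
    return [
        (int(index), node, int(code))
        for index, node, code in SISTER_ERROR.findall(tracejob_text)
    ]


# Sample input: just the message fragment from the email, not a full log line.
sample = "send_sisters: sister #6 (r1i0n14) is not ok (1099)\n"
hits = suspect_nodes(sample)
print(hits)
```

Feeding it the saved tracejob output from the mother superior node should 
give a short list of nodes to look at first.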

Darby

% checkjob -v 58885


checking job 58885 

State: Idle
Creds:  user:dvicker  group:eg3  class:huge  qos:DEFAULT
WallTime: 00:00:00 of 4:00:00
SubmitTime: Thu May 19 07:25:24
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)

Total Tasks: 300

Req[0]  TaskCount: 300  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SINGLEUSER
TasksPerNode: 12  NodeCount: 25


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Reservation '58885' (1:53:48 -> 5:53:48  Duration: 4:00:00)
PE:  300.00  StartPriority:  1
job cannot run in partition DEFAULT (idle procs do not meet requirements : 252 
of 300 procs found)
idle procs: 348  feasible procs: 252

Rejection Reasons: [State        :   83][ReserveTime  :    8]

Detailed Node Availability Information:

r1i0n0                   accepted : 12 tasks supported
r1i0n1                   accepted : 12 tasks supported
r1i0n2                   accepted : 12 tasks supported
r1i0n3                   accepted : 12 tasks supported
r1i0n4                   accepted : 12 tasks supported
r1i0n5                   accepted : 12 tasks supported
r1i0n6                   accepted : 12 tasks supported
r1i0n7                   accepted : 12 tasks supported
r1i0n8                   accepted : 12 tasks supported
r1i0n9                   accepted : 12 tasks supported
r1i0n10                  accepted : 12 tasks supported
r1i0n11                  rejected : State
r1i0n12                  rejected : State
r1i0n13                  rejected : State
r1i0n14                  rejected : State
r1i0n15                  rejected : State
r1i1n0                   rejected : State
r1i1n1                   rejected : State
r1i1n2                   rejected : State
r1i1n3                   rejected : State
r1i1n4                   rejected : State
r1i1n5                   rejected : State
r1i1n6                   rejected : State
r1i1n7                   rejected : State
r1i1n8                   rejected : State
r1i1n9                   rejected : State
r1i1n10                  rejected : State
r1i1n11                  rejected : State
r1i1n12                  accepted : 12 tasks supported
r1i1n13                  accepted : 12 tasks supported
r1i1n14                  rejected : State
r1i1n15                  rejected : State
r1i2n0                   rejected : State
r1i2n1                   rejected : State
r1i2n2                   rejected : State
r1i2n3                   rejected : State
r1i2n4                   rejected : State
r1i2n5                   rejected : State
r1i2n6                   rejected : State
r1i2n7                   rejected : State
r1i2n8                   rejected : State
r1i2n9                   rejected : State
r1i2n10                  rejected : State
r1i2n11                  rejected : State
r1i2n12                  rejected : State
r1i2n13                  rejected : State
r1i2n14                  rejected : State
r1i2n15                  rejected : State
r1i3n0                   rejected : State
r1i3n1                   rejected : State
r1i3n2                   rejected : State
r1i3n3                   rejected : State
r1i3n4                   rejected : State
r1i3n5                   rejected : State
r1i3n6                   rejected : State
r1i3n7                   rejected : State
r1i3n8                   rejected : State
r1i3n9                   rejected : State
r1i3n10                  rejected : State
r1i3n11                  rejected : State
r1i3n12                  rejected : State
r1i3n13                  accepted : 12 tasks supported
r1i3n14                  accepted : 12 tasks supported
r1i3n15                  accepted : 12 tasks supported
r2i0n0                   accepted : 12 tasks supported
r2i0n1                   accepted : 12 tasks supported
r2i0n2                   accepted : 12 tasks supported
r2i0n3                   accepted : 12 tasks supported
r2i0n4                   accepted : 12 tasks supported
r2i0n5                   rejected : State
r2i0n6                   rejected : State
r2i0n7                   rejected : State
r2i0n8                   rejected : State
r2i0n9                   rejected : State
r2i0n10                  rejected : State
r2i0n11                  rejected : State
r2i0n12                  rejected : State
r2i0n13                  rejected : State
r2i0n14                  rejected : State
r2i0n15                  rejected : State
r2i1n0                   rejected : State
r2i1n1                   rejected : State
r2i1n2                   rejected : State
r2i1n3                   rejected : State
r2i1n4                   rejected : State
r2i1n5                   rejected : State
r2i1n6                   rejected : State
r2i1n7                   rejected : State
r2i1n8                   rejected : State
r2i1n9                   rejected : State
r2i1n10                  rejected : State
r2i1n11                  rejected : State
r2i1n12                  rejected : State
r2i1n13                  rejected : State
r2i1n14                  rejected : State
r2i1n15                  rejected : State
r2i3n0                   rejected : State
r2i3n1                   rejected : State
r2i3n2                   rejected : State
r2i3n3                   rejected : State
r2i3n4                   rejected : State
r2i3n5                   rejected : State
r2i3n6                   rejected : State
r2i3n7                   rejected : State
r2i3n8                   rejected : ReserveTime
r2i3n9                   rejected : ReserveTime
r2i3n10                  rejected : ReserveTime
r2i3n11                  rejected : ReserveTime
r2i3n12                  rejected : ReserveTime
r2i3n13                  rejected : ReserveTime
r2i3n14                  rejected : ReserveTime
r2i3n15                  rejected : ReserveTime
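
The per-node list above gets long on a big cluster, so it can help to tally 
it.  The sketch below assumes the line layout shown in this output 
("node  accepted : N tasks supported" / "node  rejected : Reason"); the 
function name is my own, and other Maui versions may format these lines 
differently.

```python
import re
from collections import Counter

# Matches lines such as "r1i0n0   accepted : 12 tasks supported"
# and "r1i0n11  rejected : State" (layout assumed from the output above).
NODE_LINE = re.compile(r"^(\S+)\s+(accepted|rejected)\s*:\s*(.+?)\s*$")


def summarize_nodes(checkjob_text):
    """Return (feasible_tasks, rejection_counts) from 'checkjob -v' output."""
    feasible_tasks = 0
    rejections = Counter()
    for line in checkjob_text.splitlines():
        match = NODE_LINE.match(line)
        if not match:
            continue  # skip headers and anything that is not a node line
        _node, verdict, detail = match.groups()
        if verdict == "accepted":
            # detail looks like "12 tasks supported"
            feasible_tasks += int(detail.split()[0])
        else:
            rejections[detail] += 1
    return feasible_tasks, rejections


# A few lines copied from the output above, as a small sample.
sample = """\
r1i0n10                  accepted : 12 tasks supported
r1i0n11                  rejected : State
r1i1n12                  accepted : 12 tasks supported
r2i3n7                   rejected : State
r2i3n8                   rejected : ReserveTime
"""

tasks, reasons = summarize_nodes(sample)
print(tasks, dict(reasons))
```

Run against the full listing, the totals should line up with the "feasible 
procs" and "Rejection Reasons" lines that checkjob prints itself, which is a 
quick sanity check that the parse is picking up every node.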




On May 18, 2011, at 11:00 PM, Sudarshan Wadkar wrote:

> Dear All,
> I am facing a small problem here. checkjob reports that jobs are
> blocked because of no resources.
> 
> # checkjob 11691
> 
> 
> checking job 11691
> 
> State: Idle  EState: Deferred
> Creds:  user:trirag09  group:trirag09  class:default  qos:DEFAULT
> WallTime: 00:00:00 of 1:06:00:00
> SubmitTime: Wed May 18 19:08:23
>  (Time Queued  Total: 14:16:53  Eligible: 00:00:00)
> 
> StartDate: -00:18:25  Thu May 19 09:06:51
> Total Tasks: 24
> 
> Req[0]  TaskCount: 24  Partition: ALL
> Network: [NONE]  Memory >= 800M  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1  MEM: 800M
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
> 
> job is deferred.  Reason:  NoResources  (cannot create reservation for
> job '11691' (intital reservation attempt)
> )
> Holds:    Defer  (hold reason:  NoResources)
> PE:  24.00  StartPriority:  18
> cannot select job 11691 for partition DEFAULT (job hold active)
> 
> But showbf shows idle resources.
> 
> #showbf
> backfill window (user: 'trirag09' group: 'trirag09' partition: ALL)
> Thu May 19 09:27:20
> 
> 247 procs available with no timelimit
> 
> I checked nodes with pbsnodes and they report normal. Please help me.
> I am kinda clueless as to why maui thinks that there are no resources
> and puts the job on deferred mode.
> 
> -- 
> -Sudarshan Wadkar
> Research Assistant & System Administrator
> High Performance Computing Center
> IIT Bombay, Powai, Mumbai 400 076
> 
> "Success is getting what you want. Happiness is wanting what you get."
> - Dale Carnegie
> "It's always our decision who we are"
> - Robert Solomon in Waking Life
> "The Truth is The Truth, so all you can do is live with it."
> - $udhi :)
> _______________________________________________
> mauiusers mailing list
> [email protected]
> http://www.supercluster.org/mailman/listinfo/mauiusers
