You might try "checkjob -v". For a job that won't run, this sometimes lists detailed information for each node in the system, which can be very helpful in determining why the job won't run. See example output below. But in your case, it looks like the job did try to start:
> StartDate: -00:18:25 Thu May 19 09:06:51

I'm guessing your job tried to run but there was a problem with one of the nodes. Maui usually defers the job in this case and puts a hold on it. The Torque command 'tracejob' is very helpful in confirming this. Run tracejob on both the head node and the mother superior node to get the full story on the job.

Our cluster was once in a state where 'pbsnodes' reported all the nodes as healthy, but many jobs would fail to start and go into a deferred state. The tracejob on the mother superior node would report something like 'send_sisters: sister #6 (r1i0n14) is not ok (1099)', but when the suspect node was inspected everything seemed fine. Simply releasing the hold (using releasehold) would allow the job to start successfully, often on the same set of nodes that failed the first time. Rebooting the affected nodes did not clear the problem. In my case, rebooting the entire cluster (including the InfiniBand switches) did.
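If the tracejob output is long, a small filter can pull out the sister failures for you. This is only a minimal sketch: the regular expression assumes the message looks like the 'send_sisters' line quoted above, which I am reproducing from memory, and the function name failed_sisters is just something made up for this example.

```python
import re
import sys

# Assumed message shape, based only on the example quoted above:
#   send_sisters: sister #6 (r1i0n14) is not ok (1099)
SISTER_NOT_OK = re.compile(r"sister #(\d+) \(([^)]+)\) is not ok \((\d+)\)")


def failed_sisters(trace_text):
    """Return (sister_index, node_name, code) for each 'is not ok' message."""
    return [
        (int(index), node, int(code))
        for index, node, code in SISTER_NOT_OK.findall(trace_text)
    ]


if __name__ == "__main__":
    # Reads tracejob output from standard input and lists the suspect nodes.
    for index, node, code in failed_sisters(sys.stdin.read()):
        print(f"sister #{index} on {node} reported not ok ({code})")
```

Saved under a hypothetical name such as find_sisters.py, it could be run on the mother superior node as 'tracejob 11691 | python find_sisters.py'. If the same node keeps showing up across several deferred jobs, that is the one to look at first.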
Darby

% checkjob -v 58885

checking job 58885

State: Idle
Creds: user:dvicker group:eg3 class:huge qos:DEFAULT
WallTime: 00:00:00 of 4:00:00
SubmitTime: Thu May 19 07:25:24
(Time Queued Total: 00:00:01 Eligible: 00:00:01)

Total Tasks: 300

Req[0] TaskCount: 300 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SINGLEUSER
TasksPerNode: 12 NodeCount: 25

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Reservation '58885' (1:53:48 -> 5:53:48 Duration: 4:00:00)
PE: 300.00 StartPriority: 1
job cannot run in partition DEFAULT (idle procs do not meet requirements : 252 of 300 procs found)
idle procs: 348 feasible procs: 252

Rejection Reasons: [State : 83][ReserveTime : 8]

Detailed Node Availability Information:

r1i0n0 accepted : 12 tasks supported
r1i0n1 accepted : 12 tasks supported
r1i0n2 accepted : 12 tasks supported
r1i0n3 accepted : 12 tasks supported
r1i0n4 accepted : 12 tasks supported
r1i0n5 accepted : 12 tasks supported
r1i0n6 accepted : 12 tasks supported
r1i0n7 accepted : 12 tasks supported
r1i0n8 accepted : 12 tasks supported
r1i0n9 accepted : 12 tasks supported
r1i0n10 accepted : 12 tasks supported
r1i0n11 rejected : State
r1i0n12 rejected : State
r1i0n13 rejected : State
r1i0n14 rejected : State
r1i0n15 rejected : State
r1i1n0 rejected : State
r1i1n1 rejected : State
r1i1n2 rejected : State
r1i1n3 rejected : State
r1i1n4 rejected : State
r1i1n5 rejected : State
r1i1n6 rejected : State
r1i1n7 rejected : State
r1i1n8 rejected : State
r1i1n9 rejected : State
r1i1n10 rejected : State
r1i1n11 rejected : State
r1i1n12 accepted : 12 tasks supported
r1i1n13 accepted : 12 tasks supported
r1i1n14 rejected : State
r1i1n15 rejected : State
r1i2n0 rejected : State
r1i2n1 rejected : State
r1i2n2 rejected : State
r1i2n3 rejected : State
r1i2n4 rejected : State
r1i2n5 rejected : State
r1i2n6 rejected : State
r1i2n7 rejected : State
r1i2n8 rejected : State
r1i2n9 rejected : State
r1i2n10 rejected : State
r1i2n11 rejected : State
r1i2n12 rejected : State
r1i2n13 rejected : State
r1i2n14 rejected : State
r1i2n15 rejected : State
r1i3n0 rejected : State
r1i3n1 rejected : State
r1i3n2 rejected : State
r1i3n3 rejected : State
r1i3n4 rejected : State
r1i3n5 rejected : State
r1i3n6 rejected : State
r1i3n7 rejected : State
r1i3n8 rejected : State
r1i3n9 rejected : State
r1i3n10 rejected : State
r1i3n11 rejected : State
r1i3n12 rejected : State
r1i3n13 accepted : 12 tasks supported
r1i3n14 accepted : 12 tasks supported
r1i3n15 accepted : 12 tasks supported
r2i0n0 accepted : 12 tasks supported
r2i0n1 accepted : 12 tasks supported
r2i0n2 accepted : 12 tasks supported
r2i0n3 accepted : 12 tasks supported
r2i0n4 accepted : 12 tasks supported
r2i0n5 rejected : State
r2i0n6 rejected : State
r2i0n7 rejected : State
r2i0n8 rejected : State
r2i0n9 rejected : State
r2i0n10 rejected : State
r2i0n11 rejected : State
r2i0n12 rejected : State
r2i0n13 rejected : State
r2i0n14 rejected : State
r2i0n15 rejected : State
r2i1n0 rejected : State
r2i1n1 rejected : State
r2i1n2 rejected : State
r2i1n3 rejected : State
r2i1n4 rejected : State
r2i1n5 rejected : State
r2i1n6 rejected : State
r2i1n7 rejected : State
r2i1n8 rejected : State
r2i1n9 rejected : State
r2i1n10 rejected : State
r2i1n11 rejected : State
r2i1n12 rejected : State
r2i1n13 rejected : State
r2i1n14 rejected : State
r2i1n15 rejected : State
r2i3n0 rejected : State
r2i3n1 rejected : State
r2i3n2 rejected : State
r2i3n3 rejected : State
r2i3n4 rejected : State
r2i3n5 rejected : State
r2i3n6 rejected : State
r2i3n7 rejected : State
r2i3n8 rejected : ReserveTime
r2i3n9 rejected : ReserveTime
r2i3n10 rejected : ReserveTime
r2i3n11 rejected : ReserveTime
r2i3n12 rejected : ReserveTime
r2i3n13 rejected : ReserveTime
r2i3n14 rejected : ReserveTime
r2i3n15 rejected : ReserveTime

On May 18, 2011, at 11:00 PM, Sudarshan Wadkar wrote:

> Dear All,
> I am facing a small problem here. checkjob reports that jobs are
> blocked because of no resources.
>
> # checkjob 11691
>
>
> checking job 11691
>
> State: Idle EState: Deferred
> Creds: user:trirag09 group:trirag09 class:default qos:DEFAULT
> WallTime: 00:00:00 of 1:06:00:00
> SubmitTime: Wed May 18 19:08:23
> (Time Queued Total: 14:16:53 Eligible: 00:00:00)
>
> StartDate: -00:18:25 Thu May 19 09:06:51
> Total Tasks: 24
>
> Req[0] TaskCount: 24 Partition: ALL
> Network: [NONE] Memory >= 800M Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1 MEM: 800M
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 0
> PartitionMask: [ALL]
> Flags: RESTARTABLE
>
> job is deferred. Reason: NoResources (cannot create reservation for
> job '11691' (intital reservation attempt)
> )
> Holds: Defer (hold reason: NoResources)
> PE: 24.00 StartPriority: 18
> cannot select job 11691 for partition DEFAULT (job hold active)
>
> But showbf shows idle resources.
>
> #showbf
> backfill window (user: 'trirag09' group: 'trirag09' partition: ALL)
> Thu May 19 09:27:20
>
> 247 procs available with no timelimit
>
> I checked nodes with pbsnodes and they report normal. Please help me.
> I am kinda clueless as to why maui thinks that there are no resources
> and puts the job on deferred mode.
>
> --
> -Sudarshan Wadkar
> Research Assistant & System Administrator
> High Performance Computing Center
> IIT Bombay, Powai, Mumbai 400 076
>
> "Success is getting what you want. Happiness is wanting what you get."
> - Dale Carnegie
> "It's always our decision who we are"
> - Robert Solomon in Waking Life
> "The Truth is The Truth, so all you can do is live with it."
> - $udhi :)
> _______________________________________________
> mauiusers mailing list
> [email protected]
> http://www.supercluster.org/mailman/listinfo/mauiusers
