Oh fish! Now maui has gone dead.

[root@norma ~]# checkjob -v 11692
ERROR: cannot send request to server norma.iitb.ac.in:42559 (server may not be running)
ERROR: cannot request service (status)
[root@norma ~]# ps -eaf|grep maui
root 6100 2837 0 22:16 pts/20 00:00:00 grep maui
[root@norma ~]# service maui start
Starting maui: [ OK ]
[root@norma ~]# ps -eaf|grep maui
root 6121 2837 0 22:17 pts/20 00:00:00 grep maui
[root@norma ~]# date
Thu May 19 22:17:33 IST 2011
[root@norma ~]# tail /opt/maui/log/maui.log
<snip>
05/19 22:17:07 INFO: located resources for 8 tasks (460) in best partition DEFAULT for job 11545 at time 00:00:01

Notice the time in the maui log: maui starts, does its resource calculations, and then dies without saying what went wrong. The second ps shows only the grep itself, so the daemon is already gone seconds after "service maui start" reported OK. Please help. Where should I look to debug maui?

On Thu, May 19, 2011 at 6:18 PM, Vicker, Darby (JSC-EG311) <[email protected]> wrote:
> You might try "checkjob -v". For a job that won't run, this sometimes lists
> detailed information for each node in the system, which can be very helpful
> in determining why the job won't run. See example output below. But in your
> case, it looks like the job did try to start:
>
>> StartDate: -00:18:25 Thu May 19 09:06:51
>
> I'm guessing your job tried to run but there was a problem with one of the
> nodes. Maui usually puts the job into a deferred state in this case and puts
> a hold on the job. The torque command 'tracejob' is very helpful in
> determining this. Run a tracejob on both the head node and the mother
> superior node to get the full story on the job. Our cluster had been in a
> state before where 'pbsnodes' reported all the nodes as healthy but many jobs
> would fail to start and go into a deferred state. The tracejob on the mother
> superior node would report something like 'send_sisters: sister #6 (r1i0n14)
> is not ok (1099)'. But when the suspect node was inspected everything seemed
> fine. Simply releasing the hold (using releasehold) would allow the job to
> start successfully, oftentimes on the same set of nodes that failed the
> first time. Attempts to clear the problem by rebooting the affected nodes
> did not help. In my case, rebooting the entire cluster (including the
> infiniband switches) did clear up these problems.
>
> Darby
>
> % checkjob -v 58885
>
>
> checking job 58885
>
> State: Idle
> Creds: user:dvicker group:eg3 class:huge qos:DEFAULT
> WallTime: 00:00:00 of 4:00:00
> SubmitTime: Thu May 19 07:25:24
> (Time Queued Total: 00:00:01 Eligible: 00:00:01)
>
> Total Tasks: 300
>
> Req[0] TaskCount: 300 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> Exec: '' ExecSize: 0 ImageSize: 0
> Dedicated Resources Per Task: PROCS: 1
> NodeAccess: SINGLEUSER
> TasksPerNode: 12 NodeCount: 25
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 0
> PartitionMask: [ALL]
> Reservation '58885' (1:53:48 -> 5:53:48 Duration: 4:00:00)
> PE: 300.00 StartPriority: 1
> job cannot run in partition DEFAULT (idle procs do not meet requirements :
> 252 of 300 procs found)
> idle procs: 348 feasible procs: 252
>
> Rejection Reasons: [State : 83][ReserveTime : 8]
>
> Detailed Node Availability Information:
>
> r1i0n0 accepted : 12 tasks supported
> r1i0n1 accepted : 12 tasks supported
> r1i0n2 accepted : 12 tasks supported
> r1i0n3 accepted : 12 tasks supported
> r1i0n4 accepted : 12 tasks supported
> r1i0n5 accepted : 12 tasks supported
> r1i0n6 accepted : 12 tasks supported
> r1i0n7 accepted : 12 tasks supported
> r1i0n8 accepted : 12 tasks supported
> r1i0n9 accepted : 12 tasks supported
> r1i0n10 accepted : 12 tasks supported
> r1i0n11 rejected : State
> r1i0n12 rejected : State
> r1i0n13 rejected : State
> r1i0n14 rejected : State
> r1i0n15 rejected : State
> r1i1n0 rejected : State
> r1i1n1 rejected : State
> r1i1n2 rejected : State
> r1i1n3 rejected : State
> r1i1n4 rejected : State
> r1i1n5 rejected : State
> r1i1n6 rejected : State
> r1i1n7 rejected : State
> r1i1n8 rejected : State
> r1i1n9 rejected : State
> r1i1n10 rejected : State
> r1i1n11 rejected : State
> r1i1n12 accepted : 12 tasks supported
> r1i1n13 accepted : 12 tasks supported
> r1i1n14 rejected : State
> r1i1n15 rejected : State
> r1i2n0 rejected : State
> r1i2n1 rejected : State
> r1i2n2 rejected : State
> r1i2n3 rejected : State
> r1i2n4 rejected : State
> r1i2n5 rejected : State
> r1i2n6 rejected : State
> r1i2n7 rejected : State
> r1i2n8 rejected : State
> r1i2n9 rejected : State
> r1i2n10 rejected : State
> r1i2n11 rejected : State
> r1i2n12 rejected : State
> r1i2n13 rejected : State
> r1i2n14 rejected : State
> r1i2n15 rejected : State
> r1i3n0 rejected : State
> r1i3n1 rejected : State
> r1i3n2 rejected : State
> r1i3n3 rejected : State
> r1i3n4 rejected : State
> r1i3n5 rejected : State
> r1i3n6 rejected : State
> r1i3n7 rejected : State
> r1i3n8 rejected : State
> r1i3n9 rejected : State
> r1i3n10 rejected : State
> r1i3n11 rejected : State
> r1i3n12 rejected : State
> r1i3n13 accepted : 12 tasks supported
> r1i3n14 accepted : 12 tasks supported
> r1i3n15 accepted : 12 tasks supported
> r2i0n0 accepted : 12 tasks supported
> r2i0n1 accepted : 12 tasks supported
> r2i0n2 accepted : 12 tasks supported
> r2i0n3 accepted : 12 tasks supported
> r2i0n4 accepted : 12 tasks supported
> r2i0n5 rejected : State
> r2i0n6 rejected : State
> r2i0n7 rejected : State
> r2i0n8 rejected : State
> r2i0n9 rejected : State
> r2i0n10 rejected : State
> r2i0n11 rejected : State
> r2i0n12 rejected : State
> r2i0n13 rejected : State
> r2i0n14 rejected : State
> r2i0n15 rejected : State
> r2i1n0 rejected : State
> r2i1n1 rejected : State
> r2i1n2 rejected : State
> r2i1n3 rejected : State
> r2i1n4 rejected : State
> r2i1n5 rejected : State
> r2i1n6 rejected : State
> r2i1n7 rejected : State
> r2i1n8 rejected : State
> r2i1n9 rejected : State
> r2i1n10 rejected : State
> r2i1n11 rejected : State
> r2i1n12 rejected : State
> r2i1n13 rejected : State
> r2i1n14 rejected : State
> r2i1n15 rejected : State
> r2i3n0 rejected : State
> r2i3n1 rejected : State
> r2i3n2 rejected : State
> r2i3n3 rejected : State
> r2i3n4 rejected : State
> r2i3n5 rejected : State
> r2i3n6 rejected : State
> r2i3n7 rejected : State
> r2i3n8 rejected : ReserveTime
> r2i3n9 rejected : ReserveTime
> r2i3n10 rejected : ReserveTime
> r2i3n11 rejected : ReserveTime
> r2i3n12 rejected : ReserveTime
> r2i3n13 rejected : ReserveTime
> r2i3n14 rejected : ReserveTime
> r2i3n15 rejected : ReserveTime
>
>
> On May 18, 2011, at 11:00 PM, Sudarshan Wadkar wrote:
>
>> Dear All,
>> I am facing a small problem here. checkjob reports that jobs are
>> blocked because of no resources.
>>
>> # checkjob 11691
>>
>>
>> checking job 11691
>>
>> State: Idle EState: Deferred
>> Creds: user:trirag09 group:trirag09 class:default qos:DEFAULT
>> WallTime: 00:00:00 of 1:06:00:00
>> SubmitTime: Wed May 18 19:08:23
>> (Time Queued Total: 14:16:53 Eligible: 00:00:00)
>>
>> StartDate: -00:18:25 Thu May 19 09:06:51
>> Total Tasks: 24
>>
>> Req[0] TaskCount: 24 Partition: ALL
>> Network: [NONE] Memory >= 800M Disk >= 0 Swap >= 0
>> Opsys: [NONE] Arch: [NONE] Features: [NONE]
>> Dedicated Resources Per Task: PROCS: 1 MEM: 800M
>>
>>
>> IWD: [NONE] Executable: [NONE]
>> Bypass: 0 StartCount: 0
>> PartitionMask: [ALL]
>> Flags: RESTARTABLE
>>
>> job is deferred. Reason: NoResources (cannot create reservation for
>> job '11691' (intital reservation attempt)
>> )
>> Holds: Defer (hold reason: NoResources)
>> PE: 24.00 StartPriority: 18
>> cannot select job 11691 for partition DEFAULT (job hold active)
>>
>> But showbf shows idle resources.
>>
>> #showbf
>> backfill window (user: 'trirag09' group: 'trirag09' partition: ALL)
>> Thu May 19 09:27:20
>>
>> 247 procs available with no timelimit
>>
>> I checked nodes with pbsnodes and they report normal. Please help me.
>> I am kinda clueless as to why maui thinks that there are no resources
>> and puts the job on deferred mode.
>>
>> --
>> -Sudarshan Wadkar
>> Research Assistant & System Administrator
>> High Performance Computing Center
>> IIT Bombay, Powai, Mumbai 400 076
>>
>> "Success is getting what you want. Happiness is wanting what you get."
>> - Dale Carnegie
>> "It's always our decision who we are"
>> - Robert Solomon in Waking Life
>> "The Truth is The Truth, so all you can do is live with it."
>> - $udhi :)
>> _______________________________________________
>> mauiusers mailing list
>> [email protected]
>> http://www.supercluster.org/mailman/listinfo/mauiusers
>
>

--
-Sudarshan Wadkar
Research Assistant & System Administrator
High Performance Computing Center
IIT Bombay, Powai, Mumbai 400 076
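
P.S. for anyone reading this thread in the archive: the "Detailed Node Availability Information" list that checkjob -v prints gets long on a big cluster, so here is a rough Python sketch that tallies it. It assumes each node line looks like the ones quoted above ("<node> accepted : N tasks supported" or "<node> rejected : <reason>", optionally prefixed by mail quote markers). The function name summarize_node_availability and the SAMPLE text are my own illustration and are not part of Maui or TORQUE.

```python
import re
from collections import Counter

# Matches lines such as "r1i0n0 accepted : 12 tasks supported" or
# "r1i0n11 rejected : State", with optional leading "> " quote markers.
NODE_LINE = re.compile(
    r"^[>\s]*(?P<node>\S+)\s+(?P<verdict>accepted|rejected)\s*:\s*(?P<detail>.+?)\s*$"
)


def summarize_node_availability(text):
    """Return (accepted_tasks, rejection_counts) for checkjob -v style text.

    Assumption: the line layout is the one shown in this thread; other
    Maui versions may print it differently.
    """
    accepted_tasks = 0
    rejections = Counter()
    for line in text.splitlines():
        match = NODE_LINE.match(line)
        if not match:
            continue  # headers, blank lines, job summary, etc.
        detail = match.group("detail")
        if match.group("verdict") == "accepted":
            # detail looks like "12 tasks supported"; take the leading count
            first = detail.split()[0]
            if first.isdigit():
                accepted_tasks += int(first)
        else:
            # detail is the rejection reason, e.g. "State" or "ReserveTime"
            rejections[detail] += 1
    return accepted_tasks, rejections


# Hypothetical excerpt in the same layout as the output quoted above.
SAMPLE = """\
r1i0n10 accepted : 12 tasks supported
r1i0n11 rejected : State
r1i0n12 rejected : State
r2i3n8 rejected : ReserveTime
"""

if __name__ == "__main__":
    tasks, reasons = summarize_node_availability(SAMPLE)
    print("feasible tasks:", tasks)
    for reason, count in reasons.most_common():
        print(reason, count)
```

Fed the full node list from Darby's example, the tallies should agree with the summary lines checkjob itself printed ("252 of 300 procs found" and "Rejection Reasons: [State : 83][ReserveTime : 8]"), which makes a quick sanity check when the list is too long to read by eye.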
