Oh fish! Now maui itself has died.

[root@norma ~]# checkjob -v 11692
ERROR:    cannot send request to server norma.iitb.ac.in:42559 (server may not be running)
ERROR:    cannot request service (status)
[root@norma ~]# ps -eaf|grep maui
root      6100  2837  0 22:16 pts/20   00:00:00 grep maui
[root@norma ~]# service maui start
Starting maui:                                             [  OK  ]
[root@norma ~]# ps -eaf|grep maui
root      6121  2837  0 22:17 pts/20   00:00:00 grep maui
[root@norma ~]# date
Thu May 19 22:17:33 IST 2011
[root@norma ~]# tail /opt/maui/log/maui.log
<snip>
05/19 22:17:07 INFO:     located resources for 8 tasks (460) in best partition DEFAULT for job 11545 at time 00:00:01

Notice the time in the maui log: maui starts, does its resource
calculations, and then dies without reporting what went wrong.
Please help. Where should I look to debug maui?

On Thu, May 19, 2011 at 6:18 PM, Vicker, Darby (JSC-EG311)
<[email protected]> wrote:
> You might try "checkjob -v".  For a job that won't run, this sometimes lists 
> detailed information for each node in the system, which can be very helpful 
> in determining why the job won't run.  See example output below.  But in your 
> case, it looks like the job did try to start:
>
>> StartDate: -00:18:25  Thu May 19 09:06:51
>
> I'm guessing your job tried to run but there was a problem with one of the 
> nodes.  Maui usually puts the job into a deferred state in this case and puts 
> a hold on the job.  The torque command 'tracejob' is very helpful in 
> determining this.  Run a tracejob on both the head node and the mother 
> superior node to get the full story on the job.  Our cluster had been in a 
> state before where 'pbsnodes' reported all the nodes as healthy but many jobs 
> would fail to start and go into a deferred state.  The tracejob on the mother 
> superior node would report something like 'send_sisters: sister #6 (r1i0n14) 
> is not ok (1099)'.  But when the suspect node was inspected everything seemed 
> fine.  Simply releasing the hold (using releasehold) would allow the job to 
> start successfully, often on the same set of nodes that failed the 
> first time.  Attempts to clear the problem by rebooting the affected nodes 
> did not help.  In my case, rebooting the entire cluster (including the 
> InfiniBand switches) did clear up these problems.
>
> Darby
>
> % checkjob -v 58885
>
>
> checking job 58885
>
> State: Idle
> Creds:  user:dvicker  group:eg3  class:huge  qos:DEFAULT
> WallTime: 00:00:00 of 4:00:00
> SubmitTime: Thu May 19 07:25:24
>  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)
>
> Total Tasks: 300
>
> Req[0]  TaskCount: 300  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Exec:  ''  ExecSize: 0  ImageSize: 0
> Dedicated Resources Per Task: PROCS: 1
> NodeAccess: SINGLEUSER
> TasksPerNode: 12  NodeCount: 25
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> Reservation '58885' (1:53:48 -> 5:53:48  Duration: 4:00:00)
> PE:  300.00  StartPriority:  1
> job cannot run in partition DEFAULT (idle procs do not meet requirements : 
> 252 of 300 procs found)
> idle procs: 348  feasible procs: 252
>
> Rejection Reasons: [State        :   83][ReserveTime  :    8]
>
> Detailed Node Availability Information:
>
> r1i0n0                   accepted : 12 tasks supported
> r1i0n1                   accepted : 12 tasks supported
> r1i0n2                   accepted : 12 tasks supported
> r1i0n3                   accepted : 12 tasks supported
> r1i0n4                   accepted : 12 tasks supported
> r1i0n5                   accepted : 12 tasks supported
> r1i0n6                   accepted : 12 tasks supported
> r1i0n7                   accepted : 12 tasks supported
> r1i0n8                   accepted : 12 tasks supported
> r1i0n9                   accepted : 12 tasks supported
> r1i0n10                  accepted : 12 tasks supported
> r1i0n11                  rejected : State
> r1i0n12                  rejected : State
> r1i0n13                  rejected : State
> r1i0n14                  rejected : State
> r1i0n15                  rejected : State
> r1i1n0                   rejected : State
> r1i1n1                   rejected : State
> r1i1n2                   rejected : State
> r1i1n3                   rejected : State
> r1i1n4                   rejected : State
> r1i1n5                   rejected : State
> r1i1n6                   rejected : State
> r1i1n7                   rejected : State
> r1i1n8                   rejected : State
> r1i1n9                   rejected : State
> r1i1n10                  rejected : State
> r1i1n11                  rejected : State
> r1i1n12                  accepted : 12 tasks supported
> r1i1n13                  accepted : 12 tasks supported
> r1i1n14                  rejected : State
> r1i1n15                  rejected : State
> r1i2n0                   rejected : State
> r1i2n1                   rejected : State
> r1i2n2                   rejected : State
> r1i2n3                   rejected : State
> r1i2n4                   rejected : State
> r1i2n5                   rejected : State
> r1i2n6                   rejected : State
> r1i2n7                   rejected : State
> r1i2n8                   rejected : State
> r1i2n9                   rejected : State
> r1i2n10                  rejected : State
> r1i2n11                  rejected : State
> r1i2n12                  rejected : State
> r1i2n13                  rejected : State
> r1i2n14                  rejected : State
> r1i2n15                  rejected : State
> r1i3n0                   rejected : State
> r1i3n1                   rejected : State
> r1i3n2                   rejected : State
> r1i3n3                   rejected : State
> r1i3n4                   rejected : State
> r1i3n5                   rejected : State
> r1i3n6                   rejected : State
> r1i3n7                   rejected : State
> r1i3n8                   rejected : State
> r1i3n9                   rejected : State
> r1i3n10                  rejected : State
> r1i3n11                  rejected : State
> r1i3n12                  rejected : State
> r1i3n13                  accepted : 12 tasks supported
> r1i3n14                  accepted : 12 tasks supported
> r1i3n15                  accepted : 12 tasks supported
> r2i0n0                   accepted : 12 tasks supported
> r2i0n1                   accepted : 12 tasks supported
> r2i0n2                   accepted : 12 tasks supported
> r2i0n3                   accepted : 12 tasks supported
> r2i0n4                   accepted : 12 tasks supported
> r2i0n5                   rejected : State
> r2i0n6                   rejected : State
> r2i0n7                   rejected : State
> r2i0n8                   rejected : State
> r2i0n9                   rejected : State
> r2i0n10                  rejected : State
> r2i0n11                  rejected : State
> r2i0n12                  rejected : State
> r2i0n13                  rejected : State
> r2i0n14                  rejected : State
> r2i0n15                  rejected : State
> r2i1n0                   rejected : State
> r2i1n1                   rejected : State
> r2i1n2                   rejected : State
> r2i1n3                   rejected : State
> r2i1n4                   rejected : State
> r2i1n5                   rejected : State
> r2i1n6                   rejected : State
> r2i1n7                   rejected : State
> r2i1n8                   rejected : State
> r2i1n9                   rejected : State
> r2i1n10                  rejected : State
> r2i1n11                  rejected : State
> r2i1n12                  rejected : State
> r2i1n13                  rejected : State
> r2i1n14                  rejected : State
> r2i1n15                  rejected : State
> r2i3n0                   rejected : State
> r2i3n1                   rejected : State
> r2i3n2                   rejected : State
> r2i3n3                   rejected : State
> r2i3n4                   rejected : State
> r2i3n5                   rejected : State
> r2i3n6                   rejected : State
> r2i3n7                   rejected : State
> r2i3n8                   rejected : ReserveTime
> r2i3n9                   rejected : ReserveTime
> r2i3n10                  rejected : ReserveTime
> r2i3n11                  rejected : ReserveTime
> r2i3n12                  rejected : ReserveTime
> r2i3n13                  rejected : ReserveTime
> r2i3n14                  rejected : ReserveTime
> r2i3n15                  rejected : ReserveTime
>
>
>
>
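
The node listing quoted above is long enough that it helps to tally it rather than read it by eye. In that listing, 21 nodes are accepted at 12 tasks each, which gives the 252 feasible procs that checkjob reports against the 300 requested, and the rejections split into 83 State and 8 ReserveTime, matching the "Rejection Reasons" line. Below is a minimal sketch of such a tally in Python. The function name tally_nodes and the regular expressions are assumptions made for illustration; they are not part of Maui, and they only handle lines shaped like the ones in this thread.

```python
import re
from collections import Counter

# Matches node lines such as
#   "r1i0n0    accepted : 12 tasks supported"
#   "r1i0n11   rejected : State"
# with or without leading ">" quote markers from an email.
NODE_LINE = re.compile(
    r"^(?:>\s*)*(?P<node>\S+)\s+(?P<verdict>accepted|rejected)"
    r"\s*:\s*(?P<detail>.+?)\s*$"
)
TASKS = re.compile(r"(\d+) tasks supported")


def tally_nodes(text):
    """Return (feasible_tasks, accepted_nodes, Counter of rejection reasons)."""
    feasible = 0
    accepted = 0
    reasons = Counter()
    for line in text.splitlines():
        m = NODE_LINE.match(line)
        if not m:
            continue  # skip headers, blank lines, and other checkjob output
        if m.group("verdict") == "accepted":
            accepted += 1
            t = TASKS.search(m.group("detail"))
            feasible += int(t.group(1)) if t else 0
        else:
            reasons[m.group("detail")] += 1
    return feasible, accepted, reasons


# Four lines copied from the listing above, as a small usage example.
sample = """\
> r1i0n10                  accepted : 12 tasks supported
> r1i0n11                  rejected : State
> r1i1n12                  accepted : 12 tasks supported
> r2i3n8                   rejected : ReserveTime
"""
feasible, accepted, reasons = tally_nodes(sample)
print(feasible, accepted, dict(reasons))
# prints: 24 2 {'State': 1, 'ReserveTime': 1}
```

Feeding it the saved output of "checkjob -v <jobid>" and comparing the feasible total with "Total Tasks" shows quickly whether the job is short on usable nodes or blocked for another reason, such as a hold.
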
> On May 18, 2011, at 11:00 PM, Sudarshan Wadkar wrote:
>
>> Dear All,
>> I am facing a small problem here. checkjob reports that jobs are
>> blocked because of no resources.
>>
>> # checkjob 11691
>>
>>
>> checking job 11691
>>
>> State: Idle  EState: Deferred
>> Creds:  user:trirag09  group:trirag09  class:default  qos:DEFAULT
>> WallTime: 00:00:00 of 1:06:00:00
>> SubmitTime: Wed May 18 19:08:23
>>  (Time Queued  Total: 14:16:53  Eligible: 00:00:00)
>>
>> StartDate: -00:18:25  Thu May 19 09:06:51
>> Total Tasks: 24
>>
>> Req[0]  TaskCount: 24  Partition: ALL
>> Network: [NONE]  Memory >= 800M  Disk >= 0  Swap >= 0
>> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>> Dedicated Resources Per Task: PROCS: 1  MEM: 800M
>>
>>
>> IWD: [NONE]  Executable:  [NONE]
>> Bypass: 0  StartCount: 0
>> PartitionMask: [ALL]
>> Flags:       RESTARTABLE
>>
>> job is deferred.  Reason:  NoResources  (cannot create reservation for
>> job '11691' (intital reservation attempt)
>> )
>> Holds:    Defer  (hold reason:  NoResources)
>> PE:  24.00  StartPriority:  18
>> cannot select job 11691 for partition DEFAULT (job hold active)
>>
>> But showbf shows idle resources.
>>
>> # showbf
>> backfill window (user: 'trirag09' group: 'trirag09' partition: ALL)
>> Thu May 19 09:27:20
>>
>> 247 procs available with no timelimit
>>
>> I checked the nodes with pbsnodes and they report as normal. Please help me.
>> I am kinda clueless as to why maui thinks that there are no resources
>> and puts the job into the deferred state.
>>
>> --
>> -Sudarshan Wadkar
>> Research Assistant & System Administrator
>> High Performance Computing Center
>> IIT Bombay, Powai, Mumbai 400 076
>>
>> "Success is getting what you want. Happiness is wanting what you get."
>> - Dale Carnegie
>> "It's always our decision who we are"
>> - Robert Solomon in Waking Life
>> "The Truth is The Truth, so all you can do is live with it."
>> - $udhi :)
>> _______________________________________________
>> mauiusers mailing list
>> [email protected]
>> http://www.supercluster.org/mailman/listinfo/mauiusers
>
>



-- 
-Sudarshan Wadkar
Research Assistant & System Administrator
High Performance Computing Center
IIT Bombay, Powai, Mumbai 400 076
