If you are using maui as well, what does 'showq' give you? In particular, you should see some jobs in either the 'eligible' or the 'blocked' state. Run 'checkjob' on each of those jobs and you may see why they aren't starting.
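With more than a handful of queued jobs, it can help to generate the 'checkjob' commands rather than type them. Below is a minimal Python sketch that does so by parsing qstat's default output. The sample text is copied from the qstat listing in the original report (first three jobs only). The helper names `queued_job_ids` and `checkjob_commands` are made up for illustration, and the sketch assumes Maui refers to each job by the numeric part of the Torque job ID, which is worth confirming against your own 'showq' output.

```python
import re

# Sample qstat output taken from the report in this thread (first three jobs).
SAMPLE_QSTAT = """\
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
48267.srvbatchhead01      FS93130003_000   eob_merge              0 Q batch
48268.srvbatchhead01      FS93130003_001   eob_merge              0 Q batch
48269.srvbatchhead01      FS93130003_002   eob_merge              0 Q batch
"""


def queued_job_ids(qstat_text):
    """Return the numeric IDs of jobs whose state column is 'Q' (queued)."""
    ids = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Data rows have six fields: id, name, user, time used, state, queue.
        # The header row splits into more fields and the dashed row has '-' as state.
        if len(fields) == 6 and fields[4] == "Q":
            match = re.match(r"(\d+)\.", fields[0])
            if match:
                # Assumption: Maui uses only the numeric part of the Torque ID.
                ids.append(match.group(1))
    return ids


def checkjob_commands(job_ids):
    """Build one 'checkjob <id>' command line per job ID (hypothetical helper)."""
    return ["checkjob " + job_id for job_id in job_ids]


if __name__ == "__main__":
    for command in checkjob_commands(queued_job_ids(SAMPLE_QSTAT)):
        print(command)
```

On a live system you would feed the function the real output of 'qstat' instead of the sample string, then run the printed commands on the host where maui runs.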
If they are in the blocked state, it could be that they were unable to start for too long, in which case maui will block them. Check your DEFERTIME and DEFERCOUNT settings to have it try longer before blocking jobs.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

From: [email protected] [mailto:[email protected]] On Behalf Of Jack Wilkinson
Sent: Tuesday, September 03, 2013 1:34 PM
To: [email protected]; [email protected]
Subject: [torqueusers] cluster hangs

I'm not sure if this is a maui or a torque issue, so I'm being slightly rude and sending this to both lists.

We're running maui 3.3-4 and torque 2.5.7-9 on CentOS 6.3. Most of the time there's no problem: we qsub a set of submit files, they run, and we get our output. Every now and then, though, they submit and get held, so that qstat shows something like this:

[eob_merge@srvBatchHead01 ~]$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
48267.srvbatchhead01      FS93130003_000   eob_merge              0 Q batch
48268.srvbatchhead01      FS93130003_001   eob_merge              0 Q batch
48269.srvbatchhead01      FS93130003_002   eob_merge              0 Q batch
48270.srvbatchhead01      FS93130003_003   eob_merge              0 Q batch
48271.srvbatchhead01      FS93130003_004   eob_merge              0 Q batch
48272.srvbatchhead01      FS93130006_000   eob_merge              0 Q batch
48273.srvbatchhead01      FS93130006_001   eob_merge              0 Q batch
48274.srvbatchhead01      FS93130006_002   eob_merge              0 Q batch
48275.srvbatchhead01      FS93130006_003   eob_merge              0 Q batch
48276.srvbatchhead01      FS93130006_004   eob_merge              0 Q batch

and they'll sit there forever like this. We restart all of the associated services (maui, pbs_server, pbs_mom and munge), yet it doesn't help. Finally we just reboot all of the boxes in the cluster (fortunately, it's a small number) and everything comes back up and runs. I've proposed a weekly reboot of everything, but have been told that this can only be a stopgap measure and cannot be the final solution.

Does anyone have any clues?

Kind regards,

Jack Wilkinson
Services | VPay(r)
P: 972.367-6622
www.stoneeagle.com
www.vpayusa.com
111 W. Spring Valley Rd., #100
Richardson, TX 75081
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
