At two of our sites that use PBS routing queues, large numbers of jobs end up in the 'W' state. As the output below shows, each job is assigned to a compute node (and, from what I remember, briefly enters state 'R' on the node), but it then fails and gets stuck. Has anyone seen this behaviour or got any suggestions?

Stephen


[EMAIL PROTECTED] root]# rpm -q maui torque
maui-3.2.6p13-5_SL30X
torque-2.0.0p7-1.sl3.st



[EMAIL PROTECTED] root]# qstat -n 9270

gridgate.cp.dias.ie:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
9270.gridgate.cp.dia cosmo003 cosmo    STDIN          --     1  --     -- 24:00 W    --
   gridwn04

[EMAIL PROTECTED] root]# checkjob 9270
ERROR:    'checkjob' failed
ERROR:  cannot locate job '9270'

[EMAIL PROTECTED] root]# grep 9270 /var/log/maui.log|tail
05/02 10:13:13 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:24 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:35 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:46 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:57 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:08 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:19 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:30 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:41 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:52 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
[EMAIL PROTECTED] root]#
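
To gauge how many jobs are affected, something like the following rough Python sketch could tally the warnings per job. It assumes the warning format shown in the maui.log excerpt above; the function name count_unexpected_states is hypothetical and not part of Maui or TORQUE, and the sample text is just two of the lines quoted above.

```python
import re
from collections import Counter

# Pattern based on the maui.log warnings quoted above (an assumption about
# the format; adjust it if your Maui version logs differently).
WARNING_RE = re.compile(
    r"WARNING: job '([^']+)' detected with unexpected state '(\d+)'"
)


def count_unexpected_states(log_text):
    """Return a Counter mapping (job_id, state) to the number of warnings."""
    # findall yields one (job_id, state) tuple per warning, even when
    # several warnings share a single physical line.
    return Counter(WARNING_RE.findall(log_text))


# Small demonstration using two of the warnings from the log excerpt.
sample = (
    "05/02 10:13:13 WARNING: job '9270.gridgate.cp.dias.ie' "
    "detected with unexpected state '11'\n"
    "05/02 10:13:24 WARNING: job '9270.gridgate.cp.dias.ie' "
    "detected with unexpected state '11'\n"
)
for (job_id, state), n in count_unexpected_states(sample).most_common():
    print(f"{job_id}\tstate={state}\twarnings={n}")
```

To run it against a real log, read the contents of /var/log/maui.log into a string and pass that to count_unexpected_states in place of the sample.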


05/02/2006 09:54:54;0040;PBS_Server;Svr;gridgate.cp.dias.ie;Scheduler sent command new
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type StatusNode request received from [EMAIL PROTECTED], sock=9
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type StatusQueue request received from [EMAIL PROTECTED], sock=9
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type StatusJob request received from [EMAIL PROTECTED], sock=9
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type ModifyJob request received from [EMAIL PROTECTED], sock=9
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;Job Modified at request of [EMAIL PROTECTED]
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type RunJob request received from [EMAIL PROTECTED], sock=9
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;Job Run at request of [EMAIL PROTECTED]
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type ModifyJob request received from [EMAIL PROTECTED], sock=9
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;Job Modified at request of [EMAIL PROTECTED]
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;MOM rejected modify request, error: 15001
05/02/2006 09:54:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=ModifyJob, from [EMAIL PROTECTED]



--
Dr. Stephen Childs,
Research Fellow, EGEE Project,    phone:                    +353-1-6081797
Computer Architecture Group,      email:        Stephen.Childs @ cs.tcd.ie
Trinity College Dublin, Ireland   web: http://www.cs.tcd.ie/Stephen.Childs
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
