At two of our sites that use PBS routing queues, large numbers of jobs
end up in the 'W' state. As you can see below, each job has been assigned
to a compute node (and, from what I remember, it briefly enters state 'R'
on the node), but it then fails and gets stuck. Has anyone seen this
behaviour, or does anyone have any suggestions?
Stephen
[EMAIL PROTECTED] root]# rpm -q maui torque
maui-3.2.6p13-5_SL30X
torque-2.0.0p7-1.sl3.st
[EMAIL PROTECTED] root]# qstat -n 9270
gridgate.cp.dias.ie:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
9270.gridgate.cp.dia cosmo003 cosmo    STDIN          --     1  --     -- 24:00 W    --
   gridwn04
[EMAIL PROTECTED] root]# checkjob 9270
ERROR: 'checkjob' failed
ERROR: cannot locate job '9270'
[EMAIL PROTECTED] root]# grep 9270 /var/log/maui.log|tail
05/02 10:13:13 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:24 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:35 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:46 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:57 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:08 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:19 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:30 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:41 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:52 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
[EMAIL PROTECTED] root]#
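
To see how many jobs are affected, the Maui warnings can be tallied per
job rather than grepped one job at a time. This is a minimal sketch, not
a Maui tool: the regular expression is inferred only from the warning
lines quoted above, and the name count_unexpected_states is made up for
illustration.

```python
import re
from collections import Counter

# Sample lines copied from the maui.log excerpt (one warning per line).
SAMPLE = """\
05/02 10:13:13 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:24 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:35 WARNING: job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
"""

# Assumption: every warning of interest has exactly this wording.
PATTERN = re.compile(
    r"WARNING:\s+job '([^']+)' detected with unexpected state '(\d+)'"
)


def count_unexpected_states(lines):
    """Return a Counter keyed by (job_id, state) for each matching warning."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    counts = count_unexpected_states(SAMPLE.splitlines())
    for (job_id, state), n in counts.items():
        print(f"{job_id}: state {state} reported {n} time(s)")
```

To run it against the real log, pass an open file object for
/var/log/maui.log in place of SAMPLE.splitlines().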
The PBS server log for the same job shows:

05/02/2006 09:54:54;0040;PBS_Server;Svr;gridgate.cp.dias.ie;Scheduler sent command new
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type StatusNode request received from [EMAIL PROTECTED], sock=9
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type StatusQueue request received from [EMAIL PROTECTED], sock=9
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type StatusJob request received from [EMAIL PROTECTED], sock=9
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type ModifyJob request received from [EMAIL PROTECTED], sock=9
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;Job Modified at request of [EMAIL PROTECTED]
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type RunJob request received from [EMAIL PROTECTED], sock=9
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;Job Run at request of [EMAIL PROTECTED]
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type ModifyJob request received from [EMAIL PROTECTED], sock=9
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;Job Modified at request of [EMAIL PROTECTED]
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;MOM rejected modify request, error: 15001
05/02/2006 09:54:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=ModifyJob, from [EMAIL PROTECTED]
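
The sequence above (RunJob, then a ModifyJob that the MOM rejects with
15001) can be picked out of a larger server log with a short script. This
is only a sketch: the six-field, semicolon-separated layout is inferred
from the excerpt rather than taken from the TORQUE documentation, and
job_events and rejected_jobs are hypothetical names.

```python
from collections import defaultdict

# Sample "Job" and "Req" records copied from the server log excerpt.
SAMPLE = """\
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type RunJob request received from [EMAIL PROTECTED], sock=9
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;Job Run at request of [EMAIL PROTECTED]
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;Job Modified at request of [EMAIL PROTECTED]
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;MOM rejected modify request, error: 15001
"""


def job_events(lines):
    """Group (timestamp, message) pairs by job ID, keeping only Job records."""
    events = defaultdict(list)
    for line in lines:
        # Assumed layout: timestamp;code;server;type;object;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6 or fields[3] != "Job":
            continue
        timestamp, _code, _server, _type, job_id, message = fields
        events[job_id].append((timestamp, message))
    return events


def rejected_jobs(events):
    """Return sorted job IDs with at least one 'MOM rejected' message."""
    return sorted(
        job_id
        for job_id, entries in events.items()
        if any("MOM rejected" in message for _, message in entries)
    )


if __name__ == "__main__":
    for job_id in rejected_jobs(job_events(SAMPLE.splitlines())):
        print(job_id)
```

Comparing the job IDs it prints with the jobs shown in state 'W' by qstat
would indicate whether every stuck job went through the same rejection.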
--
Dr. Stephen Childs,
Research Fellow, EGEE Project, phone: +353-1-6081797
Computer Architecture Group, email: Stephen.Childs @ cs.tcd.ie
Trinity College Dublin, Ireland web: http://www.cs.tcd.ie/Stephen.Childs