I am still having problems at one site: jobs get sent to a compute node
to run, but the mom then seems to lose track of them and starts
rejecting requests from the scheduler. Any idea what I should be
checking? The logs give no clue as to _why_ the requests are being
refused.
This is what's in the mom_log:
05/23/2006 10:05:32;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=gridmon.cp.dias.ie MSG=modify job failed, unknown job 10332.gridgate.cp.dias.ie), aux=0, type=ModifyJob, from [EMAIL PROTECTED]
05/23/2006 11:29:50;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=gridmon.cp.dias.ie MSG=modify job failed, unknown job 10332.gridgate.cp.dias.ie), aux=0, type=ModifyJob, from [EMAIL PROTECTED]
and the output of tracejob:
Job: 10332.gridgate.cp.dias.ie
05/23/2006 09:34:50 S enqueuing into test, state 1 hop 1
05/23/2006 09:34:50 S Job Queued at request of [EMAIL PROTECTED], owner = [EMAIL PROTECTED], job name = STDIN, queue = test
05/23/2006 09:34:50 A queue=test
05/23/2006 10:05:32 S Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:05:32 S Job Run at request of [EMAIL PROTECTED]
05/23/2006 10:05:32 S Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:05:32 S MOM rejected modify request, error: 15001
05/23/2006 10:11:52 S enqueuing into test, state 1 hop 1
05/23/2006 10:11:52 S Requeueing job, substate: 37 Requeued in queue: test
05/23/2006 10:12:19 S Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:12:19 S Job Run at request of [EMAIL PROTECTED]
05/23/2006 10:12:19 S Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:12:19 S MOM rejected modify request, error: 15001
05/23/2006 10:42:26 S Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:42:26 S Job Run at request of [EMAIL PROTECTED]
05/23/2006 10:42:26 S Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:42:26 S MOM rejected modify request, error: 15001
05/23/2006 11:12:33 S Job Modified at request of [EMAIL PROTECTED]
05/23/2006 11:12:33 S Job Run at request of [EMAIL PROTECTED]
05/23/2006 11:12:33 S Job Modified at request of [EMAIL PROTECTED]
05/23/2006 11:12:33 S MOM rejected modify request, error: 15001
05/23/2006 11:24:26 S enqueuing into test, state 1 hop 1
05/23/2006 11:24:26 S Requeueing job, substate: 37 Requeued in queue: test
05/23/2006 11:29:50 S Job Modified at request of [EMAIL PROTECTED]
05/23/2006 11:29:50 S Job Run at request of [EMAIL PROTECTED]
05/23/2006 11:29:50 S Job Modified at request of [EMAIL PROTECTED]
05/23/2006 11:29:50 S MOM rejected modify request, error: 15001
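
To make the repeating pattern easier to see, here is a minimal Python
sketch that counts run attempts, MOM rejections and requeues in tracejob
output. It assumes every record starts with "MM/DD/YYYY HH:MM:SS", a
one-letter source code and then the message, as in the trace above. The
names summarise and SAMPLE are made up for illustration and are not part
of TORQUE or Maui.

```python
import re
from collections import Counter

# Assumed record layout: "MM/DD/YYYY HH:MM:SS <source letter> <message>"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+([A-Z])\s+(.*)$")


def summarise(trace_text):
    """Return (event counts, timestamps of MOM rejections) for tracejob text."""
    counts = Counter()
    reject_times = []
    for line in trace_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip the "Job:" header and wrapped continuation lines
        stamp, _source, message = match.groups()
        if message.startswith("Job Run"):
            counts["run"] += 1
        elif message.startswith("MOM rejected modify request"):
            counts["rejected"] += 1
            reject_times.append(stamp)
        elif message.startswith("Requeueing job"):
            counts["requeued"] += 1
    return counts, reject_times


# Short excerpt from the trace above, used only as sample input.
SAMPLE = """\
Job: 10332.gridgate.cp.dias.ie
05/23/2006 10:05:32 S Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:05:32 S Job Run at request of [EMAIL PROTECTED]
05/23/2006 10:05:32 S Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:05:32 S MOM rejected modify request, error: 15001
05/23/2006 10:11:52 S enqueuing into test, state 1 hop 1
05/23/2006 10:11:52 S Requeueing job, substate: 37 Requeued in queue: test
"""

if __name__ == "__main__":
    counts, reject_times = summarise(SAMPLE)
    print(dict(counts))
    print(reject_times)
```

Counting the full trace above by hand gives five run attempts, five MOM
rejections and two requeues, so every attempt to run the job is followed
by a rejected modify request.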
--
Dr. Stephen Childs,
Research Fellow, EGEE Project, phone: +353-1-6081797
Computer Architecture Group, email: Stephen.Childs @ cs.tcd.ie
Trinity College Dublin, Ireland web: http://www.cs.tcd.ie/Stephen.Childs