I am still having problems at one site: jobs appear to be sent to a compute node to run, but the MOM then seems to lose track of them and starts rejecting requests from the scheduler. Any idea what I should be checking? The logs give no clue as to _why_ the requests are being refused.

This is what's in the mom_log:

05/23/2006 10:05:32;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=gridmon.cp.dias.ie MSG=modify job failed, unknown job 10332.gridgate.cp.dias.ie), aux=0, type=ModifyJob, from [EMAIL PROTECTED]
05/23/2006 11:29:50;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=gridmon.cp.dias.ie MSG=modify job failed, unknown job 10332.gridgate.cp.dias.ie), aux=0, type=ModifyJob, from [EMAIL PROTECTED]

and the output of tracejob:


Job: 10332.gridgate.cp.dias.ie

05/23/2006 09:34:50  S    enqueuing into test, state 1 hop 1
05/23/2006 09:34:50  S    Job Queued at request of [EMAIL PROTECTED],
                          owner = [EMAIL PROTECTED], job name = STDIN,
                          queue = test
05/23/2006 09:34:50  A    queue=test
05/23/2006 10:05:32  S    Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:05:32  S    Job Run at request of [EMAIL PROTECTED]
05/23/2006 10:05:32  S    Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:05:32  S    MOM rejected modify request, error: 15001
05/23/2006 10:11:52  S    enqueuing into test, state 1 hop 1
05/23/2006 10:11:52  S    Requeueing job, substate: 37 Requeued in queue: test
05/23/2006 10:12:19  S    Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:12:19  S    Job Run at request of [EMAIL PROTECTED]
05/23/2006 10:12:19  S    Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:12:19  S    MOM rejected modify request, error: 15001
05/23/2006 10:42:26  S    Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:42:26  S    Job Run at request of [EMAIL PROTECTED]
05/23/2006 10:42:26  S    Job Modified at request of [EMAIL PROTECTED]
05/23/2006 10:42:26  S    MOM rejected modify request, error: 15001
05/23/2006 11:12:33  S    Job Modified at request of [EMAIL PROTECTED]
05/23/2006 11:12:33  S    Job Run at request of [EMAIL PROTECTED]
05/23/2006 11:12:33  S    Job Modified at request of [EMAIL PROTECTED]
05/23/2006 11:12:33  S    MOM rejected modify request, error: 15001
05/23/2006 11:24:26  S    enqueuing into test, state 1 hop 1
05/23/2006 11:24:26  S    Requeueing job, substate: 37 Requeued in queue: test
05/23/2006 11:29:50  S    Job Modified at request of [EMAIL PROTECTED]
05/23/2006 11:29:50  S    Job Run at request of [EMAIL PROTECTED]
05/23/2006 11:29:50  S    Job Modified at request of [EMAIL PROTECTED]
05/23/2006 11:29:50  S    MOM rejected modify request, error: 15001
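
In the trace above, every "Job Run" is followed within the same second by a 15001 rejection from the MOM. To summarise a longer trace in the same way, a minimal Python sketch along these lines might help. It assumes the tracejob output has been saved as text in the layout shown above; the function name summarize_trace and the sample string are hypothetical, not part of any PBS or Maui tool.

```python
import re
from collections import Counter

# Matches lines such as:
# "05/23/2006 10:05:32  S    Job Run at request of ..."
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+(\S)\s+(.*)$")
REJECT_RE = re.compile(r"MOM rejected modify request, error: (\d+)")


def summarize_trace(text):
    """Return (run attempts, requeues, Counter of MOM reject error codes)."""
    runs = 0
    requeues = 0
    rejects = Counter()
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip headers, blank lines and continuation lines
        msg = m.group(3)
        if msg.startswith("Job Run"):
            runs += 1
        elif msg.startswith("Requeueing job"):
            requeues += 1
        else:
            rej = REJECT_RE.search(msg)
            if rej:
                rejects[int(rej.group(1))] += 1
    return runs, requeues, rejects


# A short excerpt in the same layout as the trace above (illustrative only).
sample = """\
05/23/2006 10:05:32  S    Job Run at request of [EMAIL PROTECTED]
05/23/2006 10:05:32  S    MOM rejected modify request, error: 15001
05/23/2006 10:11:52  S    Requeueing job, substate: 37 Requeued in queue: test
05/23/2006 10:12:19  S    Job Run at request of [EMAIL PROTECTED]
05/23/2006 10:12:19  S    MOM rejected modify request, error: 15001
"""

if __name__ == "__main__":
    runs, requeues, rejects = summarize_trace(sample)
    print("run attempts:", runs)
    print("requeues:", requeues)
    print("rejections by error code:", dict(rejects))
```

Comparing the number of run attempts with the number of rejections shows at a glance whether any attempt got past the modify request.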


--
Dr. Stephen Childs,
Research Fellow, EGEE Project,    phone:                    +353-1-6081797
Computer Architecture Group,      email:        Stephen.Childs @ cs.tcd.ie
Trinity College Dublin, Ireland   web: http://www.cs.tcd.ie/Stephen.Childs
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
