OK, now I can reproduce it. If the proxy expires while the job manager is
waiting for the COMMIT_END signal, the job manager stops accepting new
jobs. It seems I can restore it by sending commit_end, but this still
looks like a bug to me, as the client may lose the job ID.
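
To make the sequence concrete, here is a toy model in Python of what I
think is happening. It is not GRAM code: the class, state, and method
names are made up for illustration, and it only encodes the behaviour
described above (expiry while waiting for COMMIT_END blocks new jobs
until commit_end arrives).

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    WAIT_COMMIT_END = auto()
    STOPPED = auto()


class ToyJobManager:
    """Toy model only; all names are hypothetical, not the GRAM5 API."""

    def __init__(self):
        self.state = State.IDLE
        self.accepted = []

    def submit(self, job_id, two_phase=False):
        # In the stopped state new requests are refused
        # (the server log shows status=-130 for these).
        if self.state is State.STOPPED:
            return False
        self.accepted.append(job_id)
        if two_phase:
            # A two-phase job leaves the manager waiting for COMMIT_END.
            self.state = State.WAIT_COMMIT_END
        return True

    def proxy_expired(self):
        # Assumption: expiry only wedges the manager while it is
        # waiting for COMMIT_END, which is the case I can reproduce.
        if self.state is State.WAIT_COMMIT_END:
            self.state = State.STOPPED

    def commit_end(self):
        # Sending commit_end appears to bring the manager back.
        self.state = State.IDLE


jm = ToyJobManager()
jm.submit("job-1", two_phase=True)
jm.proxy_expired()
print(jm.submit("job-2"))  # refused while stopped
jm.commit_end()
print(jm.submit("job-3"))  # accepted again
```

In this model job-2 is simply refused, whereas the real client either
hangs or gets error 156, so the client-side symptoms are not captured.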


Cheers,
Yuriy

Excerpts from Yuriy Halytskyy's message of Tue Sep 13 15:59:09 +1200 2011:
> Hi,
> 
> The user's job manager gets into a state where submission with globusrun 
> hangs and the job is never submitted.
> 
> server logs say:
> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104424Z 
> id=30804 event=gram.job.start level=INFO 
> gramid=/16145890501405029996/576663433152357309/
> peer=130.216.189.203:57672
> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104582Z 
> id=30804 event=gram.add_request.end level=WARN 
> gramid=/16145890501405029996/576663433152357309/ status=-130
> reason="the job manager was sent a stop signal (job is still running)"
> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104885Z 
> id=30804 event=gram.job.end level=INFO 
> gramid=/16145890501405029996/576663433152357309/ status=-130 msg="Request
> start failed" reason="the job manager was sent a stop signal (job is still 
> running)"
> 
> submission with globusrun hangs:
> 
> globusrun -batch     -r gram5.ceres.auckland.ac.nz 
> '&(executable=echo)(arguments= 
> hello)(job_type=single)(count=1)(hostCount=1)(vo="/nz/nesi")(maxWalltime=10)(directory=/home/smas036)'
> globus_gram_client_callback_allow successful
> GRAM Job submission successful
> https://gram5.ceres.auckland.ac.nz:40398/16145891598704212781/576663433152357309/
> 
> 
> submission with two-phase does not hang and results in:
> globusrun  -batch     -r gram5.ceres.auckland.ac.nz 
> '&(two_phase=5)(executable=echo)(arguments=
> hello)(job_type=single)(count=1)(hostCount=1)(vo="/nz/nesi")(maxWalltime=10)(directory=/home/smas036)'
> globus_gram_client_callback_allow successful
> GRAM Job submission failed because the job contact string does not match any 
> which the job manager is handling (error code 156)
> https://gram5.ceres.auckland.ac.nz:40398/16145891597960224316/576663433152357309/
> 
> 
> Our users run into this problem all the time, but I cannot reproduce 
> putting the job manager into that state. They can submit again once I kill it.
> 
> We had not seen this before our job submission software started submitting 
> jobs with two-phase.
> 
> Cheers,
> Yuriy
