OK, now I can reproduce it. When the proxy expires while the job manager is waiting for the COMMIT_END signal, the job manager stops accepting new jobs. It seems I can restore it by sending commit_end, but this still looks like a bug to me, as the client may lose the job ID.

Cheers,
Yuriy

Excerpts from Yuriy Halytskyy's message of Tue Sep 13 15:59:09 +1200 2011:
> Hi,
>
> user job manager gets into the state where the submission with globusrun
> hangs and job is never submitted
>
> server logs say:
> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104424Z
> id=30804 event=gram.job.start level=INFO
> gramid=/16145890501405029996/576663433152357309/
> peer=130.216.189.203:57672
> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104582Z
> id=30804 event=gram.add_request.end level=WARN
> gramid=/16145890501405029996/576663433152357309/ status=-130
> reason="the job manager was sent a stop signal (job is still running)"
> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104885Z
> id=30804 event=gram.job.end level=INFO
> gramid=/16145890501405029996/576663433152357309/ status=-130 msg="Request
> start failed" reason="the job manager was sent a stop signal (job is still
> running)"
>
> submission with globusrun hangs:
>
> globusrun -batch -r gram5.ceres.auckland.ac.nz
> '&(executable=echo)(arguments=
> hello)(job_type=single)(count=1)(hostCount=1)(vo="/nz/nesi")(maxWalltime=10)(directory=/home/smas036)'
> globus_gram_client_callback_allow successful
> GRAM Job submission successful
> https://gram5.ceres.auckland.ac.nz:40398/16145891598704212781/576663433152357309/
>
>
> submission with two-phase does not hang and results in:
> globusrun -batch -r gram5.ceres.auckland.ac.nz
> '&(two_phase=5)(executable=echo)(arguments=
> hello)(job_type=single)(count=1)(hostCount=1)(vo="/nz/nesi")(maxWalltime=10)(directory=/home/smas036)'
> globus_gram_client_callback_allow successful
> GRAM Job submission failed because the job contact string does not match any
> which the job manager is handling (error code 156)
> https://gram5.ceres.auckland.ac.nz:40398/16145891597960224316/576663433152357309/
>
>
> our users are getting into this problem all the time, but I cannot reproduce
> putting job manager into that state. They can submit again when I kill it.
>
> We haven't seen this, before our job submission software started submitting
> jobs with two-phase.
>
> Cheers,
> Yuriy
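
For anyone who wants to check whether their job manager is in the same state, here is a minimal Python sketch that pulls the key=value fields out of log lines like the ones quoted above and flags requests rejected with status -130. The function names parse_gram_log and is_stop_signal_failure are made up for illustration and are not part of any Globus tool. It assumes each log entry has been joined back onto a single line (the mail client wrapped them above) and that quoted values contain no stray quote characters.

```python
import shlex


def parse_gram_log(line):
    """Return the key=value fields of one log line as a dict.

    Hypothetical helper: assumes the entry sits on a single line and
    that the fields start at the first "ts=" token.
    """
    start = line.find("ts=")
    if start < 0:
        return {}
    fields = {}
    # shlex.split keeps quoted values such as reason="..." in one token.
    for token in shlex.split(line[start:]):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def is_stop_signal_failure(fields):
    """True for entries that ended with status -130, as in the logs above."""
    return fields.get("status") == "-130"


if __name__ == "__main__":
    sample = (
        "Sep 13 15:35:30 gram5 gridinfo[30804]: "
        "ts=2011-09-13T03:35:30.104582Z id=30804 "
        "event=gram.add_request.end level=WARN "
        "gramid=/16145890501405029996/576663433152357309/ status=-130 "
        'reason="the job manager was sent a stop signal (job is still running)"'
    )
    entry = parse_gram_log(sample)
    if is_stop_signal_failure(entry):
        print(entry["event"], entry["gramid"], entry["reason"])
```

Running it over the joined syslog lines would list each rejected request together with its gramid, which may make it easier to see when the job manager stopped accepting jobs.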
