Hi Stu,

it is rather serious. Normally there is a workaround for users (just request their old job statuses), but we cannot detect this issue automatically from the client side, as jglobus does not return an error on submission: no exception is thrown, and the error code is set to zero.

Cheers,
Yuriy

Excerpts from Stuart Martin's message of Fri Sep 16 03:57:20 +1200 2011:
> Hi Yuriy,
>
> We think a similar issue was hit and fixed in (soon to be released) GT 5.1.2.
> It has not yet been back ported to 5.0.x
>
> What is the priority on this? How much is this affecting you / your users?
>
> -Stu
>
> On Sep 13, 2011, at Sep 13, 1:27 AM, Yuriy Halytskyy wrote:
>
> > ok, now I can reproduce it. When proxy expires when job manager is
> > waiting for COMMIT_END signal, it stops accepting new jobs. It seems I
> > can restore it by sending commit_end, but this still looks like a bug
> > to me as client may lose job id.
> >
> >
> > Cheers,
> > Yuriy
> >
> > Excerpts from Yuriy Halytskyy's message of Tue Sep 13 15:59:09 +1200 2011:
> >> Hi,
> >>
> >> user job manager gets into the state where the submission with globusrun
> >> hangs and job is never submitted
> >>
> >> server logs say:
> >> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104424Z
> >> id=30804 event=gram.job.start level=INFO
> >> gramid=/16145890501405029996/576663433152357309/
> >> peer=130.216.189.203:57672
> >> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104582Z
> >> id=30804 event=gram.add_request.end level=WARN
> >> gramid=/16145890501405029996/576663433152357309/ status=-130
> >> reason="the job manager was sent a stop signal (job is still running)"
> >> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104885Z
> >> id=30804 event=gram.job.end level=INFO
> >> gramid=/16145890501405029996/576663433152357309/ status=-130 msg="Request
> >> start failed" reason="the job manager was sent a stop signal (job is still
> >> running)"
> >>
> >> submission with globusrun hangs:
> >>
> >> globusrun -batch -r gram5.ceres.auckland.ac.nz
> >> '&(executable=echo)(arguments=
> >> hello)(job_type=single)(count=1)(hostCount=1)(vo="/nz/nesi")(maxWalltime=10)(directory=/home/smas036)'
> >> globus_gram_client_callback_allow successful
> >> GRAM Job submission successful
> >> https://gram5.ceres.auckland.ac.nz:40398/16145891598704212781/576663433152357309/
> >>
> >>
> >> submission with two-phase does not hang and results in:
> >> globusrun -batch -r gram5.ceres.auckland.ac.nz
> >> '&(two_phase=5)(executable=echo)(arguments=
> >> hello)(job_type=single)(count=1)(hostCount=1)(vo="/nz/nesi")(maxWalltime=10)(directory=/home/smas036)'
> >> globus_gram_client_callback_allow successful
> >> GRAM Job submission failed because the job contact string does not match
> >> any which the job manager is handling (error code 156)
> >> https://gram5.ceres.auckland.ac.nz:40398/16145891597960224316/576663433152357309/
> >>
> >>
> >> our users are getting into this problem all the time, but I cannot
> >> reproduce putting job manager into that state. They can submit again when
> >> I kill it.
> >>
> >> We haven't seen this, before our job submission software started
> >> submitting jobs with two-phase.
> >>
> >> Cheers,
> >> Yuriy
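
Since the client sees no error on submission, one possible stopgap is to watch the server log for the rejection instead. The following is only a rough sketch in Python, not part of any Globus tooling: the function names `parse_event` and `stopped_manager_events` are made up for illustration, and it assumes that each event sits on a single line in the actual log file (the lines quoted above were wrapped by the mail client and are rejoined in `SAMPLE`).

```python
import re

# key=value pairs; a value is either a double-quoted string or a bare token
_PAIR = re.compile(r'(\w+)=("([^"]*)"|\S+)')


def parse_event(line):
    """Return the key=value fields of one log line as a dict (hypothetical helper)."""
    fields = {}
    for key, raw, quoted in _PAIR.findall(line):
        # Strip the surrounding quotes from quoted values, keep bare tokens as-is
        fields[key] = quoted if raw.startswith('"') else raw
    return fields


def stopped_manager_events(lines, status="-130"):
    """Yield gram.add_request.end events that carry the given status."""
    for line in lines:
        event = parse_event(line)
        if event.get("event") == "gram.add_request.end" and event.get("status") == status:
            yield event


# The three log lines from the report, rejoined onto one line each (assumption)
SAMPLE = [
    'Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104424Z '
    'id=30804 event=gram.job.start level=INFO '
    'gramid=/16145890501405029996/576663433152357309/ peer=130.216.189.203:57672',
    'Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104582Z '
    'id=30804 event=gram.add_request.end level=WARN '
    'gramid=/16145890501405029996/576663433152357309/ status=-130 '
    'reason="the job manager was sent a stop signal (job is still running)"',
    'Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104885Z '
    'id=30804 event=gram.job.end level=INFO '
    'gramid=/16145890501405029996/576663433152357309/ status=-130 '
    'msg="Request start failed" '
    'reason="the job manager was sent a stop signal (job is still running)"',
]

if __name__ == "__main__":
    for event in stopped_manager_events(SAMPLE):
        print(event["ts"], event["gramid"], event["reason"])
```

Matching only on `gram.add_request.end` avoids counting the same rejection twice, because the `gram.job.end` event that follows repeats the same status. Whether status -130 always indicates this particular stuck state is not established by the logs above, so a hit should be treated as a hint to inspect the job manager rather than as proof.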
