Hi Stu,

It is rather serious. Normally there is a workaround for users (just
request their old job statuses), but we cannot detect this issue
automatically from the client side, as jglobus does not return an error
on submission: no exception is thrown, and the error code is set to zero.
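
A server-side check might still be possible. Below is a minimal Python
sketch, assuming the gridinfo syslog entries quoted further down are each
written on a single line in the real log (they are wrapped in this mail).
The helper names parse_event and stopped_job_managers are made up for
illustration. It flags gram.job.end events with status=-130, which is the
signature seen in those logs.

```python
import re

# Matches key=value pairs; a value is either double-quoted (may contain
# spaces) or a run of non-whitespace characters.
PAIR = re.compile(r'(\w[\w.]*)=("[^"]*"|\S+)')


def parse_event(line):
    """Return the key=value fields of one log line as a dict."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}


def stopped_job_managers(lines):
    """Yield (ts, gramid) for gram.job.end events with status -130."""
    for line in lines:
        event = parse_event(line)
        if event.get("event") == "gram.job.end" and event.get("status") == "-130":
            yield event.get("ts"), event.get("gramid")
```

Feeding it the lines of the job manager's syslog output would list the
submissions that were rejected this way. It says nothing about why the
job manager stopped accepting jobs, only that it did.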

Cheers,
Yuriy

Excerpts from Stuart Martin's message of Fri Sep 16 03:57:20 +1200 2011:
> Hi Yuriy,
> 
> We think a similar issue was hit and fixed in (soon to be released) GT 5.1.2.
> It has not yet been backported to 5.0.x.
> 
> What is the priority on this?  How much is this affecting you / your users?
> 
> -Stu
> 
> On Sep 13, 2011, at 1:27 AM, Yuriy Halytskyy wrote:
> 
> > OK, now I can reproduce it. When the proxy expires while the job
> > manager is waiting for the COMMIT_END signal, it stops accepting new
> > jobs. It seems I can restore it by sending commit_end, but this still
> > looks like a bug to me, as the client may lose the job id.
> > 
> > 
> > Cheers,
> > Yuriy
> > 
> > Excerpts from Yuriy Halytskyy's message of Tue Sep 13 15:59:09 +1200 2011:
> >> Hi,
> >> 
> >> A user's job manager gets into a state where submission with globusrun
> >> hangs and the job is never submitted.
> >> 
> >> server logs say:
> >> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104424Z 
> >> id=30804 event=gram.job.start level=INFO 
> >> gramid=/16145890501405029996/576663433152357309/
> >> peer=130.216.189.203:57672
> >> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104582Z 
> >> id=30804 event=gram.add_request.end level=WARN 
> >> gramid=/16145890501405029996/576663433152357309/ status=-130
> >> reason="the job manager was sent a stop signal (job is still running)"
> >> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104885Z 
> >> id=30804 event=gram.job.end level=INFO 
> >> gramid=/16145890501405029996/576663433152357309/ status=-130 msg="Request
> >> start failed" reason="the job manager was sent a stop signal (job is still 
> >> running)"
> >> 
> >> submission with globusrun hangs:
> >> 
> >> globusrun -batch     -r gram5.ceres.auckland.ac.nz 
> >> '&(executable=echo)(arguments= 
> >> hello)(job_type=single)(count=1)(hostCount=1)(vo="/nz/nesi")(maxWalltime=10)(directory=/home/smas036)'
> >> globus_gram_client_callback_allow successful
> >> GRAM Job submission successful
> >> https://gram5.ceres.auckland.ac.nz:40398/16145891598704212781/576663433152357309/
> >> 
> >> 
> >> submission with two-phase does not hang and results in:
> >> globusrun  -batch     -r gram5.ceres.auckland.ac.nz 
> >> '&(two_phase=5)(executable=echo)(arguments=
> >> hello)(job_type=single)(count=1)(hostCount=1)(vo="/nz/nesi")(maxWalltime=10)(directory=/home/smas036)'
> >> globus_gram_client_callback_allow successful
> >> GRAM Job submission failed because the job contact string does not match 
> >> any which the job manager is handling (error code 156)
> >> https://gram5.ceres.auckland.ac.nz:40398/16145891597960224316/576663433152357309/
> >> 
> >> 
> >> Our users run into this problem all the time, but I cannot reproduce
> >> putting the job manager into that state. They can submit again once I
> >> kill the job manager.
> >> 
> >> We had not seen this before our job submission software started
> >> submitting jobs with two-phase.
> >> 
> >> Cheers,
> >> Yuriy
