Ok - I've got it recorded here for Joe to look at. He's off today, but will
look into this early next week.
http://jira.globus.org/browse/GRAM-252
-Stu
On Sep 15, 2011, at 9:20 PM, Yuriy Halytskyy wrote:
> Hi Stu,
>
> It is rather serious. Normally there is a workaround for users (just
> request their old job statuses), but we cannot detect this issue
> automatically from the client side because jglobus does not return an
> error on submission: no exception is thrown, and the error code is set
> to zero.
>
> Cheers,
> Yuriy
>
> Excerpts from Stuart Martin's message of Fri Sep 16 03:57:20 +1200 2011:
>> Hi Yuriy,
>>
>> We think a similar issue was hit and fixed in (soon to be released) GT
>> 5.1.2. It has not yet been backported to 5.0.x.
>>
>> What is the priority on this? How much is this affecting you / your users?
>>
>> -Stu
>>
>> On Sep 13, 2011, at 1:27 AM, Yuriy Halytskyy wrote:
>>
>>> OK, now I can reproduce it. If the proxy expires while the job manager
>>> is waiting for the COMMIT_END signal, the job manager stops accepting
>>> new jobs. It seems I can restore it by sending commit_end, but this
>>> still looks like a bug to me, as the client may lose the job ID.
>>>
>>>
>>> Cheers,
>>> Yuriy
>>>
>>> Excerpts from Yuriy Halytskyy's message of Tue Sep 13 15:59:09 +1200 2011:
>>>> Hi,
>>>>
>>>> The user job manager gets into a state where submission with globusrun
>>>> hangs and the job is never submitted.
>>>>
>>>> server logs say:
>>>> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104424Z
>>>> id=30804 event=gram.job.start level=INFO
>>>> gramid=/16145890501405029996/576663433152357309/
>>>> peer=130.216.189.203:57672
>>>> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104582Z
>>>> id=30804 event=gram.add_request.end level=WARN
>>>> gramid=/16145890501405029996/576663433152357309/ status=-130
>>>> reason="the job manager was sent a stop signal (job is still running)"
>>>> Sep 13 15:35:30 gram5 gridinfo[30804]: ts=2011-09-13T03:35:30.104885Z
>>>> id=30804 event=gram.job.end level=INFO
>>>> gramid=/16145890501405029996/576663433152357309/ status=-130 msg="Request
>>>> start failed" reason="the job manager was sent a stop signal (job is still
>>>> running)"
>>>>
>>>> submission with globusrun hangs:
>>>>
>>>> globusrun -batch -r gram5.ceres.auckland.ac.nz
>>>> '&(executable=echo)(arguments=
>>>> hello)(job_type=single)(count=1)(hostCount=1)(vo="/nz/nesi")(maxWalltime=10)(directory=/home/smas036)'
>>>> globus_gram_client_callback_allow successful
>>>> GRAM Job submission successful
>>>> https://gram5.ceres.auckland.ac.nz:40398/16145891598704212781/576663433152357309/
>>>>
>>>>
>>>> submission with two-phase does not hang and results in:
>>>> globusrun -batch -r gram5.ceres.auckland.ac.nz
>>>> '&(two_phase=5)(executable=echo)(arguments=
>>>> hello)(job_type=single)(count=1)(hostCount=1)(vo="/nz/nesi")(maxWalltime=10)(directory=/home/smas036)'
>>>> globus_gram_client_callback_allow successful
>>>> GRAM Job submission failed because the job contact string does not match
>>>> any which the job manager is handling (error code 156)
>>>> https://gram5.ceres.auckland.ac.nz:40398/16145891597960224316/576663433152357309/
>>>>
>>>>
>>>> Our users run into this problem all the time, but I cannot reproduce
>>>> putting the job manager into that state. They can submit again once
>>>> I kill it.
>>>>
>>>> We hadn't seen this before our job submission software started
>>>> submitting jobs with two-phase.
>>>>
>>>> Cheers,
>>>> Yuriy
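
Since the client gets no error on submission, one option is to watch for the condition on the server side instead. The following is only a minimal sketch, assuming the log records look like the excerpt quoted above (key=value fields, with quoted values for free text) and that each record has been joined back onto a single line. The function names are made up for illustration, and treating status=-130 on a gram.add_request.end event as the marker for this problem is an assumption based on that one excerpt.

```python
import re

# Matches key=value pairs; a value is either a double-quoted string
# or a run of non-whitespace characters.
PAIR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')

# Assumption: this event/status pair, copied from the quoted log excerpt,
# identifies a request rejected by a stopped job manager.
REJECT_EVENT = "gram.add_request.end"
STOP_SIGNAL_STATUS = "-130"


def parse_record(line):
    """Return the key=value fields of one log record as a dict of strings."""
    fields = {}
    for key, quoted, bare in PAIR_RE.findall(line):
        fields[key] = bare if bare else quoted
    return fields


def find_rejected_requests(lines):
    """Yield the gramid of every record that matches the reject marker."""
    for line in lines:
        fields = parse_record(line)
        if (fields.get("event") == REJECT_EVENT
                and fields.get("status") == STOP_SIGNAL_STATUS):
            yield fields.get("gramid")
```

Feeding it the lines of the log file (for example, by iterating over an open file object) yields the gramid of each rejected request, which could then be used to alert an admin or restart the affected job manager.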