Hi Yuriy,

On Mon, Aug 10, 2009 at 6:34 AM, Yuriy<[email protected]> wrote:
>  Some of the jobs submitted to torque via GRAM are killed after about
>  24 hours in the queue, all with the similar message in globus logs:

We've also seen this issue with our GT4 client in the Nimrod toolkit
(www.messagelab.monash.edu.au/Nimrod). It appears to be a limitation
of the WSRF resource lifetime model implemented in GT4.0. The
globusrun-ws C client has a termination option, "-term" (poorly
documented in the help, I might add), which sets the
expiration/termination time of the WSRF resource associated with the
job. The C client defaults this to 24 hours, regardless of any longer
walltime you request! I believe the server container cleans up these
resources once the termination time passes, with the unfortunate
default behaviour of deleting the local job if it is still active.

There is some more information regarding this issue here:
https://ticket.grid.iu.edu/goc/viewer?id=4896
and Globus documentation of the feature here:
http://www.globus.org/toolkit/docs/4.0/execution/wsgram/user-index.html#s-wsgram-user-lifetime
It seems like this has been addressed in the GT4.2 resource model.

I personally would like some clarification about which versions this
affects, and whether there have been server patches to avoid deleting
the job when the resource expires. We've observed the issue with
VDT1.8.1 clients against vanilla GT4.0.5 servers and VDT1.8.1 servers,
but not against our VDT1.10.1(s) server (there we still see the
NoSuchResourceException when querying the job, but our jobs do not
get deleted).

Further to this, when we added the -term argument to our command-line
wrapper API, we discovered that the +<HH:MM> form of the argument only
accepts values up to +239:59 (just under 10 days) into the future:
+240:00 fails, while +239:59 succeeds, e.g.

[bl...@nimrod1 ~]$ globusrun-ws -s -S -submit -F
east-globus.enterprisegrid.edu.au -Ft Fork -term +240:00 -c /bin/date
Delegating user credentials...Done.
Submitting job...Failed.
Cleaning up any delegated credentials...Done.
globusrun-ws: globus_i_submit.c::731:
Error submitting job
ManagedJobFactoryService_client.c::5877:
SOAP Fault
Fault code: soapenv:Server.generalException
Fault detail:
<ns1:BaseFault
xmlns:ns1="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd"><ns1:Timestamp>2009-08-09T23:56:06.965Z</ns1:Timestamp><ns1:Originator><wsa:Address>https://east-globus.enterprisegrid.edu.au:8443/wsrf/services/ManagedJobFactoryService</wsa:Address><wsa:ReferenceProperties><ns01:ResourceID
xmlns:ns01="http://www.globus.org/namespaces/2004/10/gram/job">Fork</ns01:ResourceID></wsa:ReferenceProperties><wsa:ReferenceParameters/></ns1:Originator></ns1:BaseFault>

[bl...@nimrod1 ~]$ globusrun-ws -s -S -submit -F
east-globus.enterprisegrid.edu.au -Ft Fork -term +239:59 -c /bin/date
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:3a6dcaf6-8540-11de-ac28-000cf1790d71
Termination time: 08/10/2009 22:56 GMT
Current job state: Active
Current job state: CleanUp-Hold
Mon Aug 10 09:56:21 EST 2009
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
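
In case it is useful, here is a minimal sketch of how a wrapper might
derive the -term value from a job's walltime. The function name
term_arg, the 60-minute default margin, and the zero-padded hour field
are my own assumptions and not anything provided by Globus; the
239:59 cap is simply the largest value that worked in the tests above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Globus): build a "+HH:MM" value for
# globusrun-ws -term from a walltime given in minutes.
term_arg() {
    walltime_min=$1
    margin_min=${2:-60}          # assumed safety margin, not a Globus default
    max_min=$((239 * 60 + 59))   # largest offset accepted in the tests above
    total=$((walltime_min + margin_min))
    # Clamp to the observed limit so the submission is not rejected.
    if [ "$total" -gt "$max_min" ]; then
        total=$max_min
    fi
    printf '+%02d:%02d\n' $((total / 60)) $((total % 60))
}

# Possible usage, for a 48-hour (2880-minute) walltime:
# globusrun-ws -s -S -submit -F east-globus.enterprisegrid.edu.au \
#     -Ft Fork -term "$(term_arg 2880)" -c /bin/date
```

Note that a job needing more than 239:59 would be clamped to the
limit, so it could still be deleted when the resource expires.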

Regards,
~Blair

-- 
In science one tries to tell people, in such a way
as to be understood by everyone, something that
no one ever knew before. But in poetry, it's the
exact opposite.
 - Paul Dirac
