Hi Yuriy,

On Mon, Aug 10, 2009 at 6:34 AM, Yuriy<[email protected]> wrote:
> Some of the jobs submitted to torque via GRAM are killed after about
> 24 hours in the queue, all with the similar message in globus logs:

We've also seen this issue with our GT4 client in the Nimrod toolkit
(www.messagelab.monash.edu.au/Nimrod). It appears to be a limitation of
the WSRF resource lifetime model implemented in GT4.0.

WS-GRAM has a termination option, "-term", for the globusrun-ws C client
(poorly documented in the help, I might add), which sets the
expiration/termination time for the WSRF resource associated with the
job. The C client sets this to 24 hours by default, without regard for
any greater walltime you provide! I believe that the server container
"cleans up" these resources after the termination time, with the
unfortunate default behaviour of deleting the local job if it is still
active.

There is some more information regarding this issue here:
https://ticket.grid.iu.edu/goc/viewer?id=4896
and Globus documentation of the feature here:
http://www.globus.org/toolkit/docs/4.0/execution/wsgram/user-index.html#s-wsgram-user-lifetime

It seems like this has been addressed in the GT4.2 resource model. I
personally would like some clarification about which versions this
affects, and whether there have been patches to the server to avoid job
deletion on resource expiration...?

We've observed the issue with VDT1.8.1 clients against vanilla GT4.0.5
servers and VDT1.8.1 servers, but do not have the issue against our
VDT1.10.1 server(s) (although we still see the NoSuchResourceException
when querying the job, our jobs do not get deleted).

Further to this, when we added the -term argument to our command line
wrapper API, we discovered that the +<HH:MM> form of the argument only
accepts offsets of up to 9d23h59m into the future (+239:59 is accepted,
+240:00 is rejected), e.g.

[bl...@nimrod1 ~]$ globusrun-ws -s -S -submit -F east-globus.enterprisegrid.edu.au -Ft Fork -term +240:00 -c /bin/date
Delegating user credentials...Done.
Submitting job...Failed.
Cleaning up any delegated credentials...Done.
globusrun-ws: globus_i_submit.c::731: Error submitting job
ManagedJobFactoryService_client.c::5877: SOAP Fault
Fault code: soapenv:Server.generalException
Fault detail: <ns1:BaseFault xmlns:ns1="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd"><ns1:Timestamp>2009-08-09T23:56:06.965Z</ns1:Timestamp><ns1:Originator><wsa:Address>https://east-globus.enterprisegrid.edu.au:8443/wsrf/services/ManagedJobFactoryService</wsa:Address><wsa:ReferenceProperties><ns01:ResourceID xmlns:ns01="http://www.globus.org/namespaces/2004/10/gram/job">Fork</ns01:ResourceID></wsa:ReferenceProperties><wsa:ReferenceParameters/></ns1:Originator></ns1:BaseFault>

[bl...@nimrod1 ~]$ globusrun-ws -s -S -submit -F east-globus.enterprisegrid.edu.au -Ft Fork -term +239:59 -c /bin/date
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:3a6dcaf6-8540-11de-ac28-000cf1790d71
Termination time: 08/10/2009 22:56 GMT
Current job state: Active
Current job state: CleanUp-Hold
Mon Aug 10 09:56:21 EST 2009
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.

Regards,
~Blair

-- 
In science one tries to tell people, in such a way as to be understood
by everyone, something that no one ever knew before. But in poetry,
it's the exact opposite. - Paul Dirac
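
P.S. For anyone wrapping globusrun-ws the way we do, here is a rough sketch of how a wrapper might build the -term value from a job's walltime. It is only an illustration: term_offset, MAX_TERM_MINUTES and the one-hour margin are made-up names and values, and the 239:59 ceiling is simply what we observed with our client in the session above, so it may well differ with other versions.

```python
# Largest offset accepted in the session above (+239:59); +240:00 was rejected.
# Assumption: the same ceiling applies elsewhere. It was only observed with
# our own client and server, so treat it as a guess for other setups.
MAX_TERM_MINUTES = 239 * 60 + 59


def term_offset(walltime_minutes, margin_minutes=60):
    """Build a '+HH:MM' value for 'globusrun-ws -term' (hypothetical helper).

    The offset is the walltime plus a safety margin, clamped to the
    largest value we saw accepted.
    """
    if walltime_minutes < 0 or margin_minutes < 0:
        raise ValueError("walltime and margin must be non-negative")
    total = min(walltime_minutes + margin_minutes, MAX_TERM_MINUTES)
    hours, minutes = divmod(total, 60)
    return "+%02d:%02d" % (hours, minutes)


if __name__ == "__main__":
    # A 48 hour job with the default one hour margin.
    print(term_offset(48 * 60))
    # A 30 day job: longer than the ceiling, so the value is clamped.
    print(term_offset(30 * 24 * 60))
```

Note that clamping only avoids the submission failure; a job that really needs more than about ten days would still outlive its resource, so the termination time would have to be extended some other way.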
