On Aug 10, 2011, at 7:26 PM, Yuriy Halytskyy wrote:
> Hi,
> 
> When job proxy expires, its manager dies, but the job itself keeps
> running. There is no way to check its status, and even when job is
> submitted with save_state=true, the restart does not work. Also, audit
> never puts the record of completed job into the database. 
> 
> On the other hand if I submit two jobs, first one with long proxy, and
> second job with shorter proxy, even when second proxy expires I can
> still query the job as long as first proxy is valid and job manager is
> running. 
> 
> GRAM4 never had this problem, even when proxy expires job status is
> still available and it is properly audited. Is it possible for gram5
> to have the same behaviour? At least being able to restart job manager
> after proxy expiration and have it properly audited. 

The restart operation ought to work. There are a few ways it can happen;
depending on how you are monitoring the job, you might have to do different
things.

1. Submit any job to the same resource as the original job. When a new job
   manager is started, it will resume monitoring whatever jobs remain from
   previous job managers. Job state callbacks will be sent to clients which
   were registered to the previous job manager process with the new job manager
   contact. This contact will be the same as the old contact except for
   (potentially) the TCP port number. This ought to be enough to get the audit
   record written.

2. Submit a job with the RSL &(restart=old-job-manager-contact). The response 
   to this will be the new job contact and the job's current state. If there was
   no job manager running, it will act like #1 as well, resuming all existing
   job monitoring and state callback operations.
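
As a rough illustration of the two points above, here is a minimal Python
sketch. It only builds the restart RSL string in the form shown in #2 and
compares an old and a new contact while ignoring the port. The function names
and the example contact URLs are hypothetical, not part of any GRAM API, and
the sketch assumes the contact is a plain URL with no characters that would
need RSL quoting.

```python
from urllib.parse import urlsplit


def restart_rsl(old_contact: str) -> str:
    """Build a restart RSL string in the form &(restart=<old contact>).

    Assumption: the contact contains no characters that need RSL quoting.
    """
    return f"&(restart={old_contact})"


def same_job_ignoring_port(old_contact: str, new_contact: str) -> bool:
    """Return True if two contacts match in everything except the TCP port."""
    old, new = urlsplit(old_contact), urlsplit(new_contact)
    return (old.scheme, old.hostname, old.path) == (
        new.scheme,
        new.hostname,
        new.path,
    )


# Hypothetical contacts, made up for illustration only.
old = "https://gram.example.org:50001/16001/1312961160/"
new = "https://gram.example.org:50742/16001/1312961160/"

print(restart_rsl(old))
print(same_job_ignoring_port(old, new))
```

The resulting string would then be submitted to the same resource as the
original job with whatever GRAM client you normally use.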

If you attempt to use the gram status API instead of relying on callbacks, you 
won't be able to get status unless you do #2, because you won't know the port 
to contact. I'd like to some day add more messaging through the gatekeeper so
that the job manager doesn't have to have its own port for receiving messages
and we don't have to deal with such problems.

Note that if you do the GRAM two-phase commit protocol, the job state will
remain in place until a client acknowledges it, so that you can do #2 and
check for status whenever you are able, even after the job terminates.

Joe
