> The restart operation ought to work.

Yes, it does; my mistake. I tried to restart a job with a very short proxy lifetime and it returned status 131. New job submissions also fail when the proxy lifetime is under 10 minutes (?). With a fresh proxy, restart works as it should.

> (potentially) the TCP port number. This ought to be enough to get the audit
> record to happen.

The problem with that is that jobs should be audited regardless of how the user behaves. If a job is never checked again (the user just copied the outputs over GridFTP and forgot about the job), there is no record.

> Note that if you do the GRAM two-phase commit protocol, the job state will
> remain in place until a client acknowledges it, so that you can do #2 and
> check for status whenever you are able, even after the job terminates.

Thanks, this is exactly what I need. My code used to assume that if restart fails, the job is complete, but with two-phase commit I can actually see what happened.

Cheers,
Yuriy

Excerpts from Joseph Bester's message of Fri Aug 12 02:48:59 +1200 2011:
> On Aug 10, 2011, at 7:26 PM, Yuriy Halytskyy wrote:
> > Hi,
> >
> > When a job's proxy expires, its manager dies, but the job itself keeps
> > running. There is no way to check its status, and even when the job is
> > submitted with save_state=true, the restart does not work. Also, audit
> > never puts the record of the completed job into the database.
> >
> > On the other hand, if I submit two jobs, the first with a long proxy and
> > the second with a shorter proxy, then even when the second proxy expires
> > I can still query the job as long as the first proxy is valid and the job
> > manager is running.
> >
> > GRAM4 never had this problem: even when the proxy expires, job status is
> > still available and the job is properly audited. Is it possible for GRAM5
> > to have the same behaviour? At least being able to restart the job manager
> > after proxy expiration and have it properly audited.
>
> The restart operation ought to work. There are a few ways it can happen;
> depending on how you are monitoring the job, you might have to do different
> things.
>
> 1. Submit any job to the same resource as the original job. When a new job
>    manager is started, it will resume monitoring whatever jobs remain from
>    previous job managers. Job state callbacks will be sent to clients which
>    were registered to the previous job manager process with the new job
>    manager contact. This contact will be the same as the old contact except
>    for (potentially) the TCP port number. This ought to be enough to get the
>    audit record to happen.
>
> 2. Submit a job with the RSL &(restart=old-job-manager-contact). The response
>    to this will be the new job contact and the job's current state. If there
>    was no job manager running, it will act like #1 as well, resuming all
>    existing job monitoring and state callback operations.
>
> If you attempt to use the GRAM status API instead of relying on callbacks,
> you won't be able to get status unless you do #2, because you won't know the
> port to contact. I'd like to some day add more messaging through the
> gatekeeper so that the job manager doesn't have to have its own port for
> receiving messages and we don't have to deal with such problems.
>
> Note that if you do the GRAM two-phase commit protocol, the job state will
> remain in place until a client acknowledges it, so that you can do #2 and
> check for status whenever you are able, even after the job terminates.
>
> Joe
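
For anyone scripting option #2 from the quoted message, here is a minimal sketch of building the restart RSL string in Python. The function name restart_rsl and the example contact are made up for illustration, and combining two_phase with restart in the same RSL is my assumption about what the job manager accepts, so check it against your GRAM5 version before relying on it. The sketch only builds the string; submitting it is left to whatever GRAM client you already use.

```python
def restart_rsl(old_contact, two_phase_timeout=None):
    """Build an RSL string asking GRAM to restart monitoring of an old job.

    old_contact is the job contact returned when the job was first submitted.
    two_phase_timeout (seconds) is optional; adding it here is an assumption.
    """
    # The contact is written unquoted, as in &(restart=old-job-manager-contact),
    # so reject characters that would break the RSL relation.
    if not old_contact or any(ch.isspace() or ch in '()"' for ch in old_contact):
        raise ValueError("job contact must be non-empty, with no spaces, quotes or parentheses")

    rsl = "&(restart=" + old_contact + ")"
    if two_phase_timeout is not None:
        # With two-phase commit, job state is kept until a client acknowledges it.
        rsl += "(two_phase=" + str(int(two_phase_timeout)) + ")"
    return rsl


if __name__ == "__main__":
    # Hypothetical contact; use the one your original submission returned.
    contact = "https://grid.example.org:38824/1/2/"
    print(restart_rsl(contact))
    print(restart_rsl(contact, two_phase_timeout=60))
```

The response to a restart submission carries the new job contact, so a script should store that in place of the old one before polling for status.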
