Hi Yuriy,

This matches my experience with GRAM5.  GT5 tries to "merge" jobs
running under the same account so that they are all managed by the
same job manager.  This improves scaling: if a user submits a large
number of jobs, there is still only one job manager running.

We ran into issues when we used this with our model of shared accounts
(yes, we are now moving from shared accounts to account pools).  GRAM5
was trying to run all jobs of all users of the shared account under a
single job manager, with the certificate submitted with the first job
being used for all of them.  That broke any attempt to retrieve the job
status for the other jobs, and was solved by running a separate job
manager for each combination of <local account, DN>.
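
To illustrate the difference, here is a minimal Python sketch.  It is a
toy model, not GRAM5 code: the account name, the DNs and the group_jobs
helper are all made up, and each group simply stands for one job manager.

```python
from collections import defaultdict

def group_jobs(jobs, key):
    """Group job IDs by key; each resulting group models one job manager."""
    managers = defaultdict(list)
    for job in jobs:
        managers[key(job)].append(job["id"])
    return dict(managers)

# Hypothetical jobs: two users mapped to the same shared local account.
jobs = [
    {"id": "job1", "account": "grid-shared", "dn": "/DC=org/CN=Alice"},
    {"id": "job2", "account": "grid-shared", "dn": "/DC=org/CN=Bob"},
    {"id": "job3", "account": "grid-shared", "dn": "/DC=org/CN=Alice"},
]

# Keyed by local account only: all three jobs share one job manager.
by_account = group_jobs(jobs, key=lambda j: j["account"])

# Keyed by <local account, DN>: one job manager per user of the account.
by_account_dn = group_jobs(jobs, key=lambda j: (j["account"], j["dn"]))
```

With the first key, all jobs land in a single group regardless of who
submitted them; with the second, Alice's and Bob's jobs are kept apart.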

What you are observing now fits this experience.  It explains why this
is happening, but it does not convince me that this is the right thing
to happen.

Having inconsistent behavior when a short-lived proxy expires (depending
on whether there is another job running with a longer-lived proxy) is
quite a bad thing.
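
The inconsistency can be sketched with a toy model in Python.  It only
mirrors the behavior you reported and assumes that a shared job manager
stays alive while any proxy it holds is still valid; it is not based on
the GRAM5 source, and the times and the manager_alive helper are made
up for the example.

```python
def manager_alive(proxy_expiries, now):
    """Toy rule: the manager runs while at least one proxy is unexpired."""
    return any(now < expiry for expiry in proxy_expiries)

# Hypothetical proxy lifetimes, in hours after submission.
short_proxy = 1
long_proxy = 12

# Two hours in, the short-lived proxy has already expired.
now = 2

# The short-proxy job on its own: its manager is gone, no status available.
alone = manager_alive([short_proxy], now)

# The same job sharing a manager with a long-proxy job: still queryable.
shared = manager_alive([short_proxy, long_proxy], now)
```

The same job, with the same expired proxy, ends up queryable in one case
and not in the other, purely depending on what else the account is
running.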

I understand the job manager cannot continue running when the proxy
expires, but at least reconnecting to the job / restarting the job
manager and getting reliable audit messages should work.

That looks like a GRAM5 bug to me.  Would you be able to investigate and
collect more data on how the restart breaks?


Cheers,
Vlad

Yuriy Halytskyy wrote:
> Hi,
> 
> When a job's proxy expires, its job manager dies, but the job itself
> keeps running. There is no way to check its status, and even when the
> job is submitted with save_state=true, the restart does not work. Also,
> audit never puts the record of the completed job into the database.
> 
> On the other hand, if I submit two jobs, the first with a long proxy
> and the second with a shorter proxy, then even when the second proxy
> expires I can still query the job as long as the first proxy is valid
> and the job manager is running.
> 
> GRAM4 never had this problem: even when the proxy expires, the job
> status is still available and the job is properly audited. Is it
> possible for GRAM5 to have the same behaviour? At least being able to
> restart the job manager after proxy expiration and have it properly
> audited.
> 
> 
> Cheers,
> Yuriy


-- 
Vladimir Mencl, Ph.D.
E-Research Services and Systems Consultant
BlueFern Computing Services
University of Canterbury
Private Bag 4800
Christchurch 8140
New Zealand

http://www.bluefern.canterbury.ac.nz
mailto:[email protected]
Phone: +64 3 364 3012
Mobile: +64 21 997 352
Fax: +64 3 364 3002
