Hi Joe, 

Thanks for your thoughts on this. I'll try to get more info from the
logs, although the problem has now almost completely gone away.

We had problems with LoadLeveler over the last few days, but once we
figured out what the problem was and worked around it, those grid-related
issues went away too. It was just a bit strange because Globus seemed to
lose jobs at a much higher rate than our 'normal' LoadLeveler users
(of whom we have far more).

My current working theory is that Globus checks the job status more
often than a normal user would and was therefore much more likely to find
LoadLeveler in a broken state. Once that happened, it considered the job
gone and deleted its job state files. Does that sound possible to you?
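
To put rough numbers on that theory, here is a minimal sketch. It assumes
each status poll independently hits a broken LoadLeveler with some fixed
probability; that independence, the function name and the figures are all
my own illustrative assumptions, not anything measured here or taken from
the GRAM code.

```python
def chance_of_hitting_outage(p_broken: float, polls: int) -> float:
    """Probability that at least one of `polls` status checks lands on a
    broken LRM, assuming every check independently fails with p_broken."""
    if not 0.0 <= p_broken <= 1.0:
        raise ValueError("p_broken must be between 0 and 1")
    if polls < 0:
        raise ValueError("polls must not be negative")
    # Complement of "every single poll saw a healthy LRM".
    return 1.0 - (1.0 - p_broken) ** polls


if __name__ == "__main__":
    # Hypothetical comparison: a user checking a handful of times versus
    # a poller checking hundreds of times over the same job lifetime.
    for polls in (5, 360):
        print(polls, round(chance_of_hitting_outage(0.01, polls), 3))
```

If a single bad poll is enough for the job to be treated as gone, then a
client that polls far more often would lose jobs far more often, which
would match what we saw.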

Best,
Markus

On Thu, 2013-09-19 at 11:51 -0400, Joseph Bester wrote:
> That's normally not deleted until the job is completed and the two-phase 
> commit is done. The other reason why GRAM might delete it would be if the job 
> expires (after it hits an end state and hasn't been touched in 4 hours). Is 
> there a possibility of something else "cleaning" out that directory? Do those 
> files exist? 
> 
> It's possible to increase the logging level as described here: 
> http://www.globus.org/toolkit/docs/5.2/5.2.4/gram5/admin/#idp7912160 which 
> might give some info about what the job manager thinks is going on.
> 
> Joe
> 
> On Sep 18, 2013, at 3:33 PM, Markus Binsteiner <m.binstei...@auckland.ac.nz> 
> wrote:
> > Hi.
> > 
> > We are experiencing a major problem with losing job states: after a
> > while (an hour or so), every job we submit via Globus ends up in an
> > unknown state. I'm not quite sure where to start looking. The logs say:
> > 
> > ts=2013-09-18T19:20:31.006776Z id=14670 event=gram.state_file_read.end
> > level=ERROR gramid=/16361930530915519966/6437524403105335712/
> > path=/var/lib/globus/gram_job_state/mbin029/16966e4/loadleveler/job.16361930530915519966.6437524403105335712
> >  msg="Error checking file status" status=-121 errno=2 reason="No such file 
> > or directory" 
> > 
> > every time another status is lost. We are using jglobus (1.8.x),
> > two-phase commit and we poll the LRM (LoadLeveler -- not using scheduler
> > event generator).
> > 
> > Any idea what could cause those files to be deleted?
> > 
> > Best,
> > Markus
> 
> 

