Quick note on the last paragraph: We don't use a SEG because it proved to be
unreliable.
On 24/09/13 8:26 AM, Markus Binsteiner wrote:
Hi Joe,
thanks for your thoughts on this; I'll try to get more info from the
logs, although the problem has now almost entirely gone away.
We had problems with LoadLeveler over the last few days, but once we
figured out what the problem was and worked around it, those
grid-related issues went away too. It was just a bit strange that
globus seemed to lose jobs at a much higher rate than our 'normal'
LoadLeveler users (of whom we have far more).
My current working theory is that globus checked the job status more
often than a normal user would and was therefore much more likely to
find LoadLeveler in a broken state. Once that happened, it considered
the job gone and deleted its job state files. Does that sound possible
to you?
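If it helps, here is roughly what I'm imagining (a minimal sketch in
Python, not GRAM's actual code; the function names and poll interval
are made up for illustration):

    # Hypothetical sketch of the suspected failure mode -- not GRAM's
    # actual code; all names here are invented for illustration.
    import time

    POLL_INTERVAL = 10  # a frequent poller hits the broken window more often

    def query_lrm(job_id):
        # Stand-in for an llq-style status query. While LoadLeveler is
        # broken, assume the result is indistinguishable from "job
        # unknown to the LRM".
        return None

    def cleanup_state_files(job_id):
        print("deleting job state files for", job_id)

    def poll(job_id):
        while True:
            if query_lrm(job_id) is None:
                # Misreads a transient LRM failure as "job is gone",
                # so the state files get removed.
                cleanup_state_files(job_id)
                return
            time.sleep(POLL_INTERVAL)

    poll("16361930530915519966.6437524403105335712")

A human running llq once in a while would rarely land in the broken
window, which would explain the skewed loss rate.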
Best,
Markus
On Thu, 2013-09-19 at 11:51 -0400, Joseph Bester wrote:
That's normally not deleted until the job is completed and the two-phase commit is done.
The other reason GRAM might delete it is if the job expires (it hit an
end state and hasn't been touched for 4 hours). Is there a possibility of something else
"cleaning" out that directory? Do those files exist?
It's possible to increase the logging level as described here:
http://www.globus.org/toolkit/docs/5.2/5.2.4/gram5/admin/#idp7912160 which
might give some info about what the job manager thinks is going on.
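Once the level is raised, a quick filter like this might help spot
which jobs are losing their state files (a sketch that only assumes the
key=value format from your excerpt; the log path is hypothetical, so
adjust it to your setup):

    # Sketch: count state-file read errors per gramid in the job
    # manager log. Assumes the key=value log format from the excerpt;
    # the log path is a placeholder.
    import re
    from collections import Counter

    errors = Counter()
    with open("/var/log/globus/gram_mbin029.log") as log:
        for line in log:
            if "event=gram.state_file_read.end" in line and "level=ERROR" in line:
                match = re.search(r"gramid=(\S+)", line)
                if match:
                    errors[match.group(1)] += 1

    for gramid, count in errors.most_common():
        print(count, gramid)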
Joe
On Sep 18, 2013, at 3:33 PM, Markus Binsteiner <[email protected]>
wrote:
Hi.
We are experiencing a major problem with losing job state: after a
while (an hour or so), every job we submit via globus ends up in an
unknown state. I'm not quite sure where to start looking; the logs say:
ts=2013-09-18T19:20:31.006776Z id=14670 event=gram.state_file_read.end level=ERROR
gramid=/16361930530915519966/6437524403105335712/
path=/var/lib/globus/gram_job_state/mbin029/16966e4/loadleveler/job.16361930530915519966.6437524403105335712
msg="Error checking file status" status=-121 errno=2 reason="No such file or directory"
Every time this happens, another job's status is lost. We are using
jglobus (1.8.x) with two-phase commit, and we poll the LRM (LoadLeveler
-- we are not using the scheduler event generator).
Any idea what could cause those files to be deleted?
Best,
Markus
--
Martin Feller
Centre for eResearch, The University of Auckland
24 Symonds Street, Building 409, Room G21
e: [email protected]
p: +64 9 3737599 ext 82099