Hi,
Just chipping in - while looking into the issue Markus is observing (on our
separate system, using LoadLeveler with the SEG), I've run into something
else in the job manager:
The Globus job manager wasn't picking up that my test job had completed, so
I had a look.
With lsof, I could see it was still reading SEG events from
/var/lib/globus/globus-seg-loadleveler/20130916 and was stubbornly
trying to open /var/lib/globus/globus-seg-loadleveler/20130917
- even though the current events were going into a newer file named after
the current date: 20130925.
Just doing
  touch /var/lib/globus/globus-seg-loadleveler/20130917
got the job manager out of the loop trying to open 20130917; it then
switched to the current file, 20130925.
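In case it's useful to others, here is a rough sketch of the diagnosis and
workaround as shell commands (the directory is just our SEG log location,
and the date range in the loop is an example - in my case touching the
single missing date was enough):

  SEG_DIR=/var/lib/globus/globus-seg-loadleveler

  # Which dated SEG file is the job manager still holding open / retrying?
  lsof +D "$SEG_DIR"

  # Create empty files for any missing dates so the job manager can step
  # past days that produced no SEG events
  for d in 20130917 20130918 20130919; do
      [ -e "$SEG_DIR/$d" ] || touch "$SEG_DIR/$d"
  done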
Would that count as a bug - the job manager not being able to skip dates
that have no events when switching between SEG files?
Cheers,
Vlad
On 24/09/13 08:39, Martin Feller wrote:
Quick note on the last paragraph: We don't use a SEG because it proved
to be unreliable.
On 24/09/13 8:26 AM, Markus Binsteiner wrote:
Hi Joe,
thanks for your thoughts on this, I'll try to get more info from the
logs, although the problem has now almost completely gone away.
We had problems with LoadLeveler over the last few days, but once we
figured out what the problem was and worked around it, those grid-related
issues went away too. It was just a bit strange that Globus seemed to
lose jobs at a much higher rate than our 'normal' LoadLeveler users
(of whom we have many more).
My current working theory is that Globus checked the job status more
often than a normal user would and was therefore much more likely to find
LoadLeveler in a broken state. Once that happened, it considered the job
gone and deleted the job state files. Does that sound possible to you?
Best,
Markus
On Thu, 2013-09-19 at 11:51 -0400, Joseph Bester wrote:
The job state file is normally not deleted until the job is completed and the
two-phase commit is done. The other reason why GRAM might delete it
would be if the job expires (after it hits an end state and hasn't
been touched in 4 hours). Is there a possibility of something else
"cleaning" out that directory? Do those files exist?
It's possible to increase the logging level as described here:
http://www.globus.org/toolkit/docs/5.2/5.2.4/gram5/admin/#idp7912160
which might give some info about what the job manager thinks is going
on.
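Once the level is raised, grepping the job manager log for the affected
job's gramid should show what it thought happened to the state file -
something like this, assuming the default per-user log location under
/var/log/globus/ (adjust the path to wherever your job manager logs go):

  # All log entries for the failing job (gramid from your error line)
  grep '16361930530915519966/6437524403105335712' /var/log/globus/gram_mbin029.log

  # Or just the state-file related events
  grep 'gram.state_file' /var/log/globus/gram_mbin029.log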
Joe
On Sep 18, 2013, at 3:33 PM, Markus Binsteiner
<[email protected]> wrote:
Hi.
We are experiencing a major problem with losing job states: after a
while (an hour or so), every job we submit via Globus ends up in an
unknown state. I'm not quite sure where to start looking; the logs say:
ts=2013-09-18T19:20:31.006776Z id=14670 event=gram.state_file_read.end
level=ERROR gramid=/16361930530915519966/6437524403105335712/
path=/var/lib/globus/gram_job_state/mbin029/16966e4/loadleveler/job.16361930530915519966.6437524403105335712
msg="Error checking file status" status=-121 errno=2 reason="No such
file or directory"
every time another job status is lost. We are using jglobus (1.8.x) and
two-phase commit, and we poll the LRM (LoadLeveler -- not using the
scheduler event generator).
Any idea what could cause those files to be deleted?
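One way to catch whatever removes them would be to watch the state
directory for deletions - a rough sketch (inotifywait is from
inotify-tools; the path is the one from the log line above):

  # Do the state files exist right after the job is submitted?
  ls -l /var/lib/globus/gram_job_state/mbin029/16966e4/loadleveler/

  # Watch the state tree and report anything that deletes or moves files
  inotifywait -m -r -e delete -e moved_from /var/lib/globus/gram_job_state/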
Best,
Markus
--
Vladimir Mencl, Ph.D.
E-Research Services and Systems Consultant
BlueFern Computing Services
University of Canterbury
Private Bag 4800
Christchurch 8140
New Zealand
http://www.bluefern.canterbury.ac.nz
mailto:[email protected]
Phone: +64 3 364 3012
Mobile: +64 21 997 352
Fax: +64 3 364 3002