On Feb 6, 2012, at 1:43 AM, Brian O'Connor wrote:

> Hi,
>
> I'm in the process of stress testing a system for use in production. The software stack is:
>
> pegasus workflows -> condor-G -> GRAM -> SGE
>
> I recently upgraded to GT 5.2 to attempt to address stability issues. I was having a whole series of issues with the globus-job-manager either becoming unresponsive, dying, opening too many file handles, etc. Sorry in advance if I ask some stupid questions below; I'm still trying to wrap my mind around the whole software stack and the best way to set it up. I'm not submitting tens of thousands of jobs here... something on the order of <500 max. So I feel that I should be able to get this system rock-solid, yet it's been super flaky for me, which is troubling.
>
> So this brings up my first question:
>
> 1) What do you do when the globus-job-manager dies? It seems like this is a very critical daemon: the URLs I get back if I manually use globus-job-submit no longer work, and I end up with random jobs running on the cluster that I have no way to shut down. My question here is how do I "clean up" after a globus-job-manager crash (or a server reboot), and is there a way to trigger a new instance to start up and watch over the jobs left running on the cluster? From the calling client's perspective, how would they reconnect with these lost jobs when the job URL is no longer valid? Has anyone had any experience with the added complexity of Condor-G and how it deals with a globus-job-manager failure?

GRAM-wise, it's possible to get a new handle by submitting a restart job with the old job handle. This works even if the old job manager is still running, in which case it will return the old handle. Any operation that causes a job manager to be started (submit, restart, version check) will, if it detects that no job manager is running, make sure one is started to monitor the existing jobs (as long as it has a valid proxy). It will automatically clean up jobs once the client has done the two-phase commit end. For 5.2.1, we're working on adding an expiration time for missed two-phase commits, so that a job completely abandoned by its user will eventually get cleaned up.

So Condor should be able to get a new handle to the job by submitting a restart job with the old job handle. I'm not too sure about the specifics of the Condor-G case, but it seems to do a lot of job manager restarts, so I'd be a little surprised if it isn't capable of getting the handle back. Maybe ask on a Condor list about what you are seeing. Of course, if the job manager is crashing, please report a problem to jira.globus.org so we can fix it.
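If you want to try this by hand, a restart submission from the command line should look roughly like the sketch below. The sqwstage/jobmanager-sge resource name is my guess at your setup, and the contact string is the one from your third question further down, so substitute your own; the "restart" RSL attribute is the important part:

    # Ask GRAM to reattach to an existing job via its old contact string.
    # This starts a new job manager if none is running and prints the job
    # handle (the old one if that manager was still alive, otherwise a new one).
    globusrun -batch -r sqwstage/jobmanager-sge \
        '&(restart = "https://sqwstage:59384/16217785859533749891/5764607537333819270/")'

globus-job-status should then work against whatever handle that prints.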
> My second question is related to SEG. I know SEG is the better way to go. I'm currently using polling, though, since our sysadmin and I weren't able to get SEG to work with SGE. When deployed, we saw it parse through the SGE log and, once it hit the end, it started looking for a reporting.0 file rather than just waiting for more data to be written to the log file. There is no reporting.0 file, so the event generator just sat there and kept looking for it rather than watching the real log file. So, question (2): has anyone seen problems with the SEG module for SGE looking for a non-existent reporting file and, therefore, missing new job events? Is there any way to explicitly tell SEG to read only one reporting file and not to try looking for rotated log files?

Based on a quick reading of the code, it should recover and go back to reading the main log file after processing any rotated one. One potential hangup with the SGE module is that it relies on SGE writing the reporting file, but not processing that file with the dbwriter process, as that alters the contents of the reporting file in a way that confuses the SEG module.
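On the "writing the reporting file" point: SGE only produces that file if reporting is switched on in the cluster configuration. A setup I'd expect to work looks something like the line below, edited via qconf -mconf as the SGE admin (the flush interval is just an illustration; whether you can leave dbwriter out entirely depends on what else at your site consumes the file):

    reporting_params   accounting=true reporting=true joblog=true flush_time=00:00:05 sharelog=00:00:00

The reporting=true and joblog=true parts are what make SGE write per-job records to the reporting file, and a short flush_time means the SEG sees state changes promptly. And, per the note above, don't also point dbwriter at that file.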
> OK, since we're using polling (at least until we get SEG working), that brings me to the third and final question. I submitted about 25 workflows via Condor DAGMan. At any one time there are between 1 and 20 jobs running per workflow. The load on the host (which is our Condor machine, GRAM server, and SGE submission host) is only slight. However, when I do:
>
> condor_q -long 6731
>
> and look at a job that should have finished hours ago, I see the Globus URL is:
>
> https://sqwstage:59384/16217785859533749891/5764607537333819270/
>
> So I check the status:
>
> globus-job-status https://sqwstage:59384/16217785859533749891/5764607537333819270/
> ACTIVE
>
> So Globus (and Condor by extension) think the job is running. It shouldn't be, though, since this is a fast-running job.
>
> I then look at every job running on the SGE cluster:
>
> for i in `qstat | grep seqware | awk '{print $1}'`; do qstat -j $i | grep '16217785859533749891.5764607537333819270'; done;
>
> And I get nothing... the job is not running on the cluster. I think it's because it ran and finished but Globus didn't pick up on this. I suppose it's possible the job never ran, but that's unlikely because I think Globus would report PENDING.
>
> Anyway, this brings me to question (3): what do I do in this situation, when Globus loses track of a job's state in SGE? I hate to cancel the job and trigger Condor to re-run it. Is there a way to request that the globus-job-manager check this job and, if it's not on the cluster, signal that it's done? I can write code that monitors my running Condor/Globus jobs and looks for those that aren't running on the cluster but still respond as ACTIVE, but this just feels like a total hack. These all sound like horrible options to me. So my extended question is: why is this happening at all? I thought the whole point of using Globus here was being able to submit jobs to a scheduler and not having to worry about issues like this. Are my expectations too high? Is it normal to have jobs lost track of? Is it common to have to monitor jobs across the various layers and manually intervene?

The poll method in the sge.pm module is run periodically for all jobs that GRAM knows about. If you set GRAM to log DEBUG messages, you might see some information from that script letting you know what it is doing. See http://www.globus.org/toolkit/docs/5.2/5.2.0/gram5/admin/#id2483565 for the log options (there's a rough sketch of that configuration at the end of this message).

> Thanks very much for any guidance, suggestions, hints, or comments you can provide. Any tips you can give me that will help us get the software stack stable will be very welcome!!
>
> --Brian O'Connor
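One last note on the DEBUG logging mentioned for question (3): on a 5.2 install the job manager reads its extra options from /etc/globus/globus-gram-job-manager.conf, and adding DEBUG to the log levels there should be enough. The option name and its pipe-separated value below are from memory, so double-check them against the admin guide linked above:

    -log-levels FATAL|ERROR|WARN|INFO|DEBUG

Job managers started after the change will pick it up. DEBUG is fairly chatty, so keep an eye on log growth while you have a few hundred jobs in flight.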
