Hi,

I'm in the process of stress testing a system for use in production.
The software stack is:

pegasus workflows -> condor-G -> GRAM -> SGE

I recently upgraded to GT 5.2 to try to address stability issues.  I
was having a whole series of problems with the globus-job-manager:
becoming unresponsive, dying, opening too many file handles, and so
on.  Sorry in advance if I ask some stupid questions below; I'm still
trying to wrap my mind around the whole software stack and the best
way to set it up.  I'm not submitting tens of thousands of jobs here,
no more than about 500, so I feel I should be able to get this system
rock-solid, yet it has been very flaky for me, which is troubling.

So this brings up my first question:

1) What do you do when the globus-job-manager dies?  It seems like a
very critical daemon: the contact URLs I get back when I manually use
globus-job-submit no longer work, and I end up with orphaned jobs
running on the cluster that I have no way to shut down through
Globus.  How do I "clean up" after a globus-job-manager crash (or a
server reboot), and is there a way to trigger a new instance to start
up and watch over the jobs left running on the cluster?  From the
calling client's perspective, how would it reconnect with these lost
jobs when the job URL is no longer valid?  Has anyone had experience
with the added complexity of Condor-G and how it deals with a
globus-job-manager failure?
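
For what it's worth, the only "cleanup" I know of right now is
completely manual, roughly the following (this assumes everything
GRAM submits runs under my own seqware account, and that I'm willing
to resubmit the affected workflows anyway):

# see what's still sitting on the queue under my account
qstat -u seqware

# if it's all orphaned, clear it out wholesale (the first two lines of
# qstat output are headers)
qstat -u seqware | awk 'NR > 2 {print $1}' | xargs -r qdel

Obviously that throws away any work in progress, which is why I'd
love to know whether a freshly started job manager can be made to
re-attach to those jobs instead.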

My second question is related to the SEG.  I know the SEG is the
better way to go, but I'm currently using polling, since our sysadmin
and I weren't able to get the SEG to work with SGE.  When we deployed
it, we saw it parse through the SGE reporting log and, once it hit
the end, start looking for a reporting.0 file rather than simply
waiting for more data to be written to the log.  There is no
reporting.0 file, so the event generator just sat there looking for
it instead of watching the real log file.  So, question (2): has
anyone seen problems with the SEG module for SGE looking for a
non-existent rotated reporting file and, as a result, missing new job
events?  Is there a way to explicitly tell the SEG to read only one
reporting file and not to look for rotated log files?
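
In case it helps anyone reproduce this, it's easy to watch the SEG do
this by tracing the process; a rough sketch (the process name is what
GT 5.2 installed for us, so treat that as an assumption about your
layout):

# find the running SEG process
SEG_PID=$(pgrep -f globus-scheduler-event-generator | head -n 1)

# watch which files it opens/stats; in our case, after reaching the end
# of the real reporting file it kept probing for a nonexistent reporting.0
strace -f -e trace=file -p "$SEG_PID"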

OK, since I'm using polling (at least until we get the SEG working),
that brings me to the third and final question.  I submitted about 25
workflows via Condor DAGMan.  At any one time there are between 1 and
20 jobs running per workflow.  The load on the host (which is our
Condor machine, GRAM server, and SGE submission host) is only slight.
However, when I do:

condor_q -long 6731

and look at a job that should have finished hours ago, I see the Globus contact URL is:

https://sqwstage:59384/16217785859533749891/5764607537333819270/

So I check the status:

globus-job-status https://sqwstage:59384/16217785859533749891/5764607537333819270/
ACTIVE

So Globus (and Condor by extension) thinks the job is running.  It
shouldn't be, though, since this is a fast-running job.

I then look at every job running on the SGE cluster:

for i in `qstat | grep seqware | awk '{print $1}'`; do qstat -j $i |
grep '16217785859533749891.5764607537333819270'; done

And I get nothing; the job is not running on the cluster.  I think
that's because it ran and finished but Globus didn't pick up on it.
I suppose it's possible the job never ran, but that seems unlikely,
since I'd expect Globus to report PENDING in that case.

Anyway, this brings me to question (3): what do I do in this
situation, when Globus loses track of a job's state in SGE?  I hate
to cancel the job and trigger Condor to re-run it.  Is there a way to
ask the globus-job-manager to check on this job and, if it's not on
the cluster, mark it as done?  I could write code that monitors my
running Condor/Globus jobs and looks for those that report ACTIVE but
aren't actually running on the cluster, but that feels like a total
hack.  These all sound like horrible options to me.  So my extended
question is: why is this happening at all?  I thought the whole point
of using Globus here was to be able to submit jobs to a scheduler
without having to worry about issues like this.  Are my expectations
too high?  Is it normal for jobs to get lost track of?  Is it
expected that I have to monitor jobs across the various layers and
intervene manually?
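
Just to make the "hack" concrete, below is a sketch of the sort of
monitor I have in mind.  It assumes the GRAM contact URL can be
pulled out of condor_q via the GridJobId attribute and that the GRAM
unique id shows up somewhere in qstat -j output (the same assumption
my one-liner above makes), so please read it as a sketch rather than
something I'd want to run in production:

#!/bin/bash
# For every Condor-G job still in the queue, ask GRAM for its status
# and check whether a matching job actually exists on the SGE cluster.
condor_q -format '%s\n' GridJobId |
awk '{for (i = 1; i <= NF; i++) if ($i ~ /^https:/) print $i}' |
while read -r gram_url; do
    status=$(globus-job-status "$gram_url")
    [ "$status" = "ACTIVE" ] || continue

    # e.g. https://sqwstage:59384/16217785859533749891/5764607537333819270/
    # becomes 16217785859533749891.5764607537333819270
    uid=$(echo "$gram_url" | awk -F/ '{print $(NF-2) "." $(NF-1)}')

    on_cluster=no
    for j in `qstat | grep seqware | awk '{print $1}'`; do
        if qstat -j "$j" 2>/dev/null | grep -q "$uid"; then
            on_cluster=yes
            break
        fi
    done

    if [ "$on_cluster" = no ]; then
        echo "ACTIVE in GRAM but not on the cluster: $gram_url"
    fi
done

Even if that works, it only papers over whatever is actually going
wrong between GRAM and SGE.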

Thanks very much for any guidance, suggestions, hints, or comments
you can provide.  Anything that helps us get this software stack
stable will be very welcome!

--Brian O'Connor
