Hi Joseph,

Thanks very much for your email.
We actually just had a failure (reboot) of our Globus box a couple of hours
ago, so this gets right at my question below about how to clean up after a
failure. Now that the machine has come back up, I see a ton of
globus-job-manager processes running as my "seqware" user (the one that
originally submitted the Globus jobs):

11517 seqware  16   0 30180 3532 1500 R  4.3  0.1   0:00.80 top
 3530 seqware  16   0 99372 3812 3068 S  0.6  0.1   0:02.62 globus-job-mana
 3532 seqware  15   0 99372 3812 3068 S  0.6  0.1   0:02.51 globus-job-mana
 3553 seqware  16   0 99372 3816 3068 S  0.6  0.1   0:02.53 globus-job-mana
 3591 seqware  15   0 99372 3808 3068 S  0.6  0.1   0:02.54 globus-job-mana
 3716 seqware  15   0 99372 3812 3068 S  0.6  0.1   0:02.39 globus-job-mana
 3739 seqware  15   0 99372 3812 3068 S  0.6  0.1   0:02.47 globus-job-mana

[seqware@sqwprod ~]$ ps aux | grep globus-job-manager | grep seqware | wc -l
1837

So there are 1837 of these daemons running, and I can no longer submit a
cluster job using:

globus-job-run sqwprod/jobmanager-sge /bin/hostname

It just hangs. I think this is because there is a lock file:

~/.globus/job/sqwprod.hpc.oicr.on.ca/sge.4572dcea.lock

My questions are:

1) What's the proper way to reset here: kill all the globus-job-managers,
remove the lock, and allow the job manager to respawn?

2) Why doesn't globus-job-manager (or the gatekeeper) look at
sge.4572dcea.pid and realize the previous globus-job-manager is dead?
Shouldn't it detect this, clean up its state, and launch a single
replacement?
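
In case it helps you see where my head is at, this is roughly the manual
reset I was about to try. I honestly don't know whether removing the lock
file by hand is safe, or whether it will break recovery of the jobs still
out on the cluster, so please correct me:

# kill all the leftover job managers owned by the submitting user
pkill -u seqware -f globus-job-manager

# remove what I assume is a stale lock left by the pre-reboot job manager
rm ~/.globus/job/sqwprod.hpc.oicr.on.ca/sge.4572dcea.lock

# see whether a fresh job manager starts and submissions work again
globus-job-run sqwprod/jobmanager-sge /bin/hostname

Is that roughly the right idea, or is there a supported command that does
this cleanup properly?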

Thanks for your help. I really appreciate it!

--Brian

On Mon, Feb 6, 2012 at 11:47 AM, Joseph Bester <[email protected]> wrote:
> On Feb 6, 2012, at 1:43 AM, Brian O'Connor wrote:
>> Hi,
>>
>> I'm in the process of stress testing a system for use in production.
>> The software stack is:
>>
>> pegasus workflows -> condor-G -> GRAM -> SGE
>>
>> I recently upgraded to GT 5.2 to attempt to address stability issues.
>> I was having a whole series of issues with the globus-job-manager
>> either becoming unresponsive, dying, opening too many file handles,
>> etc. Sorry in advance if I ask some stupid questions below; I'm
>> still trying to wrap my mind around the whole software stack and the
>> best way to set it up. I'm not submitting tens of thousands of jobs
>> here... something on the order of <500 max. So I feel that I should
>> be able to get this system rock-solid, yet it's been super flaky for
>> me, which is troubling.
>>
>> So this brings up my first question:
>>
>> 1) What do you do when the globus-job-manager dies? It seems like
>> this is a very critical daemon: the URLs I get back if I manually use
>> globus-job-submit no longer work, and I end up with random jobs
>> running on the cluster that I have no way to shut down. My question
>> here is how do I "clean up" after a globus-job-manager crash (or a
>> server reboot), and is there a way to trigger a new instance to start
>> up and watch over the jobs left running on the cluster? From the
>> calling client's perspective, how would they reconnect with these
>> lost jobs when the job URL is no longer valid? Has anyone had any
>> experience with the added complexity of Condor-G and how it deals
>> with a globus-job-manager failure?
>
> GRAM-wise, it's possible to get the new handle by submitting a restart
> job with the old job handle. This will work even if the old job manager
> is running, in which case it will return the old handle.
>
> If any operation that causes the job manager to get started (submit,
> restart, version check) detects that the job manager is not running, it
> will make sure one is started to monitor existing jobs (as long as it
> has a valid proxy). It will automatically clean up jobs if the client
> has done the two-phase commit end. For 5.2.1, we're working on having
> some expiration time for missed two-phase commits so that a job
> completely abandoned by a user will eventually get cleaned up.
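
(Interjecting here to make sure I understand the restart mechanics: after
today's reboot, would I get the handle back, and get a job manager
respawned to watch the leftover SGE jobs, with something like the
following? I'm guessing at the globusrun/RSL syntax and just reusing the
stale handle from one of my stuck jobs as an example:

globusrun -r sqwstage/jobmanager-sge '&(restart=https://sqwstage:59384/16217785859533749891/5764607537333819270/)'

If that's roughly right, I'll try it before killing anything by hand.)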
>
> So, Condor should be able to get a new handle to the job by submitting
> a restart job with the old job handle. I'm not too sure about the
> specifics of the Condor-G case, but it seems to do a lot of job manager
> restarts, so I'd be a little surprised if it isn't capable of getting
> the handle back. Maybe ask on a Condor list about what you are seeing.
>
> Of course, if the job manager is crashing, please report a problem to
> jira.globus.org so we can fix it.
>
>> My second question is related to SEG. I know SEG is the better way to
>> go. I'm currently using polling, though, since our sysadmin and I
>> weren't able to get SEG to work with SGE. When deployed, we saw it
>> parse through the SGE log and, once it hit the end, it started looking
>> for a reporting.0 file rather than just waiting for more data to be
>> written to the log file. There is no reporting.0 file, so the event
>> generator just sat there and kept looking for it rather than watching
>> the real log file. So, question (2): has anyone seen problems with the
>> SEG module for SGE looking for a non-existent reporting file and,
>> therefore, missing new job events? Is there any way to explicitly
>> tell SEG to read only one reporting file and not to try looking for
>> rotated log files?
>
> Based on a quick reading of the code, it looks like it should recover
> back to reading the main log file after processing any rotated one. One
> potential hangup with the SGE module is that it relies on SGE writing
> the reporting file, but not processing that file with the dbwriter
> process, as that alters the contents of the reporting file in a way
> that confuses the SEG module.
>
>> OK, since we're using polling (at least until we get SEG working),
>> that brings me to the third and final question. I submitted about 25
>> workflows via Condor DAGMan. At any one time there are between 1 and
>> 20 jobs running per workflow. The load on the host (which is our
>> Condor machine, GRAM server, and SGE submission host) is only slight.
>> However, when I do:
>>
>> condor_q -long 6731
>>
>> and look at a job that should have finished hours ago, I see the
>> Globus URL is:
>>
>> https://sqwstage:59384/16217785859533749891/5764607537333819270/
>>
>> So I check the status:
>>
>> globus-job-status https://sqwstage:59384/16217785859533749891/5764607537333819270/
>> ACTIVE
>>
>> So Globus (and Condor by extension) think the job is running. It
>> shouldn't be, though, since this is a fast-running job.
>>
>> I then look at every job running on the SGE cluster:
>>
>> for i in `qstat | grep seqware | awk '{print $1}'`; do qstat -j $i | grep '16217785859533749891.5764607537333819270'; done;
>>
>> And I get nothing... the job is not running on the cluster. I think
>> it's because it ran and finished but Globus didn't pick up on this. I
>> suppose it's possible the job never ran, but that's unlikely because I
>> think Globus would report PENDING.
>>
>> Anyway, this brings me to question (3): what do I do in this
>> situation, when Globus loses track of a job's state in SGE? I hate to
>> cancel the job and trigger Condor to re-run it. Is there a way to
>> request that the globus-job-manager check this job and, if it's not on
>> the cluster, signal that it's done? I can write code that monitors my
>> running Condor/Globus jobs and looks for those that aren't running on
>> the cluster but still respond as ACTIVE, but this just feels like a
>> total hack. These all sound like horrible options to me. So my
>> extended question is: why is this happening at all? I thought the
>> whole point of using Globus here is being able to submit jobs to a
>> scheduler and not having to worry about issues like this. Are my
>> expectations too high? Is it normal to have jobs get lost track of? Is
>> it expected that you have to monitor jobs across the various layers
>> and manually intervene?
>
> The poll method in the sge.pm module is run periodically for all jobs
> that GRAM knows about. If you set GRAM to log DEBUG messages, you might
> see some info from that script to let you know what it is doing. See
> http://www.globus.org/toolkit/docs/5.2/5.2.0/gram5/admin/#id2483565
> for dealing with the log options.
>
>> Thanks very much for any guidance, suggestions, hints, or comments you
>> can provide. Any tips you can give me that will help us get the
>> software stack stable will be very welcome!!
>>
>> --Brian O'Connor
>
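
P.S. In my original mail (quoted above) I mention that I could write a
watchdog that flags jobs GRAM still reports as ACTIVE but that SGE no
longer knows about. In case it clarifies what I mean (and why it feels
like such a hack), here's the rough shape of it. I'm guessing that the
GRAM contact is the last field of Condor's GridJobId attribute, and I'm
reusing the same qstat -j grep trick from above to match contacts to SGE
jobs:

#!/bin/bash
# Rough watchdog sketch: flag jobs that GRAM reports as ACTIVE but that
# don't appear anywhere in SGE.

# Guess: the GRAM contact URL is the last whitespace-separated field of
# Condor's GridJobId attribute.
condor_q -long | awk -F' = ' '/^GridJobId/ {gsub(/"/, "", $2); print $2}' | awk '{print $NF}' |
while read -r contact; do
    [ "$(globus-job-status "$contact")" = "ACTIVE" ] || continue

    # Guess: the two big numbers in the contact URL are what shows up in
    # the qstat -j output (the same string I grepped for above).
    tag=$(echo "$contact" | awk -F/ '{print $(NF-2) "." $(NF-1)}')

    found=no
    for i in `qstat | grep seqware | awk '{print $1}'`; do
        if qstat -j "$i" 2>/dev/null | grep -q "$tag"; then
            found=yes
            break
        fi
    done

    if [ "$found" = "no" ]; then
        echo "WARNING: $contact is ACTIVE according to GRAM but not found in SGE"
    fi
done

I'd much rather not run something like this in production, which is why
I'm hoping there's a proper fix on the GRAM/SEG side.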
